Python Automated Machine Learning library for tabular data.


Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Science tasks.
📚 Explore the docs »

🐞 Report Bug · 🆕 Request Feature

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

Disclaimer

This library is an open-source research project and is not part of any official SAP products.

What's this?

This is a simple but accurate Automated Machine Learning library. Built on SAP HANA's powerful in-memory algorithms, it provides high accuracy across multiple machine learning tasks. The library also includes numerous data preprocessing functions that automate routine data-cleaning work, so hana_automl walks through every AutoML step and makes Data Science easier.

What is SAP HANA?

From www.sap.com: SAP HANA is a high-performance in-memory database that speeds data-driven, real-time decisions and actions.

Web app

https://share.streamlit.io/dan0nchik/sap-hana-automl/main/web.py

Documentation

https://sap-hana-automl.readthedocs.io/en/latest/index.html

Benchmarks

https://github.com/dan0nchik/SAP-HANA-AutoML/blob/main/comparison_openml.ipynb

ML tasks:

  • Binary classification
  • Regression
  • Multiclass classification
  • Forecasting

Steps automated:

  • Data exploration
  • Data preparation
  • Feature engineering
  • Model selection
  • Model training
  • Hyperparameter tuning

👇 By the end of summer 2021, the blue part of the pipeline diagram will be fully automated by our library.

Clients

Streamlit client


Getting Started

To get the package up and running, follow these simple steps.

Prerequisites

Make sure you have the following:

  1. ✅ Set up SAP HANA (skip this step if you already have an instance with PAL enabled). There are 2 ways to do that.
    In HANA Cloud:

    • Create a free trial account
    • Set up an instance
    • Enable PAL (Predictive Analysis Library). This is essential, because our library relies on its algorithms.

    In a virtual machine:

    • Rent a virtual machine in Azure, AWS, Google Cloud, etc.
    • Install a HANA instance there or on your PC (if you have >32 GB RAM).
    • Enable PAL (Predictive Analysis Library). This is essential, because our library relies on its algorithms.
  2. ✅ Installed software

  • Python > 3.6
    Skip this step if python --version returns > 3.6 (or run the quick check after this list)
  • Cython
    pip3 install Cython
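
Here is a quick, optional sanity check of these prerequisites from Python (a minimal sketch; it only assumes Cython was installed with pip3 install Cython):

import sys

# The library requires Python newer than 3.6
assert sys.version_info >= (3, 7), f"Python > 3.6 is required, found {sys.version.split()[0]}"

# Cython should be importable after `pip3 install Cython`
import Cython
print("Python", sys.version.split()[0], "| Cython", Cython.__version__)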

Installation

There are 2 ways to install the library:

  • Stable: from PyPI
    pip3 install hana_automl
  • Latest: from the repository
    pip3 install https://github.com/dan0nchik/SAP-HANA-AutoML/archive/dev.zip
    Note: the latest version may contain bugs, so be careful!
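
To confirm the installation, you can print the installed version. A minimal sketch (importlib.metadata needs Python 3.8+, and the distribution name hana_automl is assumed to match the PyPI package):

import hana_automl  # a successful import already confirms the package is installed

from importlib.metadata import version
print("hana_automl", version("hana_automl"))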

After installation

Check that PAL (Predictive Analysis Library) is installed and the required roles are granted:

  • Read the docs section about that.
  • If you don't want to read the docs, run this code:
    from hana_automl.utils.scripts import setup_user
    from hana_ml.dataframe import ConnectionContext

    # Connection with privileges to create users and grant roles
    cc = ConnectionContext(address='address', user='user', password='password', port=39015)

    # Replace with the credentials of the user that will be created or granted the role to run PAL
    setup_user(connection_context=cc, username='user', password="password")
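
After running it, you can optionally double-check that PAL functions are registered on the instance. A minimal sketch, assuming the standard SYS.AFL_FUNCTIONS system view is readable by your user:

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='address', user='user', password='password', port=39015)

# A non-zero count means the PAL plug-in is installed on this instance
pal_count = cc.sql("SELECT COUNT(*) AS N FROM SYS.AFL_FUNCTIONS WHERE AREA_NAME = 'AFLPAL'").collect()
print(pal_count)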

Usage

From code

Our library in a few lines of code:

Connect to the database.

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='address',
                       user='username',
                       password='password',
                       port=1234)
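
Optionally, verify that the connection works before fitting. A minimal sketch using HANA's built-in DUMMY table:

# A successful round trip confirms the address, credentials and port are correct
print(cc.sql("SELECT * FROM DUMMY").collect())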

Create an AutoML model and fit it.

from hana_automl.automl import AutoML

model = AutoML(cc)
model.fit(
  file_path='path to training dataset', # it may be HANA table/view, or pandas DataFrame
  steps=10, # number of iterations
  target='target', # column to predict
  time_limit=120 # time limit in seconds
)

Predict.

model.predict(
    file_path='path to test dataset',
    id_column='ID',
    verbose=1
)

For more examples, please refer to the Documentation.

How to run the Streamlit client

  1. Clone the repository: git clone https://github.com/dan0nchik/SAP-HANA-AutoML.git
  2. Install the dependencies: pip3 install -r requirements.txt
  3. Run the GUI: streamlit run ./web.py

Roadmap

See the open issues for a list of proposed features (and known issues). Feel free to report any bugs :)

Contributing

Any contributions you make are greatly appreciated 👍!

  1. Fork the Project

  2. Create your Feature Branch (git checkout -b feature/NewFeature)

  3. Install dependencies

    pip3 install Cython
    pip3 install -r requirements.txt
  4. Create a credentials.py file in the tests directory. Your files should look like this:

    SAP-HANA-AutoML
    │   README.md
    │   all other files
    │   .....
    │
    └───tests
        │   test files...
        │   credentials.py

    Copy and paste this snippet there and fill in your credentials:

    host = "host"
    user = "username"
    password = "password"
    port = 39015 # or any port you need
    schema = "your schema"

    Don't worry, this file is in .gitignore, so your credentials won't be seen by anyone (a sketch of how tests can use these values appears after this list).

  5. Make some changes

  6. Write tests that cover your code in tests directory

  7. Run tests (under SAP-HANA-AutoML directory)

    pytest
  8. Commit your changes (git commit -m 'Add some amazing features')

  9. Push to the branch (git push origin feature/NewFeature)

  10. Open a Pull Request
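
Regarding step 4: here is a minimal sketch of how a test can build a connection from credentials.py (the exact import path is an assumption that depends on how pytest resolves the tests directory, so mirror the existing tests):

from hana_ml.dataframe import ConnectionContext

import credentials  # hypothetical import; adjust to how the existing tests load it

# Reuse the shared credentials so tests run against your own HANA instance
cc = ConnectionContext(
    address=credentials.host,
    user=credentials.user,
    password=credentials.password,
    port=credentials.port,
)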

License

Distributed under the MIT License. See LICENSE for more information.
Don't really understand the license? Check out the MIT license summary.

Contact

Authors: @While-true-codeanything, @DbusAI, @dan0nchik

Project Link: https://github.com/dan0nchik/SAP-HANA-AutoML
