A machine learning template for projects based on the scikit-learn library.

Overview

Scikit-learn-project-template

About the project

  • Folder structure suitable for many machine learning projects, especially those with a small amount of available training data.
  • .json config file support for convenient parameter tuning.
  • Customizable command line options for more convenient parameter tuning.
  • Abstract base classes for faster development:
    • BaseOptimizer handles execution of the grid search, saving and loading of models, and generation of train and test reports.
    • BaseDataLoader handles splitting of training and testing data. The split is performed according to the settings provided in the config file.
    • BaseModel handles construction of the consecutive pipeline steps defined in the config file.

Getting Started

To get a local copy up and running, follow the steps below.

Requirements

  • Python >= 3.7
  • Packages included in requirements.txt file
  • (Anaconda for easy installation)

Install dependencies

Create and activate a virtual environment:

conda create -n yourenvname python=3.7
conda activate yourenvname

Install packages:

python -m pip install -r requirements.txt

Folder Structure

sklearn-project-template/
│
├── main.py - main script to start training and (optionally) testing
│
├── base/ - abstract base classes
│   ├── base_data_loader.py
│   ├── base_model.py
│   └── base_optimizer.py
│
├── configs/ - holds configuration for training and testing
│   ├── config_classification.json
│   └── config_regression.json
│
├── data/ - default directory for storing input data
│
├── data_loaders/ - anything about data loading goes here
│   └── data_loaders.py
│
├── models/ - models
│   ├── __init__.py - defines models by name
│   └── models.py
│
├── optimizers/ - optimizers
│   └── optimizers.py
│
├── saved/ - config, model and reports are saved here
│   ├── Classification
│   └── Regression
│
├── utils/ - utility functions
│   ├── parse_config.py - class to handle config file and cli options
│   └── utils.py
│
├── wrappers/ - wrappers of modified sklearn models and self-defined transforms
│   ├── data_transformations.py
│   └── wrappers.py

Usage

Models in this repo are trained on two well-known datasets: iris and boston. The first is used for the classification problem and the second for the regression problem.

Run classification:

python main.py -c configs/config_classification.json

Run regression:

python main.py -c configs/config_regression.json

Config file format

Config files are in .json format. An example of such a config is shown below:

{
    "name": "Classification",   // session name

    "model": {
        "type": "Model",    // model name
        "args": {
            "pipeline": ["scaler", "PLS", "pf", "SVC"]     // pipeline of methods
        }
    },

    "tuned_parameters":[{   // parameters to be tuned with search method
                        "SVC__kernel": ["rbf"],
                        "SVC__gamma": [1e-5, 1e-6, 1],
                        "SVC__C": [1, 100, 1000],
                        "PLS__n_components": [1,2,3]
                    }],

    "optimizer": "OptimizerClassification",    // name of optimizer

    "search_method":{
        "type": "GridSearchCV",    // method used to search through parameters
        "args": {
            "refit": false,
            "n_jobs": -1,
            "verbose": 2,
            "error_score": 0
        }
    },

    "cross_validation": {
        "type": "RepeatedStratifiedKFold",     // type of cross-validation used
        "args": {
            "n_splits": 5,
            "n_repeats": 10,
            "random_state": 1
        }
    },

    "data_loader": {
        "type": "Classification",      // name of dataloader class
        "args":{
            "data_path": "data/path-to-file",    // path to data
            "shuffle": true,    // if data shuffled before optimization
            "test_split": 0.2,  // use split method for model testing
            "stratify": true,   // if data stratified before optimization
            "random_state":1    // random state for repeaded output
        }
    },

    "score": "max balanced_accuracy",     // mode and metrics used for scoring
    "test_model": true,     // if model is tested after training
    "save_dir": "saved/"    // directory of saved reports, models and configs

}

Additional parameters can be added to the config file. See the scikit-learn documentation for descriptions of the tuned parameters, search method and cross-validation. Possible metrics for model evaluation can be found here.

Pipeline

Methods added to the config pipeline must first be defined in the models/__init__.py file. For the previous config file example, the following must be added:

from wrappers import *
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

methods_dict = {
    'pf': PolynomialFeatures,
    'scaler': StandardScaler,
    'PLS': PLSRegressionWrapper,
    'SVC': SVC,
}

The majority of algorithms implemented in the scikit-learn library can be directly imported and used. Some algorithms need a small modification before usage; one such example is Partial Least Squares (PLS), whose modification is implemented in wrappers/wrappers.py. You can also implement your own method. An example wrapper for the Savitzky-Golay filter is shown in wrappers/data_transformations.py. The implementation must satisfy the standard method calls, e.g. fit(), transform(), etc.
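
As an illustration only, a minimal custom wrapper could look roughly like the sketch below. The class name, parameters and filtering axis are assumptions, not the template's actual code (see wrappers/data_transformations.py for that):

import numpy as np
from scipy.signal import savgol_filter
from sklearn.base import BaseEstimator, TransformerMixin


class SavgolWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper applying a Savitzky-Golay filter to each sample."""

    def __init__(self, window_length=7, polyorder=2):
        self.window_length = window_length
        self.polyorder = polyorder

    def fit(self, X, y=None):
        # Stateless transform: nothing to learn, but fit() must exist.
        return self

    def transform(self, X):
        # Smooth every sample (row) across its features.
        return savgol_filter(np.asarray(X), self.window_length, self.polyorder, axis=1)

Such a wrapper can then be registered in methods_dict (e.g. under a key like 'savgol') and referenced from the pipeline in the config file.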

Customization

Custom CLI options

Changing the values in the config file is a clean, safe and easy way of tuning hyperparameters. However, sometimes it is better to have command line options for values that need to be changed often or quickly.

This template uses the configuration stored in the .json file by default, but by registering custom options as follows you can change some of the values using CLI flags.

import collections

# simple class-like object having 3 attributes: `flags`, `type`, `target`.
CustomArgs = collections.namedtuple('CustomArgs', 'flags type target')
options = [
    CustomArgs(['-cv', '--cross_validation'], type=int, target='cross_validation;args;n_repeats'),
    # options added here can be modified by command line flags.
]

The target argument should be a sequence of keys used to access that option in the config dict. In this example, the target for the number of repeats in the cross-validation option is ('cross_validation', 'args', 'n_repeats'), because config['cross_validation']['args']['n_repeats'] points to the number of repeats.
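
With this option registered (and assuming main.py parses the custom options as in the template), the number of repeats can then be overridden directly from the command line, for example:

python main.py -c configs/config_classification.json --cross_validation 3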

Data Loader

  • Writing your own data loader
  1. Inherit BaseDataLoader

    BaseDataLoader handles:

    • Train/test procedure
    • Data shuffling
  • Usage

    Loaded data must be assigned to the data handler (dh) in an appropriate manner. If dh.X_data_test and dh.y_data_test are not assigned in advance, the train/test split can be created by the base data loader. If "test_split": 0.0 is set in the config file, the whole dataset is used for training. Another option is to assign both train and test sets as shown below; in this case the train data is used for optimization and the test data for evaluation of the model. A hedged sketch of a complete loader is shown at the end of this section.

    data_handler.X_data = X_train
    data_handler.y_data = y_train
    data_handler.X_data_test = X_test
    data_handler.y_data_test = y_test
  • Example

    Please refer to data_loaders/data_loaders.py for a data loading example.
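
    As a rough, hypothetical illustration, a custom loader might look like the sketch below. The import path, constructor signature and the dh attribute name are assumptions inferred from the description above; the real interface is defined in base/base_data_loader.py and demonstrated in data_loaders/data_loaders.py.

    import pandas as pd
    from base.base_data_loader import BaseDataLoader


    class CSVDataLoader(BaseDataLoader):
        """Hypothetical loader: reads a CSV whose last column is the target."""

        def __init__(self, data_path, **kwargs):
            super().__init__(**kwargs)  # assumed: the base class sets up the data handler
            df = pd.read_csv(data_path)
            # Assign features and target; the base class then performs the
            # (optional) train/test split according to the config settings.
            self.dh.X_data = df.iloc[:, :-1].values
            self.dh.y_data = df.iloc[:, -1].values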

Optimizer

  • Writing your own optimizer
  1. Inherit BaseOptimizer

    BaseOptimizer handles:

    • Optimization procedure
    • Model saving and loading
    • Report saving
  2. Implementing abstract methods

    You need to implement fitted_model(), which must return a fitted model. Optionally, you can customize the format of the train/test reports with create_train_report() and create_test_report(). A skeleton is sketched at the end of this section.

  • Example

    Please refer to optimizers/optimizers.py for an optimizer example.
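
    The outline below is only a hypothetical skeleton of such a subclass; the method names come from the description above, while the import path and class name are assumptions. The working implementation is in optimizers/optimizers.py.

    from base.base_optimizer import BaseOptimizer


    class MyOptimizer(BaseOptimizer):
        """Hypothetical skeleton of a custom optimizer."""

        def fitted_model(self):
            # Must return a model fitted on the training data.
            ...

        def create_train_report(self):
            # Optional: customize the formatting of the train report.
            ...

        def create_test_report(self):
            # Optional: customize the formatting of the test report.
            ...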

Model

  • Writing your own model
  1. Inherit BaseModel

    BaseModel handles:

    • Initialization defined in config pipeline
    • Modification of steps
  2. Implementing abstract methods

    You need to implement created_model(), which must return the created model; a minimal sketch is shown at the end of this section.

  • Usage

    Initialization of pipeline methods is performed with create_steps(). Steps can later be modified with change_step(). An example of how to change a step is shown below, where a sequential feature selector is added to the pipeline.

    # Requires numpy as np and, from scikit-learn: RandomForestRegressor,
    # TransformedTargetRegressor, SequentialFeatureSelector and Pipeline.
    def __init__(self, pipeline):
        # Build the steps listed in the config pipeline.
        steps = self.create_steps(pipeline)

        # Random forest fitted on log-transformed targets.
        rf = RandomForestRegressor(random_state=1)
        clf = TransformedTargetRegressor(regressor=rf,
                                         func=np.log1p,
                                         inverse_func=np.expm1)
        sfs = SequentialFeatureSelector(clf, n_features_to_select=2, cv=3)

        # Replace the 'sfs' placeholder step from the config with the selector.
        steps = self.change_step('sfs', sfs, steps)

        self.model = Pipeline(steps=steps)

    Beware that in this case 'sfs' needs to be added to the pipeline in the config file; otherwise, no step in the pipeline is changed.

  • Example

    Please refer to models/models.py for a model example.
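
    Building on the __init__() shown above, created_model() can plausibly be as simple as the hedged sketch below (this assumes the pipeline is stored in self.model, as in that snippet; models/models.py is the authoritative reference):

    def created_model(self):
        # Assumed: return the pipeline assembled in __init__().
        return self.model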

Roadmap

See open issues to request a feature or report a bug.

Contribution

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

How to start contributing:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Feel free to contribute any kind of function or enhancement.

License

This project is licensed under the MIT License. See LICENSE for more details.

Acknowledgements

This project is inspired by the project pytorch-template by Victor Huang. I would like to confess that some functions, the architecture and some parts of the readme were directly copied from that repo. But to be honest, what should I do - the project is absolutely amazing!

Consider supporting

Do you feel generous today? I am still a student and would make good use of some extra money :P
