A benchmark of data-centric tasks from across the machine learning lifecycle.


⚡️ Quickstart

pip install dcbench

Optional: some parts of dcbench rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so with a command such as pip install dcbench[dev] instead. See setup.py for a full list of optional dependencies.

Installing from dev: pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/[email protected]"
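
To confirm which release pip actually installed, the following is a minimal sketch that uses only the standard library. It assumes Python 3.8 or newer, where importlib.metadata is available; the helper name installed_version is hypothetical and not part of dcbench.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "dcbench" is the distribution name used in the pip command above.
print(installed_version("dcbench"))
```

A result of None means that pip did not install the package into the interpreter running this code, which is a common cause of import errors in notebook environments.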

Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:

import dcbench
dcbench.tasks

To learn more, follow the walkthrough in the docs.

💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. This includes feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data – not training models. dcbench supports a growing list of these tasks.

dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data cleaning task. However, we think it important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse, but related, tasks. In dcbench all of these tasks are structured in a similar manner and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
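
The idea of a shared structure across tasks can be illustrated with a toy sketch. Everything here (ToyProblem, ToySolution, the two toy tasks, and run_all) is hypothetical and invented for illustration; it is not dcbench's actual API, only one way that tasks with different inputs and outputs could share a single evaluation loop.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToySolution:
    """Hypothetical container for a method's output on one problem."""
    values: list


class ToyProblem(ABC):
    """Hypothetical shared interface: every task exposes evaluate()."""

    @abstractmethod
    def evaluate(self, solution: ToySolution) -> dict:
        ...


class ToyCleaningProblem(ToyProblem):
    """Scores a cleaned column against known clean values."""

    def __init__(self, clean):
        self.clean = clean

    def evaluate(self, solution):
        correct = sum(a == b for a, b in zip(solution.values, self.clean))
        return {"accuracy": correct / len(self.clean)}


class ToySelectionProblem(ToyProblem):
    """Scores a selected subset against a set of relevant items."""

    def __init__(self, relevant):
        self.relevant = set(relevant)

    def evaluate(self, solution):
        chosen = set(solution.values)
        return {"recall": len(chosen & self.relevant) / len(self.relevant)}


def run_all(pairs):
    """One evaluation loop serves every task type."""
    return [problem.evaluate(solution) for problem, solution in pairs]


results = run_all([
    (ToyCleaningProblem(clean=["a", "b"]), ToySolution(["a", "x"])),
    (ToySelectionProblem(relevant=[1, 2]), ToySolution([1, 2, 3])),
])
print(results)
```

The evaluation script never needs to know which kind of task it is scoring, which is the property the paragraph above describes.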

✉️ About

dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!


Comments
  •  No module named 'dcbench.tasks.budgetclean.cpclean'

    After installing dcbench in a Google Colab environment, the error above was raised by import dcbench. Full traceback:

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-8-a1030f6d7ef9> in <module>()
          1 
    ----> 2 import dcbench
          3 dcbench.tasks
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/dcbench/__init__.py in <module>()
         13 )
         14 from .config import config
    ---> 15 from .tasks.budgetclean import BudgetcleanProblem
         16 from .tasks.minidata import MiniDataProblem
         17 from .tasks.slice_discovery import SliceDiscoveryProblem
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/__init__.py in <module>()
          3 from ...common import Task
          4 from ...common.table import Table
    ----> 5 from .baselines import cp_clean, random_clean
          6 from .common import Preprocessor
          7 from .problem import BudgetcleanProblem, BudgetcleanSolution
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/baselines.py in <module>()
          6 from ...common.baseline import baseline
          7 from .common import Preprocessor
    ----> 8 from .cpclean.algorithm.select import entropy_expected
          9 from .cpclean.algorithm.sort_count import sort_count_after_clean_multi
         10 from .cpclean.clean import CPClean, Querier
    
    ModuleNotFoundError: No module named 'dcbench.tasks.budgetclean.cpclean'
    

    !pip install dcbench produced the following log:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
    flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.0.3 which is incompatible.
    datascience 0.10.6 requires coverage==3.7.1, but you have coverage 6.2 which is incompatible.
    datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
    coveralls 0.5 requires coverage<3.999,>=3.6, but you have coverage 6.2 which is incompatible.
    Successfully installed SecretStorage-3.3.1 aiohttp-3.8.1 aiosignal-1.2.0 antlr4-python3-runtime-4.8 async-timeout-4.0.2 asynctest-0.13.0 black-21.12b0 cfgv-3.3.1 click-8.0.3 colorama-0.4.4 commonmark-0.9.1 coverage-6.2 cryptography-36.0.1 cytoolz-0.11.2 dataclasses-0.6 datasets-1.17.0 dcbench-0.0.4 distlib-0.3.4 docformatter-1.4 flake8-4.0.1 frozenlist-1.2.0 fsspec-2021.11.1 future-0.18.2 fuzzywuzzy-0.18.0 fvcore-0.1.5.post20211023 huggingface-hub-0.2.1 identify-2.4.1 importlib-metadata-4.2.0 iopath-0.1.9 isort-5.10.1 jeepney-0.7.1 jsonlines-3.0.0 keyring-23.4.0 livereload-2.6.3 markdown-3.3.4 mccabe-0.6.1 meerkat-ml-0.2.3 multidict-5.2.0 mypy-extensions-0.4.3 nbsphinx-0.8.8 nodeenv-1.6.0 omegaconf-2.1.1 parameterized-0.8.1 pathspec-0.9.0 pkginfo-1.8.2 platformdirs-2.4.1 pluggy-1.0.0 portalocker-2.3.2 pre-commit-2.16.0 progressbar-2.5 pyDeprecate-0.3.1 pycodestyle-2.8.0 pyflakes-2.4.0 pytest-6.2.5 pytest-cov-3.0.0 pytorch-lightning-1.5.7 pyyaml-6.0 readme-renderer-32.0 recommonmark-0.7.1 requests-toolbelt-0.9.1 rfc3986-1.5.0 sphinx-autobuild-2021.3.14 sphinx-rtd-theme-1.0.0 torchmetrics-0.6.2 twine-3.7.1 typed-ast-1.5.1 ujson-5.1.0 untokenize-0.1.1 virtualenv-20.12.1 xxhash-2.0.2 yacs-0.1.8 yarl-1.7.2
    WARNING: The following packages were previously imported in this runtime:
      [pydevd_plugins]
    You must restart the runtime in order to use newly installed versions.
    

    python version : 3.7.12 platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

    opened by mathav95raj 2
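
One way to narrow down an error like the one above is to check whether the installed distribution contains the subpackage on disk, without importing dcbench itself (which is what fails). The sketch below is an assumption-laden diagnostic, not a fix from the maintainers; subpackage_dir is a hypothetical helper, and the dotted path is taken from the traceback.

```python
import importlib.util
from pathlib import Path


def subpackage_dir(package, relative):
    """Return the directory of a subpackage on disk, or None if it is missing.

    Only the top-level package is located; it is not executed, so a broken
    __init__.py does not get in the way.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = Path(root, *relative.split("."))
        if candidate.is_dir():
            return candidate
    return None


# Dotted path copied from the ModuleNotFoundError in the report above.
print(subpackage_dir("dcbench", "tasks.budgetclean.cpclean"))
```

If this prints None while dcbench is installed, the directory is absent from the installed files, which would point to a packaging problem rather than to the notebook environment.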
  • Slice discovery problem p_72411 misses files

    Hi,

    Thanks for this great tool!

    I'm loading slice discovery problems; however, problem p_72411 is missing files. Could you fix this slice discovery problem?

    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.dcbench/slice_discovery/problem/artifacts/p_72411/test_predictions.mk/meta.yaml'
    
    opened by duguyue100 0
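
To see which downloaded files are absent before loading a problem, a small check such as the following can help. It is a sketch under assumptions: the cache root ~/.dcbench and the relative path are inferred from the error message above, only the one file named there is known to be expected, and missing_files is a hypothetical helper that is not part of dcbench.

```python
from pathlib import Path


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root).expanduser()
    return [p for p in relative_paths if not (root / p).is_file()]


# Path taken from the FileNotFoundError in the report above.
expected = [
    "slice_discovery/problem/artifacts/p_72411/test_predictions.mk/meta.yaml",
]
print(missing_files("~/.dcbench", expected))
```

An empty list means every listed file is present locally; any path that is printed was either never downloaded or is missing from the hosted artifact.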
Releases(v-0.0.1-beta)