An easy-to-use feature store

Overview

ByteHub PyPI Latest Release Issues Issues Code style: black

ByteHub logo

An easy-to-use feature store.

💾 What is a feature store?

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The Bytehub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised towards timeseries operations, making it highly suited to applications such as those in finance, energy, forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

  • Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
  • Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
  • Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
  • Feature engineering steps can be implemented as transforms. These are saved within the feature store, and allows for simple, resusable preparation of raw data.
  • Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
  • Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values, which can be used for inference.

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # This transform function receives dataframe input, and defines a transform operation
    return df ** 2 # Square the input

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

  • Tasks to automate updates to features using orchestration tools like Airflow
Owner
ByteHub AI
ByteHub AI
Convert tables stored as images to an usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

711 Dec 26, 2022
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
Data Science Environment Setup in single line

datascienv is package that helps your to setup your environment in single line of code with all dependency and it is also include pyforest that provide single line of import all required ml libraries

Ashish Patel 55 Dec 16, 2022
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

MatrixProfile MatrixProfile is a Python 3 library, brought to you by the Matrix Profile Foundation, for mining time series data. The Matrix Profile is

Matrix Profile Foundation 302 Dec 29, 2022
Employee Turnover Analysis

Employee Turnover Analysis Submission to the DataCamp competition "Can you help reduce employee turnover?"

Jannik Wiedenhaupt 1 Feb 13, 2022
INFO-H515 - Big Data Scalable Analytics

INFO-H515 - Big Data Scalable Analytics Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group Exercise clas

Yann-Aël Le Borgne 58 Dec 11, 2022
ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

Juan Nicolas Serrano 0 Jul 07, 2021
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 150 Jan 06, 2023
A meta plugin for processing timelapse data timepoint by timepoint in napari

napari-time-slicer A meta plugin for processing timelapse data timepoint by timepoint. It enables a list of napari plugins to process 2D+t or 3D+t dat

Robert Haase 2 Oct 13, 2022
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
Python Project on Pro Data Analysis Track

Udacity-BikeShare-Project: Python Project on Pro Data Analysis Track Basic Data Exploration with pandas on Bikeshare Data Basic Udacity project using

Belal Mohammed 0 Nov 10, 2021
Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
Python Library for learning (Structure and Parameter) and inference (Statistical and Causal) in Bayesian Networks.

pgmpy pgmpy is a python library for working with Probabilistic Graphical Models. Documentation and list of algorithms supported is at our official sit

pgmpy 2.2k Dec 25, 2022
Automated Exploration Data Analysis on a financial dataset

Automated EDA on financial dataset Just a simple way to get automated Exploration Data Analysis from financial dataset (OHLCV) using Streamlit and ta.

Darío López Padial 28 Nov 27, 2022
🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

Fabian Gebhart 13 Dec 15, 2022
Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations.

Elicited Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations. Credit to Brett Hoove

Ryan McGeehan 3 Nov 04, 2022
Investigating EV charging data

Investigating EV charging data Introduction: Got an opportunity to work with a home monitoring technology company over the last 6 months whose goal wa

Yash 2 Apr 07, 2022