An easy-to-use feature store

Overview

An easy-to-use feature store.

💾 What is a feature store?

A feature store is a data storage system for data science and machine learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The ByteHub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised towards timeseries operations, making it highly suited to applications in finance, energy, and forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

  • Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
  • Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
  • Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
  • Feature engineering steps can be implemented as transforms. These are saved within the feature store, allowing for simple, reusable preparation of raw data.
  • Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
  • Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values, which can be used for inference.
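As a rough illustration of the two retrieval styles in the last bullet, the sketch below loads a timeseries dataframe for training and then fetches the most recent values as a dictionary for inference. It assumes a feature named demo/price already exists in the store, and the last-value helper name (fs.last) is an assumption based on the feature description above rather than confirmed API.

import bytehub as bh

fs = bh.FeatureStore()

# Load a timeseries dataframe for training
# (assumes a feature named 'demo/price' has already been created and populated)
df_train = fs.load_dataframe(
    'demo/price', from_date='2021-01-01', to_date='2021-01-31'
)

# Fetch the most recent value of each feature as a dictionary for inference.
# Note: the method name fs.last is an assumption based on the description above.
latest = fs.last(['demo/price'])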

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.
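For example, the stored series could be resampled as it is loaded. This is only a sketch: the freq keyword below is an assumed parameter name for the resampling behaviour described in the Features section, not confirmed API.

# Sketch: load the saved series resampled to weekly frequency.
# The freq keyword is an assumed parameter name for the resampling feature.
df_weekly = fs.load_dataframe(
    'tutorial/numbers',
    from_date='2020-01-01', to_date='2021-02-09',
    freq='1W'  # assumed: pandas-style offset alias
)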

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # The transform function receives a dataframe of the input feature(s)
    return df ** 2  # Square the input values

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
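The time-travel feature described earlier could be applied to the same query to retrieve values based on when they were created. The time_travel keyword below is an assumed parameter name, shown only as a sketch of the idea.

# Sketch of time travel: retrieve feature values as they were known
# one day before each timestamp. The time_travel keyword is an assumed
# parameter name based on the feature description, not confirmed API.
df_as_of = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31',
    time_travel='-1D'
)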

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

  • Tasks to automate updates to features using orchestration tools like Airflow