lightweight, fast and robust columnar dataframe for data analytics with online update

Last update: May 19, 2022

streamdf

Streamdf is a lightweight dataframe library built on top of a dictionary of NumPy arrays, developed for Kaggle's time-series code competitions.

Key Features

Fast and robust insertion
- A row can be inserted in amortized constant time (much faster than np.append, which copies the whole array on every call)
- Automatically falls back to the column's default value when an abnormal value is inserted
Time-travel
- Get a past state of the data as a slice of the original dataframe, without copying
Null/empty-safe aggregations
- Provides a set of aggregation methods that can be called safely when a column contains NaN or is empty
Columnar layout
- Internal data is stored in a simple columnar format, which is easier to use for analysis than NumPy's structured arrays
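
The amortized constant-time insertion described above is commonly achieved by over-allocating the backing array and doubling its capacity when it fills up. The following is a minimal sketch of that general technique, not streamdf's actual implementation; `GrowableColumn` and its methods are hypothetical names introduced only for illustration.

```python
import numpy as np


class GrowableColumn:
    """Hypothetical append-only column backed by an over-allocated NumPy array."""

    def __init__(self, dtype, capacity=4):
        self._buf = np.empty(capacity, dtype=dtype)
        self._size = 0  # number of valid elements in the buffer

    def append(self, value):
        if self._size == len(self._buf):
            # Double the capacity so that a full copy happens only O(log n) times.
            grown = np.empty(len(self._buf) * 2, dtype=self._buf.dtype)
            grown[:self._size] = self._buf
            self._buf = grown
        self._buf[self._size] = value
        self._size += 1

    def view(self):
        # Basic slicing returns a view, so the valid part is exposed without copying.
        return self._buf[:self._size]


col = GrowableColumn(np.int32)
for i in range(10):
    col.append(i)
```

By contrast, calling np.append in a loop allocates and copies a new array on every call, which makes n insertions cost O(n^2) in total.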

Example

import pandas as pd
from streamdf import StreamDf

df = pd.read_csv('test.csv')
sdf = StreamDf.from_pandas(df)

# extend
sdf.extend({
    'x': 1,
    'y': 2
})

assert len(sdf) == len(df) + 1

# access
print(sdf['x'])

# aggregate
sdf.last_value('x')
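
To illustrate what a null/empty-safe aggregation can look like, here is a small standalone sketch. It assumes one plausible behaviour (skip NaN and return a default when nothing valid remains); the helper below is hypothetical and is not the streamdf method of the same name, whose exact semantics are not documented here.

```python
import numpy as np


def last_value(values, default=np.nan):
    """Hypothetical helper: last non-NaN element, or `default` if there is none."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]  # drop NaN entries; an empty input stays empty
    return valid[-1] if valid.size else default


result = last_value([1.0, 2.0, np.nan])
empty_result = last_value([])
```

The point of the design is that neither an empty column nor a column of only NaN raises an exception, so feature code does not need a guard around every call.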

import numpy as np
from streamdf import StreamDf

sdf = StreamDf.empty({'x': np.int32, 'time': 'datetime64[D]'}, 'time')

sdf.extend({'x': 1, 'time': np.datetime64('2018-01-01')})
sdf.extend({'x': 5, 'time': np.datetime64('2018-02-01')})
sdf.extend({'x': 3, 'time': np.datetime64('2018-02-03')})

assert len(sdf) == 3

# Time travel (zero copy)
sliced = sdf.slice_until(np.datetime64('2018-02-02'))

assert len(sliced) == 2
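
A zero-copy time slice like the one above can be built from two plain NumPy operations: a binary search on a sorted time column, followed by basic slicing, which returns a view. The sketch below assumes the time column is sorted in ascending order and that the cutoff is exclusive; whether slice_until treats the boundary as inclusive or exclusive is not stated here, and the variable names are illustrative only.

```python
import numpy as np

times = np.array(['2018-01-01', '2018-02-01', '2018-02-03'], dtype='datetime64[D]')
x = np.array([1, 5, 3], dtype=np.int32)

# Index of the first row at or after the cutoff (assumes `times` is sorted ascending).
cutoff = np.datetime64('2018-02-02')
end = np.searchsorted(times, cutoff, side='left')

# Basic slicing returns a view on the same memory, so nothing is copied.
past_x = x[:end]
shares = np.shares_memory(past_x, x)
```

Because the slice is a view, writing to it would also modify the original column, so a past state obtained this way is best treated as read-only.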
