Python implementation of Gorilla time series compression

Overview

Gorilla Time Series Compression

This is an implementation (with some adaptations) of the compression algorithm described in section 4.1 (Time series compression) of [1] (read the paper here).

Gorilla compression is lossless.

This library can be used in three ways:

  • Timestamps only compression.
  • Values only compression (useful for regular time series compression).
  • Timestamp-Value pairs compression (useful for irregular time series compression).

In all three cases, the result of the encoding process is a dict with everything necessary for decoding (see Usage for examples). If you want to use this library for compressed message exchanges, you can serialize the result of the encoding process as you like (JSON, protobuf, etc.)

This implementation is based on section 4.1 of [1] and on the Facebook's open source implementation [2] (which have some differences).

Differences from the original paper

  • Timestamps or values can be encoded separately.
  • The header (with an aligned timestamp) at the beginning (64 bits) of the message is not encoded.
  • The float format can be specified (f64, f32, f16) to optimize the size of certain fields.

Installation

To install the latest release:

$ pip install gorillacompression

You can also build a local package and install it:

$ make build
$ pip install dist/*.whl

Usage

Import gorillacompression module.

>>> import gorillacompression as gc

Data to encode.

>>> timestamps = [1628164645, 1628164649, 1628164656, 1628164669]
>>> values = [18.95, 18.91, 17.01, 14.05]
>>> pairs = list(zip(timestamps, values))
>>> pairs
[(1628164645, 18.95), (1628164649, 18.91), (1628164656, 17.01), (1628164669, 14.05)]

In the three scenarios of compression (timestamps, values, pairs), you can use:

  • encode_all to encode all elements or encode_next to encode element by element.
  • decode_all to decode everything.

encode_next returns True if the element has been encoded correctly, False if the element has not been encoded accompanied by a warning explaining the reason.

Timestamps only compression

The expected input timestamp is a POSIX timestamp less than 2147483647 ('January 19, 2038 04:14:07'). The delta between two successive timestamps must be greater than or equal to 0.

You can use encode_all to encode all timestamps:

>>> content = gc.TimestampsEncoder.encode_all(timestamps)
>>> content
{'encoded': b'\xc2\x17\xa4K\x08\xa1Q@', 'nb_timestamps': 4}
>>> gc.TimestampsDecoder.decode_all(content)
[1628164645, 1628164649, 1628164656, 1628164669]

Or you can use encode_next to encode one by one:

>>> ts_encoder = gc.TimestampsEncoder()
>>> for ts in timestamps:
...     ts_encoder.encode_next(ts)
>>> content = ts_encoder.get_encoded()
>>> content
{'encoded': b'\xc2\x17\xa4K\x08\xa1Q@', 'nb_timestamps': 4}
>>> gc.TimestampsDecoder.decode_all(content)
[1628164645, 1628164649, 1628164656, 1628164669]

Values only compression

You can use encode_all to encode all values:

>>> content = gc.ValuesEncoder.encode_all(values)
>>> content
{'encoded': b'@2\xf333333\xe7f\xf1\xbco\x1b\xc6\xee\xc7\xeaz\x9e\xa7\xa9\xeb\xaf^\x8d\x8bb\xd8\xb6,\x80', 'nb_values': 4, 'float_format': 'f64'}
>>> gc.ValuesDecoder.decode_all(content)
[18.95, 18.91, 17.01, 14.05]

Or you can use encode_next to encode one by one:

>>> values_encoder = gc.ValuesEncoder()
>>> for v in values:
...     values_encoder.encode_next(v)
>>> content = values_encoder.get_encoded()
>>> content
{'encoded': b'@2\xf333333\xe7f\xf1\xbco\x1b\xc6\xee\xc7\xeaz\x9e\xa7\xa9\xeb\xaf^\x8d\x8bb\xd8\xb6,\x80', 'nb_values': 4, 'float_format': 'f64'}
>>> gc.ValuesDecoder.decode_all(content)
[18.95, 18.91, 17.01, 14.05]

Timestamp-Value pairs compression

You can use encode_all to encode all pairs:

>>> content = gc.PairsEncoder.encode_all(pairs)
>>> content
{'encoded': b'\xc2\x17\xa4J\x80e\xe6ffffg\x08\xe7f\xf1\xbco\x1b\xc6\xd0\xb7c\xf5=OS\xd4\xf5\xa2\xeb\xd7\xa3b\xd8\xb6-\x8b ', 'nb_pairs': 4, 'float_format': 'f64'}
>>> gc.PairsDecoder.decode_all(content)
[(1628164645, 18.95), (1628164649, 18.91), (1628164656, 17.01), (1628164669, 14.05)]

Or you can use encode_next to encode one by one:

>>> pairs_encoder = gc.PairsEncoder()
>>> for (ts, v) in pairs:
...     pairs_encoder.encode_next(ts, v)
>>> content = pairs_encoder.get_encoded()
>>> content
{'encoded': b'\xc2\x17\xa4J\x80e\xe6ffffg\x08\xe7f\xf1\xbco\x1b\xc6\xd0\xb7c\xf5=OS\xd4\xf5\xa2\xeb\xd7\xa3b\xd8\xb6-\x8b ', 'nb_pairs': 4, 'float_format': 'f64'}
>>> gc.PairsDecoder.decode_all(content)
[(1628164645, 18.95), (1628164649, 18.91), (1628164656, 17.01), (1628164669, 14.05)]

Gorilla compression algorithm explanation

Below is a brief explanation of the implemented method. (Refer to [1] section 4.1 (read the paper here) for the original explanation)

Timestamps compression

  • The first timestamp is encoded in a fixed number of bits.
  • The following timestamps are encoded as follows:
  (a) Calculate the delta of delta
          D = (t_n − t_(n−1)) − (t_(n−1) − t_(n−2))
  (b) If D is zero, then store a single ‘0’ bit
  (c) If D is between [-63, 64], store ‘10’ followed by the value (7 bits)
  (d) If D is between [-255, 256], store ‘110’ followed by the value (9 bits)
  (e) if D is between [-2047, 2048], store ‘1110’ followed by the value (12 bits)
  (f) Otherwise store ‘1111’ followed by D using 32 bits

Values compression

Notation

    n bits:
    +---- n ----+
    |           |
    +---- n' ---+

    n bytes:
    +==== n ====+
    |           |
    +==== n' ===+

    `~` in place of `n` means a variable number of bytes or bits.

    When it makes sense, n refers to the default value, and n' to the variable containing the value.

This explanation corresponds to the case of float format f64, for the other formats (f32, f16), the size of some fields is different (refer to the code for more details).

  1. The first value is stored with no compression.
    +======================= 8 =======================+
    |  First value (IEEE 754, binary64, Big Endian)   |
    +======================= 8 =======================+
  1. If XOR with the previous is zero (same value), store single ‘0’ bit.
    +-- 1 --+
    |   0   |
    +-- 1 --+
  1. When XOR is non-zero, calculate the number of leading and trailing zeros in the XOR, store bit ‘1’ followed by either a) or b):
  • (a) (Control bit ‘0’) If the block of meaningful bits falls within the block of previous meaningful bits*, i.e., there are at least as many leading zeros and as many trailing zeros as with the previous value, use that information for the block position and just store the meaningful XORed value*.
    +--- 2 ---+--- length of the meaningful XORed value ---+
    |   10    |         [meaningful XORed value]           |
    +--- 2 ---+--- length of the meaningful XORed value ---+
  • (b) (Control bit ‘1’) Store the length of the number of leading zeros in the next 5 bits, then store the length of the meaningful XORed value in the next 6 bits. Finally store the meaningful bits of the XORed value.
    +--- 2 ---+------------- 5 -------------+------------------- 6 ------------------+--- length of the meaningful XORed value ---+
    |   11    |   number of leading zeros   |   length of the meaningful XORed value |         [meaningful XORed value]           |
    +--- 2 ---+------------- 5 -------------+------------------- 6 ------------------+--- length of the meaningful XORed value ---+
  1. After the compression of the last value, if the length of the bitarray is not a multiple of 8, the few remaining bits are padded with zero.
    +---- n ----+
    |   0...0   |
    +---- n ----+

    n < 8

(*) The terms "meaningful bits" and "meaningful XORed value" used in the original paper may be confusing.

  • In case (b), "meaningful XORed value" is a value with absolutely no leading and trailing zero.
  • In case (a), "meaningful XORed value" is the XORed value striped off same amount of leading and trailing zeroes as previous value delta. The meaningful bits in this case may still contain some leading and trailing zeroes.

Timestamp-Value pairs compression

The encoding of a pair is the encoding of the timestmap followed by the encoding of the value.

Contribute

Please, open issues. PR are very welcome!

$ git clone https://github.com/ghilesmeddour/gorilla-time-series-compression.git
$ cd gorilla-time-series-compression
make format
make dead-code-check
make test
make type-check
make coverage
make build

TODOs

  • Add more unit tests (f32 and f16 float formats are currently not tested).
  • Add profiling, benchmarks, etc.
  • Improve doc, docstring, etc.

Other implementations

References

[1] Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., & Veeraraghavan, K. (2015). Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12), 1816-1827.

You might also like...
Compute the fair market value (FMV) of staking rewards at time of receipt.

tendermint-tax A tool to help calculate the tax liability of staking rewards on Tendermint chains. Specifically, this tool calculates the fair market

This is a python table of data implementation with styles, colors
This is a python table of data implementation with styles, colors

Table This is a python table of data implementation with styles, colors Example Table adapts to the lack of data Lambda color features Full power of l

A simple python implementation of Decision Tree.

DecisionTree A simple python implementation of Decision Tree, using Gini index. Usage: import DecisionTree node = DecisionTree.trainDecisionTree(lab

A fast Python implementation of Ac Auto Mechine

A fast Python implementation of Ac Auto Mechine

✨ Voici un code en Python par moi, et en français qui permet d'exécuter du Javascript en Python.
✨ Voici un code en Python par moi, et en français qui permet d'exécuter du Javascript en Python.

JavaScript In Python ❗ Voici un code en Python par moi, et en français qui permet d'exécuter du Javascript en Python. 🔮 Une vidéo pour vous expliquer

Simple python module to get the information regarding battery in python.
Simple python module to get the information regarding battery in python.

Battery Stats A python3 module created for easily reading the current parameters of Battery in realtime. It reads battery stats from /sys/class/power_

Python @deprecat decorator to deprecate old python classes, functions or methods.

deprecat Decorator Python @deprecat decorator to deprecate old python classes, functions or methods. Installation pip install deprecat Usage To use th

A python package containing all the basic functions and classes for python. From simple addition to advanced file encryption.
A python package containing all the basic functions and classes for python. From simple addition to advanced file encryption.

A python package containing all the basic functions and classes for python. From simple addition to advanced file encryption.

Find dependent python scripts of a python script in a project directory.

Find dependent python scripts of a python script in a project directory.

Comments
  • OverflowError: unsigned integer not in range(0, 64), got 64

    OverflowError: unsigned integer not in range(0, 64), got 64

    I got this message when trying to encode values :

    
    values = [-0.39263690585168304, -0.39263690585168304, -0.39263690585168304, 0.450762617155903, 0.450762617155903, 0.450762617155903, -0.284155454538896]
    values_encoder = gc.ValuesEncoder()
    for v in values:
        print(v)
        values_encoder.encode_next(v)
    

    This output :

    -0.39263690585168304
    -0.39263690585168304
    -0.39263690585168304
    0.450762617155903
    0.450762617155903
    0.450762617155903
    -0.284155454538896
    
    ---------------------------------------------------------------------------
    OverflowError                             Traceback (most recent call last)
    /tmp/ipykernel_9713/2845470605.py in <module>
         10 for v in values:
         11     print(v)
    ---> 12     values_encoder.encode_next(v)
    
    ~/.local/lib/python3.8/site-packages/gorillacompression/values/encode.py in encode_next(self, value)
        118             # Encode length of the meaningful XORed value.
        119             length_of_the_meaningful_xored_value = self.n_bits_value - n_leading_zeros - n_trailing_zeros
    --> 120             self.bit_array += util.int2ba(
        121                 length_of_the_meaningful_xored_value,
        122                 length=self.n_bits_length_of_the_meaningful_xored_value)
    
    ~/.local/lib/python3.8/site-packages/bitarray/util.py in int2ba(__i, length, endian, signed)
        267             raise OverflowError("unsigned integer not positive, got %d" % __i)
        268         if length and __i >= (1 << length):
    --> 269             raise OverflowError("unsigned integer not in range(0, %d), "
        270                                 "got %d" % (1 << length, __i))
        271 
    
    OverflowError: unsigned integer not in range(0, 64), got 64
    
    

    Version 0.0.1

    same problem with encode_all

    opened by Jeanbouvatt 3
  • Make encode_all a class method, or add a float_format parameter to the method

    Make encode_all a class method, or add a float_format parameter to the method

    Currently, encode_all is a static method of the ValuesEncoder class. Its usage can be misleading when a float format different from f64 is used:

    content = gc.ValuesEncoder(float_format='f16').encode_all(sequence) content['float_format']

    The result is 'f64', since the method does not take into account the parameters of the instance of ValuesEncoder used to invoke the method.

    I would suggest to make encode_all a class method, and/or add a float_format parameter to the static method to avoid this potential trouble.

    opened by mgoeminne 2
  • Encode all unable to use 'f32' or 'f16' formats

    Encode all unable to use 'f32' or 'f16' formats

    Hi,

    In line 148 of the file encode.py, you seem to have made a small error in the sense that you initialize a default ValuesEncoder. This means that if you are trying to do encoding in a different format (say 'f32') this will be superceded by the default. In other words, this means that in the current pip implementation of gorillacompression it is impossible to do anything but 'f64' compression unless you use encode_next manually.

    Simple fix would be to initialize that value encoder in that function using the values from "self.". (Alternatively, I can make a pull request).

    Regards, Zack

    opened by HeatPhoenix 2
Releases(v0.2.1)
Owner
Ghiles Meddour
Data Analyst at Munic
Ghiles Meddour
A collection of common regular expressions bundled with an easy to use interface.

CommonRegex Find all times, dates, links, phone numbers, emails, ip addresses, prices, hex colors, and credit card numbers in a string. We did the har

Madison May 1.5k Dec 31, 2022
Aurin - A quick AUR installer for Arch Linux. Install packages from AUR website in a click.

Aurin - A quick AUR installer for Arch Linux. Install packages from AUR website in a click.

Suleman 51 Nov 04, 2022
Just some scripts to export vector tiles to geojson.

Vector tiles to GeoJSON Nowadays modern web maps are usually based on vector tiles. The great thing about vector tiles is, that they are not just imag

Lilith Wittmann 77 Jul 26, 2022
Use generator for range function

Use the generator for the range function! installation method: pip install yrange How to use: First import yrange in your application. You can then wo

1 Oct 28, 2021
Fcpy: A Python package for high performance, fast convergence and high precision numerical fractional calculus computing.

Fcpy: A Python package for high performance, fast convergence and high precision numerical fractional calculus computing.

SciFracX 1 Mar 23, 2022
A small python tool to get relevant values from SRI invoices

SriInvoiceProcessing A small python tool to get relevant values from SRI invoices Some useful info to run the tool Login into your SRI account and ret

Wladymir Brborich 2 Jan 07, 2022
🔩 Like builtins, but boltons. 250+ constructs, recipes, and snippets which extend (and rely on nothing but) the Python standard library. Nothing like Michael Bolton.

Boltons boltons should be builtins. Boltons is a set of over 230 BSD-licensed, pure-Python utilities in the same spirit as — and yet conspicuously mis

Mahmoud Hashemi 6k Jan 04, 2023
Functional UUIDs for Python.

🏷️FUUID stands for Functional Universally Unique IDentifier. FUUIDs are compatible with regular UUIDs but are naturally ordered by generation time, collision-free and support succinct representations

Phil Demetriou 147 Oct 27, 2022
This project is a set of programs that I use to create a README.md file.

This project is a set of programs that I use to create a README.md file.

Tom Dörr 223 Dec 24, 2022
produces PCA on genotypes from fasta files (popPhyl's ID format)

popPhyl_PCA Performs PCA of genotypes. Works in two steps. 1. Input file A single fasta file containing different loci, in different populations/speci

camille roux 2 Oct 08, 2021
Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc.

citation-url Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc. This module has a single parse() function that takes in

Charles Tapley Hoyt 2 Feb 12, 2022
Simple script to export contacts from telegram into vCard file

Telegram Contacts Exporter Simple script to export contacts from telegram into vCard file Getting Started Prerequisites You must to put your Telegram

Pere Antoni 1 Oct 17, 2021
✨ Un générateur de lien raccourcis en fonction d'un lien totalement fait en Python par moi, et en français.

Shorter Link ❗ Un générateur de lien raccourcis en fonction d'un lien totalement fait en Python par moi, et en français. Dépendences : pip install pys

MrGabin 3 Jun 06, 2021
This python program will display all SSID usernames and SSID passwords you once connected to your laptop

Windows-Wifi-password-extractor This python program will display all SSID usernames and SSID passwords you once connected to your laptop How to run th

Bhaskar Pal 3 Apr 26, 2022
SH-PUBLIC is a python based cloning script. You can clone unlimited UID facebook accounts by using this tool.

SH-PUBLIC is a python based cloning script. You can clone unlimited UID facebook accounts by using this tool. This tool works on any Android devices without root.

(Md. Tanvir Ahmed) 5 Mar 09, 2022
A thing to simplify listening for PG notifications with asyncpg

A thing to simplify listening for PG notifications with asyncpg

ANNA 18 Dec 23, 2022
Networkx with neo4j back-end

Dump networkx graph into nodes/relations TSV from neo4jnx.tsv import graph_to_tsv g = pklload('indranet_dir_graph.pkl') graph_to_tsv(g, 'docker/nodes.

Benjamin M. Gyori 1 Oct 27, 2021
A simple python implementation of Decision Tree.

DecisionTree A simple python implementation of Decision Tree, using Gini index. Usage: import DecisionTree node = DecisionTree.trainDecisionTree(lab

1 Nov 12, 2021
Python Random Number Genrator

This Genrates Random Numbers. This Random Number Generator was made using python. This software uses Time and Random extension. Download the EXE file and run it to get your answer.

Krish Sethi 2 Feb 03, 2022
Python @deprecat decorator to deprecate old python classes, functions or methods.

deprecat Decorator Python @deprecat decorator to deprecate old python classes, functions or methods. Installation pip install deprecat Usage To use th

12 Dec 12, 2022