Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with counting sort, creating a dictionary with each unique number as key and the number of occurences of it in the list as the value:

d = {1: 1, 2: 3, 3: 1}

To decrease the space needed to store the numbers in memory, we'll only store the first number and then the difference between each of the next numbers and the previous one:

d2 = [(1, 1), (1, 3), (1, 1))

Now, the minimum amount of memory we need to store every key that's in d2 is 1 bit, since 1 is the maximum difference between any subsequent elements. The same applies to the values, except that to store any value here we need 2 bits of memory, since the maximum value is 3(11 in binary). So we know that we can store this list as a sequence of 3 bits elements, like this:

d2_bin = ["101", "111", 101"]

We can now return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, allowing the value to be decompressed.

Memory efficiency

Here's a list with the sum of the number of bits of all numbers in a list with 100 elements, generated with random values in the range 0 to 50 and generated 20 times, vs. the number of bits in the resulting compressed integer(taking as a premise that all numbers in the array are all actually stored in continuous memory, including duplicates):

And 1000 numbers from 0 to 50, also 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359

Weird Sort-and-Compress Thing

Related tags

Overview

Weird Sort-and-Compress Thing

How it works

Memory efficiency

Owner

Douglas

A Chinese to English Neural Model Translation Project

Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Built for cleaning purposes in military institutions

A natural language modeling framework based on PyTorch

Official PyTorch implementation of Time-aware Large Kernel (TaLK) Convolutions (ICML 2020)

Findings of ACL 2021

FireFlyer Record file format, writer and reader for DL training samples.

Unsupervised text tokenizer focused on computational efficiency

Tool to check whether a GCP bucket is public or not.

NLP, before and after spaCy

Minimal GUI for accessing the Watson Text to Speech service.

Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

Simple and efficient RevNet-Library with DeepSpeed support

Utilities for preprocessing text for deep learning with Keras

Code for the paper "VisualBERT: A Simple and Performant Baseline for Vision and Language"

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

txtai: Build AI-powered semantic search applications in Go

Curso práctico: NLP de cero a cien 🤗

Yet Another Compiler Visualizer

This is my reading list for my PhD in AI, NLP, Deep Learning and more.