An Indexer that works out-of-the-box when you have fewer than 100K stored Documents

Last update: Mar 15, 2022

U100KIndexer

An Indexer that works out-of-the-box when you have fewer than 100K stored Documents. U100K stands for "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20 to 120 QPS for batch queries. Results are returned as full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as its storage backend and .match() to perform nearest-neighbour search. It returns full Documents as-is, so there is no need to pair it with a separate key-value indexer to retrieve Documents.
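To illustrate what exhaustive nearest-neighbour search means here, the following is a minimal pure-Python sketch. It is not the U100KIndexer or .match() implementation. The helper names (cosine_distance, exhaustive_search), the choice of cosine distance, and the toy data are all assumptions made for illustration.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(query, stored, top_k=3):
    """Score the query against every stored vector and return the
    top_k (index, distance) pairs, closest first."""
    scored = [(i, cosine_distance(query, vec)) for i, vec in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


random.seed(0)
dim = 8  # toy dimension (assumption); the benchmark uses 768
stored = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]

# The query is a slightly perturbed copy of stored vector 42,
# so that vector is expected to rank first.
query = [x + 0.01 * random.gauss(0, 1) for x in stored[42]]

results = exhaustive_search(query, stored)
print(results)
```

Because every stored vector is scored, no candidate is ever skipped, which is why recall is the highest possible. The cost is that the work per query is proportional to the number of stored vectors.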

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

Slow query time beyond 100K: search is exhaustive, so latency grows with the number of stored Documents

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times are in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
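The QPS range quoted earlier can be recovered from the 100K row of the table by dividing the query size by the measured time. The short sketch below does that arithmetic. The function name queries_per_second is hypothetical, and the latencies are copied from the table, not measured again.

```python
# Latencies in seconds at 100K stored Documents, copied from the table.
latency_at_100k = {1: 0.297, 8: 0.332, 64: 0.536}


def queries_per_second(batch_size, seconds):
    """Throughput when `batch_size` queries complete in `seconds`."""
    return batch_size / seconds


qps = {size: queries_per_second(size, t) for size, t in latency_at_100k.items()}

for size, value in qps.items():
    print(f"query size {size}: {value:.1f} QPS")
```

A batch of 8 works out to roughly 24 QPS and a batch of 64 to roughly 119 QPS, which matches the quoted 20 to 120 QPS range. A single query at 0.297 s corresponds to the quoted figure of about 300 ms.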

Tips

To change the workspace, pass it through metas:

U100KIndexer(metas={'workspace': './my'})

Alternatively, use .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.

Owner

Jina AI
