A library for end-to-end learning of embedding index and retrieval model

Last update: Dec 21, 2022

Poeem

Poeem is a library for efficient approximate nearest neighbor (ANN) search, a technique widely adopted in industrial recommendation, advertising and search systems. Unlike other libraries such as Faiss and ScaNN, which build embedding indexes from already learned embeddings, Poeem learns the embedding index jointly with the retrieval model in order to avoid quantization distortion. As a result, Poeem significantly outperforms previous methods, as shown in our SIGIR paper. Poeem is built on the GPU version of TensorFlow 1.15, and some of its core functionality is written in C++ as custom TensorFlow ops. It is developed by JD.com Search.
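
To make "quantization distortion" concrete, here is a minimal NumPy sketch of product quantization applied to already learned embeddings, which is the post-hoc setting that Poeem is contrasted with. It is not Poeem's code: the codebooks are random rather than trained, and the names `pq_quantize` and `codebooks`, as well as the sizes (8 subspaces, 16 centroids each), are assumptions chosen only for illustration.

```python
import numpy as np

def pq_quantize(emb, codebooks):
    """Quantize each row of `emb` with product quantization.

    emb:       (n, d) float array of embeddings
    codebooks: (m, k, d // m) float array, k centroids for each of m subspaces
    Returns (codes, emb_quantized) with shapes (n, m) and (n, d).
    """
    n, d = emb.shape
    m, k, sub_d = codebooks.shape
    assert d == m * sub_d, "embedding dim must equal m * sub_d"
    sub = emb.reshape(n, m, sub_d)  # split each embedding into m subvectors
    # Squared distance from every subvector to every centroid: shape (n, m, k)
    dist = ((sub[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(axis=-1)
    codes = dist.argmin(axis=-1)  # index of the nearest centroid per subspace
    # Look up the chosen centroid in each subspace: shape (n, m, sub_d)
    quantized = codebooks[np.arange(m)[None, :], codes]
    return codes, quantized.reshape(n, d)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))         # stand-in for learned embeddings
codebooks = rng.normal(size=(8, 16, 8))  # untrained codebooks, illustration only
codes, emb_q = pq_quantize(emb, codebooks)

# Distortion: mean squared distance between each embedding and its quantized form
distortion = np.mean(np.sum((emb - emb_q) ** 2, axis=1))
print("mean squared distortion:", distortion)
```

When the index is built after training, the retrieval model never sees this error. Joint training, as described above, lets the model and the index adapt to each other.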

For more details, see our SIGIR 2021 paper, "Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index".

System Requirements

We only support Linux systems for now, e.g., CentOS and Ubuntu. Windows users might need to build the library from source.
Python 3.6 installation.
TensorFlow GPU version 1.15 (pip install tensorflow-gpu==1.15.0). Other TensorFlow versions are not tested.
CUDA toolkit 10.1, required by TensorFlow GPU 1.15.

Quick Start

Poeem aims to be an almost drop-in utility for training and serving large-scale embedding retrieval models. We try to make it as easy to use as we can.

Install

On most Linux systems, Poeem can be installed with pip.

$ pip install poeem

Quick usage

As an extremely simple example, you can use Poeem with the following commands:

>>> import tensorflow as tf, poeem
>>> hparams = poeem.embedding.PoeemHparam()
>>> poeem_indexing_layer = poeem.embedding.PoeemEmbed(64, hparams)
>>> emb = tf.random.normal([100, 64])  # original embedding before indexing layer
>>> emb_quantized, coarse_code, code, regularizer = poeem_indexing_layer.forward(emb)
>>> # forward value equals emb_quantized; gradients flow back to the original emb
>>> emb = emb - tf.stop_gradient(emb - emb_quantized)   # use this embedding for downstream computation
>>> with tf.Session() as sess:
...     sess.run(tf.global_variables_initializer())
...     sess.run(emb)
...
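
The line `emb = emb - tf.stop_gradient(emb - emb_quantized)` follows a pattern commonly called a straight-through estimator: the forward value is the quantized embedding, while the gradient is passed to the original embedding as if quantization were the identity. The NumPy sketch below mimics this without TensorFlow by treating the `stop_gradient` term as a frozen constant. The `quantize` function is a hypothetical stand-in (simple rounding), not Poeem's quantizer, and `straight_through` is a name introduced only for this illustration.

```python
import numpy as np

def quantize(x):
    # Hypothetical stand-in quantizer: round to the nearest integer.
    return np.round(x)

def straight_through(x, frozen):
    # `frozen` plays the role of tf.stop_gradient(x - quantize(x)):
    # it is computed once and then held constant.
    return x - frozen

x = np.array([0.2, 1.7, -0.6])
frozen = x - quantize(x)

# Forward pass: the output equals the quantized value.
out = straight_through(x, frozen)

# Finite-difference slope with respect to x while `frozen` is held fixed.
eps = 1e-4
st_slope = (straight_through(x + eps, frozen)
            - straight_through(x - eps, frozen)) / (2 * eps)

# The same check on the quantizer itself: it is flat almost everywhere.
q_slope = (quantize(x + eps) - quantize(x - eps)) / (2 * eps)

print("forward output:", out)
print("straight-through slope:", st_slope)
print("raw quantizer slope:", q_slope)
```

The raw quantizer has zero slope almost everywhere, so it would pass no training signal back to the embedding. The straight-through form keeps a slope of one, which is why it is usable for end-to-end training.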

Tutorial

The quick-start example above does not show how to build an embedding index or how to serve it online. Users interested in applying Poeem in a real-world or industrial system can read the tutorials for further details.

Authors

The main authors of Poeem are:

Han Zhang wrote most of the Python models and conducted most of the experiments.
Hongwei Shen wrote most of the C++ TensorFlow ops and managed the pip release package.
Yunjiang Jiang developed the rotation algorithm and wrote the related code.
Wen-Yun Yang initiated the Poeem project, wrote some of the TensorFlow ops, integrated the different parts and wrote the tutorials.

How to Cite

Please cite the following reference if you use Poeem in a research paper or in a real-world system:

  @inproceedings{poeem_sigir21,
    title={Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index},
    author={Han Zhang and Hongwei Shen and Yiming Qiu and Yunjiang Jiang and Songlin Wang and Sulong Xu and Yun Xiao and Bo Long and Wen-Yun Yang},
    booktitle={The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages={},
    year={2021}
  }

License

MIT licensed
