It is a forest of random projection trees

Overview

rpforest

rpforest

CircleCI

rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given query point in a fast but approximate manner.

rpforest differs from alternative ANN packages such as annoy by not requiring the storage of all the vectors indexed in the model. Used in this way, rpforest serves to produce a list of candidate ANNs for use by a further service where point vectors are stored (for example, a relational database).

How it works

It works by building a forest of N binary random projection trees.

In each tree, the set of training points is recursively partitioned into smaller and smaller subsets until a leaf node of at most M points is reached. Each parition is based on the cosine of the angle the points make with a randomly drawn hyperplane: points whose angle is smaller than the median angle fall in the left partition, and the remaining points fall in the right partition.

The resulting tree has predictable leaf size (no larger than M) and is approximately balanced because of median splits, leading to consistent tree traversal times.

Querying the model is accomplished by traversing each tree to the query point's leaf node to retrieve ANN candidates from that tree, then merging them and sorting by distance to the query point.

Installation

  1. Install numpy first.
  2. Install rpforest using pip: pip install rpforest

Usage

Fitting

Model fitting is straightforward:

from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X)

The speed-precision tradeoff is governed by the leaf_size and no_trees parameters. Increasing leaf_size leads the model to produce shallower trees with larger leaf nodes; increasing no_trees fits more trees.

In-memory queries

Where the entire set of points can be kept in memory, rpforest supports in-memory ANN queries. After fitting, ANNs can be obtained by calling:

nns = model.query(x_query, 10)

Return nearest neighbours for vector x by first retrieving candidate NNs from x's leaf nodes, then merging them and sorting by cosine similarity with x. At most no_trees * leaf_size NNs will can be returned.

Candidate queries

rpforest can support indexing and candidate ANN queries on datasets larger than would fit in available memory. This is accomplished by first fitting the model on a subset of the data, then indexing a larger set of data into the fitted model:

from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X_train)

model.clear()  # Deletes X_train vectors

for point_id, x in get_x_vectors():
     model.index(point_id, x)

nns = model.get_candidates(x_query, 10)

Model persistence

Model persistence is achieved simply by pickling and unpickling.

model = pickle.loads(pickle.dumps(model))

Performance

Erik Bernhardsson, the author of annoy, maintains an ANN performance shootout repository, comparing a number of Python ANN packages.

On the GloVe cosine distance benchmark, rpforest is not as fast as highly optimised C and C++ packages like FLANN and annoy. However, it far outerpforms scikit-learn's LSHForest and panns.

Performance

Development

Pull requests are welcome. To install for development:

  1. Clone the rpforest repository: git clone [email protected]:lyst/rpforest.git
  2. Install it for development using pip: cd rpforest && pip install -e .
  3. You can run tests by running python setupy.py test.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize in order to produce the extension .cpp files before running pip install -e ..

Comments
  • Is rpforest supports custom similarity/distance function

    Is rpforest supports custom similarity/distance function

    hi, @maciejkula , According to your paper titled "Metadata Embeddings for User and Item Cold-start Recommendations", lyst generate recommendation using lightfm and some kind of ANN algorithm. So I came to rpforest in lyst's repository and I think maybe that's exactly the ANN. Now Suppose that we have trained a lightfm model, include embeddings and bias. It seems that it is still hard to rapidly generate top-k recommendation using rpforest, Since as Readme said, rpforest is based on cosine similarity, however, the score for a user-item pair in lightfm is the sum of a dot product of two embeddings and two bias. So my question is:

    Is rpforest supports custom similarity/distance function, or some other way can achieve top-k recommendation?

    thanks jianyi

    opened by hiyijian 10
  • Compile error when installing rpforest

    Compile error when installing rpforest

    In file included from rpforest/rpforest_fast.cpp:271:
    
    In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4:
    
    In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17:
    
    In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1804:
    
    /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
    
    #warning "Using deprecated NumPy API, disable it by " \
    
     ^
    
    rpforest/rpforest_fast.cpp:5727:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
    
        __pyx_v_node->indices->shrink_to_fit();
    
        ~~~~~~~~~~~~~~~~~~~~~  ^
    
    rpforest/rpforest_fast.cpp:5940:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
    
        __pyx_v_node->indices->shrink_to_fit();
    
        ~~~~~~~~~~~~~~~~~~~~~  ^
    
    1 warning and 2 errors generated.
    
    error: command 'gcc' failed with exit status 1
    
    opened by delip 10
  • Does windows support the libs

    Does windows support the libs

    In win7 environment, when i install rpforest ,i met the problem. i use vs2015. the complie error informatios: C:\Users\juine\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python \9.0\VC\Bin\cl.exe /c logo /Ox /MD /W3 /GS- /DNDEBUG -ID:\Python27\lib\site-p ackages\numpy\core\include -ID:\Python27\include -ID:\Python27\PC /Tprpforest/rp forest_fast.cpp /Fobuild\temp.win32-2.7\Release\rpforest/rpforest_fast.obj -ffas t-math cl : Command line warning D9002 : ignoring unknown option '-ffast-math' rpforest_fast.cpp d:\python27\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_ap i.h(12) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_N O_DEPRECATED_API NPY_1_7_API_VERSION rpforest/rpforest_fast.cpp(271) : fatal error C1083: Cannot open include fil e: 'stdint.h': No such file or directory error: command 'C:\Users\juine\AppData\Local\Programs\Common\Microsof t\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2

    opened by juine 4
  • C++ error with python 3.5

    C++ error with python 3.5

    Hello, I'm trying to fir an rpforet module on a big matrix (3000000 x 300) in python 3.5 on OS X 10.11 and I get the following error:

    Traceback (most recent call last):
      File "rpforest_test.py", line 29, in <module>
        index.fit(model.syn0)
      File "/usr/local/lib/python3.5/site-packages/rpforest/rpforest.py", line 81, in fit
        tree.make_tree(self._X)
      File "rpforest/rpforest_fast.pyx", line 237, in rpforest.rpforest_fast.Tree.make_tree (rpforest/rpforest_fast.cpp:3896)
    ValueError: Buffer dtype mismatch, expected 'double' but got 'float'
    
    opened by w4nderlust 2
  • Errors installing rpforest in conda environment on Mac OS X

    Errors installing rpforest in conda environment on Mac OS X

    OS/compiler details:

    OS X version: 10.11.2 (El Capitan)

    $ clang --version
    Apple LLVM version 7.0.2 (clang-700.1.81)
    Target: x86_64-apple-darwin15.2.0
    Thread model: posix
    

    gcc is an alias for clang.

    Installing rpforest in a fresh virtualenv environment works fine:

    $ mkvirtualenv rpfenv
    ... python 3.4 env built ...
    (rpfenv)$ pip install numpy
    ... numpy 1.10.2 installed ...
    (rpfenv)$ pip install rpforest
    ... rpforest 1.1 installed ...
    

    rpforest_fast was compiled successfully with:

    clang -Wno-unused-result -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/dsc/.virtualenvs/rpfenv/lib/python3.4/site-packages/numpy/core/include -I/usr/local/Cellar/python3/3.4.3_2/Frameworks/Python.framework/Versions/3.4/include/python3.4m -c rpforest/rpforest_fast.cpp -o build/temp.macosx-10.10-x86_64-3.4/rpforest/rpforest_fast.o -std=c++11
    

    Trying to do the same in a conda environment results in compilation errors however:

    $ conda create -n rpfenv2 python=3.4
    ... python 3.4 env built ...
    $ source activate rpfenv2
    (rpfenv2)$ pip install numpy
    ... numpy 1.10.2 installed ...
    (rpfenv2)$ pip install rpforest
    

    rpforest 1.1 is downloaded, but compilation fails. Compilation command:

    clang -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/dsc/miniconda3/envs/rpfenv2/include -arch x86_64 -I/Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include -I/Users/dsc/miniconda3/envs/rpfenv2/include/python3.4m -c rpforest/rpforest_fast.cpp -o build/temp.macosx-10.5-x86_64-3.4/rpforest/rpforest_fast.o -std=c++11
    

    Compiler errors:

      In file included from rpforest/rpforest_fast.cpp:271:
      In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4:
      In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:18:
      In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/ndarraytypes.h:1781:
      /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
      #warning "Using deprecated NumPy API, disable it by " \
       ^
      rpforest/rpforest_fast.cpp:5727:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
          __pyx_v_node->indices->shrink_to_fit();
          ~~~~~~~~~~~~~~~~~~~~~  ^
      rpforest/rpforest_fast.cpp:5940:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
          __pyx_v_node->indices->shrink_to_fit();
          ~~~~~~~~~~~~~~~~~~~~~  ^
      1 warning and 2 errors generated.
      error: command 'clang' failed with exit status 1
    
    opened by davechallis 1
  • label points that are being fit()

    label points that are being fit()

    I'm not sure if the implementation already supports this, but is it possible assign a label with every point with fit(), so when there is a query, I can identify the neighbors by the labels?

    opened by delip 1
  • CircleCI 2.0, tox, py35+ tests and support

    CircleCI 2.0, tox, py35+ tests and support

    Decided to use tox, so that you can:

    • run tests locally in multiple python versions
    • have a consistent testing platform across CI and local environment

    Also:

    • updated readme
    • refactored setup.py ... made it not a fatal exception if python setup.py is run without an installed numpy
    • fixed flake8 issues
    • black-ified code
    • fixed tests code to support py35+
    • updated cpp library with the latest cython 0.29.14 (previously 0.23.4)
    opened by iserko 0
  • Reuse hyperplanes

    Reuse hyperplanes

    This PR makes all interior nodes of a tree at a given depth now use the same projection hyperplane. This drastically reduces the memory footprint of the tree without affecting the guarantees of the data structure (which relies on the hyperplanes being independently drawn _ between_ the trees in the forest).

    opened by maciejkula 0
  • Raising error when tree already exists

    Raising error when tree already exists

    Referencing https://github.com/lyst/rpforest/blob/master/rpforest/rpforest.py#L59

    The tree already exists, so is there a way to handle this gracefully instead of raising an error?

    opened by RitwikGupta 0
Owner
Lyst
Your World of Fashion
Lyst
A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

Stox 31 Dec 16, 2022
Book Recommender System Using Sci-kit learn N-neighbours

Model-Based-Recommender-Engine I created a book Recommender System using Sci-kit learn's N-neighbours algorithm for my model and the streamlit library

1 Jan 13, 2022
Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogramas anuais com spark, em pyspark e SQL!

Olá! Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogr

Henrique de Paula 10 Apr 04, 2022
Iterative stochastic gradient descent (SGD) linear regressor with regularization

SGD-Linear-Regressor Iterative stochastic gradient descent (SGD) linear regressor with regularization Dataset: Kaggle “Graduate Admission 2” https://w

Zechen Ma 1 Oct 29, 2021
Deep Survival Machines - Fully Parametric Survival Regression

Package: dsm Python package dsm provides an API to train the Deep Survival Machines and associated models for problems in survival analysis. The under

Carnegie Mellon University Auton Lab 10 Dec 30, 2022
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 07, 2022
A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

pyUpSet A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al. Contents Purpose How to install How it work

288 Jan 04, 2023
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Telekom Open Source Software 17 Nov 20, 2022
Predict the output which should give a fair idea about the chances of admission for a student for a particular university

Predict the output which should give a fair idea about the chances of admission for a student for a particular university.

ArvindSandhu 1 Jan 11, 2022
Datetimes for Humans™

Maya: Datetimes for Humans™ Datetimes are very frustrating to work with in Python, especially when dealing with different locales on different systems

Timo Furrer 3.4k Dec 28, 2022
An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models

Seldon Core: Blazing Fast, Industry-Ready ML An open source platform to deploy your machine learning models on Kubernetes at massive scale. Overview S

Seldon 3.5k Jan 01, 2023
Repository for DCA0305, an undergraduate course about Machine Learning Workflows and Pipelines

Federal University of Rio Grande do Norte Technology Center Department of Computer Engineering and Automation Machine Learning Based Systems Design Re

Ivanovitch Silva 81 Oct 18, 2022
A collection of machine learning examples and tutorials.

machine_learning_examples A collection of machine learning examples and tutorials.

LazyProgrammer.me 7.1k Jan 01, 2023
Factorization machines in python

Factorization Machines in Python This is a python implementation of Factorization Machines [1]. This uses stochastic gradient descent with adaptive re

Corey Lynch 892 Jan 03, 2023
TensorFlow implementation of an arbitrary order Factorization Machine

This is a TensorFlow implementation of an arbitrary order (=2) Factorization Machine based on paper Factorization Machines with libFM. It supports: d

Mikhail Trofimov 785 Dec 21, 2022
A simple machine learning python sign language detection project.

SST Coursework 2022 About the app A python application that utilises the tensorflow object detection algorithm to achieve automatic detection of ameri

Xavier Koh 2 Jun 30, 2022
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 04, 2023
Machine-Learning with python (jupyter)

Machine-Learning with python (jupyter) 머신러닝 야학 작심 10일과 쥬피터 노트북 기반 데이터 사이언스 시작 들어가기전 https://nbviewer.org/ 페이지를 통해서 쥬피터 노트북 내용을 볼 수 있다. 위 페이지에서 현재 레포 기

HyeonWoo Jeong 1 Jan 23, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learn

Vowpal Wabbit 8.1k Dec 30, 2022
A simple and lightweight genetic algorithm for optimization of any machine learning model

geneticml This package contains a simple and lightweight genetic algorithm for optimization of any machine learning model. Installation Use pip to ins

Allan Barcelos 8 Aug 10, 2022