Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Last update: Jul 09, 2022

Overview

Fast Laplacian Eigenmaps in python

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMSlib to compute approximate-nearest-neighbors. Performs several times faster than the default scikit-learn implementation.

Installation

You'll need NMSlib for using this package properly. Installing it with no binaries is recommended if your CPU supports advanced instructions (it problably does):

pip3 install --no-binary :all: nmslib
# Along with requirements:
pip3 install numpy pandas scipy scikit-learn

Then you can install this package with pip:

pip3 install fastlapmap

Usage

See the following example with the handwritten digits data. Here, I visually compare results from the scikit-learn Laplacian Eigenmaps implementation to those from my implementation.

Note that this implementation contains two similarity-learning algorithms: anisotropic diffusion maps and fuzzy simplicial sets.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from fastlapmap import LapEigenmap

# Load some data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS=2
N_NEIGHBORS=10
N_JOBS=10

sk_se = SpectralEmbedding(n_components=N_EIGS, n_neighbors=N_NEIGHBORS, n_jobs=N_JOBS).fit_transform(data)

flapmap_diff = LapEigenmap(data, n_eigs=2, similarity='diffusion', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)
flapmap_fuzzy = LapEigenmap(data, n_eigs=2, similarity='fuzzy', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle('Handwritten digits data:', fontsize=24)
ax1.scatter(sk_se[:, 0], sk_se[:, 1], c=digits.target, cmap='Spectral', s=5)
ax1.set_title('Sklearn\'s Laplacian Eigenmaps', fontsize=20)
ax2.scatter(flapmap_diff[:, 0], flapmap_diff[:, 1], c=digits.target, cmap='Spectral', s=5)
ax2.set_title('Fast Laplacian Eigenmaps with diffusion harmonics', fontsize=20)
ax3.scatter(flapmap_fuzzy[:, 0], flapmap_fuzzy[:, 1], c=digits.target, cmap='Spectral', s=5)
ax3.set_title('Fast Laplacian Eigenmaps with fuzzy simplicial sets', fontsize=20)
plt.show()

As we can see, results are nearly identical.

Benchmark

See the runtime comparison between this implementation and scikit-learn:

## Load benchmark function:
from fastlapmap.benchmark import runtime_benchmark

# Load data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS = 2
N_NEIGHBORS = 10
N_JOBS = 10
SIZES = [1000, 5000, 10000, 25000, 50000, 100000]
N_RUNS = 3

runtime_benchmark(data,
                  n_eigs=N_EIGS,
                  n_neighbors=N_NEIGHBORS,
                  n_jobs=N_JOBS,
                  sizes=SIZES,
                  n_runs=N_RUNS)

As you can see, the diffusion harmoics model is the fastest, followed closely by fuzzy simplicial sets. Both outperform scikit-learn default implementation and escalate linearly with sample size.

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Related tags

Overview

Fast Laplacian Eigenmaps in python

Installation

Usage

Benchmark

Owner

NFCDS Workshop Beginners Guide Bioinformatics Data Analysis

WAL enables programmable waveform analysis.

small package with utility functions for analyzing (fly) calcium imaging data

Bamboolib - a GUI for pandas DataFrames

Template for a Dataflow Flex Template in Python

Conduits - A Declarative Pipelining Tool For Pandas

This mini project showcase how to build and debug Apache Spark application using Python

Analysiscsv.py for extracting analysis and exporting as CSV

Powerful, efficient particle trajectory analysis in scientific Python.

The official pytorch implementation of ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

An easy-to-use feature store

EOD Historical Data Python Library (Unofficial)

PyIOmica (pyiomica) is a Python package for omics analyses.

pyETT: Python library for Eleven VR Table Tennis data

PyTorch implementation for NCL (Neighborhood-enrighed Contrastive Learning)

Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

Pyspark project that able to do joins on the spark data frames.

Functional tensors for probabilistic programming

PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)

This creates a ohlc timeseries from downloaded CSV files from NSE India website and makes a SQLite database for your research.