tree-SNE

t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology. Building on recent advances in speeding up t-SNE and obtaining finer-grained structure, we combine the two to create tree-SNE, a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings. We also introduce alpha-clustering, which recommends the optimal cluster assignment, without foreknowledge of the number of clusters, based off of the cluster stability across multiple scales. We demonstrate the effectiveness of tree-SNE and alpha-clustering on images of handwritten digits, mass cytometry (CyTOF) data from blood cells, and single-cell RNA-sequencing (scRNA-seq) data from retinal cells. Furthermore, to demonstrate the validity of the visualization, we use alpha-clustering to obtain unsupervised clustering results competitive with the state of the art on several image data sets.

ArXiv preprint: https://arxiv.org/abs/2002.05687

Prerequisites

Install FIt-SNE from https://github.com/KlugerLab/FIt-SNE and add the FIt-SNE directory that you cloned to your PYTHONPATH environment variable. This lets tree-SNE access the Python file used to interface with FIt-SNE. This can be done in one of several ways:

  • run export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE in your terminal before running your Python script using tree-SNE
  • add export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE to your .bash_profile
  • add the line import sys; sys.path.append('/path/to/FIt-SNE/') to your Python script before calling import tree_sne
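The third option can be wrapped in a small helper so that running the same script or notebook cell repeatedly does not append the directory more than once. This is only a sketch: the function name add_to_python_path is made up for illustration, and /path/to/FIt-SNE is the same placeholder used above, not a real location.

```python
import os
import sys


def add_to_python_path(directory):
    """Append `directory` to sys.path unless it is already listed.

    Returns the normalized absolute path that was checked or added.
    (Hypothetical helper, not part of tree-SNE or FIt-SNE.)
    """
    absolute = os.path.abspath(os.path.expanduser(directory))
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute


# Placeholder: replace with the directory you cloned FIt-SNE into.
fitsne_dir = add_to_python_path("/path/to/FIt-SNE")
```

Call the helper before import tree_sne so that the FIt-SNE interface file can be found.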

Also make sure NumPy, SciPy, scikit-learn, and Matplotlib are installed.

We've tested with Python 3.6+.

Test/Example

Run example.py to make sure everything is set up right. This will run tree-SNE on the USPS handwritten digit dataset, run alpha-clustering, calculate the NMI, and display the tree. You can refer to this file for calling conventions. Note the top line adding FIt-SNE to the Python path.
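For readers unfamiliar with NMI (normalized mutual information), the sketch below computes it between two label assignments using only the standard library. It is not taken from example.py, whose exact computation is not shown here; the arithmetic-mean normalization is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_a, labels_b):
    """NMI = I(A; B) / mean(H(A), H(B)); 1.0 means identical partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information: sum over co-occurring label pairs of
    # p(a, b) * log(p(a, b) / (p(a) * p(b))).
    mutual_info = 0.0
    for (a, b), c in joint.items():
        mutual_info += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))

    normalizer = (entropy(labels_a) + entropy(labels_b)) / 2
    if normalizer == 0:
        # Both assignments put every point in a single cluster.
        return 1.0
    return mutual_info / normalizer


# NMI ignores label names: swapping 0 and 1 still gives a perfect score.
same_partition = normalized_mutual_info([0, 0, 1, 1], [1, 1, 0, 0])
# Two statistically independent partitions share no information.
independent = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on how points are grouped, it can compare alpha-clustering output against ground-truth labels without first matching cluster IDs.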

Sample Usage

Assume your data is stored in a 2D NumPy array X. To build a tree-SNE plot with 30 layers, cluster on each layer, and determine the optimal clustering via alpha-clustering (which does not require prior knowledge of the number of clusters):

from tree_sne import TreeSNE

tree = TreeSNE()
embeddings, layer_clusters, best_clusters = tree.fit(X, n_layers = 30)

The embeddings variable will contain each data point's embedding in each layer, with embeddings.shape of (n_points, n_layers, n_features). For now, n_features will always be 1, as we haven't yet implemented stacked 2D t-SNE embeddings. The variable layer_clusters will contain cluster assignments for each point in each layer of the embedding, and best_clusters will contain optimal cluster assignments for the data.
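To make the documented (n_points, n_layers, n_features) layout concrete, here is a small indexing sketch. It uses a random stand-in array rather than real tree-SNE output, so the values are meaningless; only the shapes matter. The variable names coords, first_layer, and point_path are illustrative.

```python
import numpy as np

# Stand-in for the array returned by tree.fit: random values used only to
# illustrate the documented (n_points, n_layers, n_features) layout.
n_points, n_layers = 100, 30
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(n_points, n_layers, 1))

# n_features is currently always 1, so the last axis can be dropped to get
# one column of 1D coordinates per layer.
coords = embeddings[:, :, 0]   # shape (n_points, n_layers)

# 1D positions of every point in a single layer (here the first one).
first_layer = coords[:, 0]     # shape (n_points,)

# Positions of a single point (index 5) across all layers.
point_path = coords[5, :]      # shape (n_layers,)
```

The same slicing applies to the embeddings returned by tree.fit if you want to plot or post-process individual layers yourself.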

To display the tree using our code with cluster labels, run:

from display_tree import display_tree_mnist
import numpy as np

display_tree_mnist(embeddings, true_labels = best_clusters, legend_labels = list(np.unique(best_clusters)), distinct = True)

Alternatively, you can pass your own labels in place of best_clusters. We realize this is messy, but until we refactor, this is what we have. We're sorry. You don't have to use our display code if you don't want to, and we'll improve it soon.

If your data has many clusters, reduce the conservativeness parameter of TreeSNE. Typical values range from 1 to 2. According to the theoretical motivation for its implementation, it should never drop below 1; we've only had to decrease it when trying to find 100 clusters, in which case we set it to 1.3. n_layers and conservativeness are the only two parameters that we think users may want to adjust, at least for the time being. Once we've refactored, we'll write more documentation. Note that conservativeness only affects alpha-clustering and does not change the tree-SNE embedding itself.

MNIST tree-SNE example plot

Authors

Acknowledgments

The authors thank Stefan Steinerberger for inspiration, support, and advice; George Linderman for enabling one-dimensional t-SNE with degrees of freedom < 1 in the FIt-SNE package; Scott Gigante for data pre-processing and helpful discussions of visualizations and alpha-clustering; Smita Krishnaswamy for encouragement and feedback; and Ariel Jaffe for discussing the Nyström method and its relationship to subsampled spectral clustering.

Owner
Isaac Robinson
Yale computer science and math major interested in entrepreneurship