Clustergram - Visualization and diagnostics for cluster analysis in Python

Last update: Dec 26, 2022

Clustergram

Visualization and diagnostics for cluster analysis

Clustergram is a diagram proposed by Matthias Schonlau in his paper "The clustergram: a graph for visualizing hierarchical and nonhierarchical cluster analyses".

In hierarchical cluster analysis, dendrograms are used to visualize how clusters are formed. I propose an alternative graph called a “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

The clustergram was later implemented in R by Tal Galili, who also gives a thorough explanation of the concept.

This is a Python translation of Tal's script, written for the scikit-learn and RAPIDS cuML implementations of K-Means, Mini Batch K-Means and Gaussian Mixture Model (scikit-learn only) clustering, plus hierarchical/agglomerative clustering using SciPy. Alternatively, you can create a clustergram with the from_* constructors from the results of other clustering algorithms.

Getting started

You can install clustergram from conda or pip:

conda install clustergram -c conda-forge

pip install clustergram

Either way, you still need to install your selected backend (scikit-learn and SciPy, or cuML).

An example of a clustergram on the Palmer penguins dataset:
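Because the backends are optional, it can help to check which ones are importable before building a clustergram. The following is a minimal sketch using only the standard library; the `BACKEND_MODULES` mapping and the `available_backends` helper are illustrative names, not part of the clustergram API.

```python
from importlib.util import find_spec

# Importable module behind each backend name. This mapping is an
# illustrative assumption, not something clustergram exposes.
BACKEND_MODULES = {
    "sklearn": ["sklearn"],
    "scipy": ["scipy"],
    "cuML": ["cuml"],
}


def available_backends(modules=BACKEND_MODULES):
    """Return the backend names whose required modules can be found."""
    return [
        name
        for name, required in modules.items()
        if all(find_spec(module) is not None for module in required)
    ]


print(available_backends())
```

An empty list means that none of the backends is installed yet.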

import seaborn
df = seaborn.load_dataset('penguins')

First we have to select numerical data and scale them.

from sklearn.preprocessing import scale
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

And then we can simply pass the data to clustergram.

from clustergram import Clustergram

cgram = Clustergram(range(1, 8))
cgram.fit(data)
cgram.plot()

Styling

Clustergram.plot() returns a matplotlib axis, so the figure can be customised like any other matplotlib plot.

import matplotlib.pyplot as plt

seaborn.set(style='whitegrid')

# Create the axis that is passed to plot(); the figure size is set here.
fig, ax = plt.subplots(figsize=(12, 8))

cgram.plot(
    ax=ax,
    size=0.5,
    linewidth=0.5,
    cluster_style={"color": "lightblue", "edgecolor": "black"},
    line_style={"color": "red", "linestyle": "-."},
)

Mean options

On the y axis, a clustergram can use mean values as in the original paper by Matthias Schonlau or PCA weighted mean values as in the implementation by Tal Galili.

cgram = Clustergram(range(1, 8))
cgram.fit(data)
cgram.plot(figsize=(12, 8), pca_weighted=True)

cgram = Clustergram(range(1, 8))
cgram.fit(data)
cgram.plot(figsize=(12, 8), pca_weighted=False)
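To make the difference between the two options concrete, here is a minimal NumPy sketch of how the y value of each cluster could be derived for a single k. It assumes that the plain option averages each cluster centre across features and that the weighted option projects each centre onto the first principal component of the data; the random data and labels are hypothetical, and the exact computation inside clustergram may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 4))   # hypothetical scaled data
labels = np.arange(30) % 3        # hypothetical labels for k=3

# Cluster centres as per-cluster feature means, shape (3, 4).
centers = np.vstack([data[labels == c].mean(axis=0) for c in range(3)])

# Plain option: one value per cluster, the mean of its centre's features.
plain = centers.mean(axis=1)

# Weighted option: project each centre onto the first principal component,
# obtained here from an SVD of the mean-centred data.
centred = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
first_component = vt[0]
weighted = centers @ first_component

print(plain)
print(weighted)
```

The weighted values spread clusters along the direction of greatest variance, which is why the two plots can look quite different for the same clustering.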

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three backends for the computation: scikit-learn and SciPy, which use the CPU, and RAPIDS.AI cuML, which uses the GPU. All of them are optional dependencies, but you need at least one to generate a clustergram.

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend='sklearn')
cgram.fit(data)
cgram.plot()

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
cgram.plot()

data can be of any type supported by the selected backend (including cudf.DataFrame with the cuML backend).

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy's hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods only by the SciPy backend.

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()

Using Mini Batch K-Means, which can provide significant speedup over K-Means:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()

Using Ward's hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()

Manual input

Alternatively, you can create a clustergram using the from_data or from_centers methods, based on the results of other clustering algorithms.

Using Clustergram.from_data which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
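As a rough illustration of what "cluster centers as mean values" means, the centres for each k can be reproduced with a pandas groupby over the same data and labels. This is a sketch of the idea only, not the code clustergram runs internally.

```python
import numpy as np
import pandas as pd

data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

frame = pd.DataFrame(data)

# For each k, average the rows that share a label; rows of the result are
# ordered by label value.
centers = {
    k: frame.groupby(labels[k].to_numpy()).mean().to_numpy()
    for k in labels.columns
}

print(centers[2])
```

Swapping `.mean()` for `.median()` gives the median-based variant.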

Using Clustergram.from_centers based on explicit cluster centers:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)

To support PCA weighted plots you also need to pass data:

cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot()

Partial plot

Clustergram.plot() can also plot only a part of the diagram, if you want to focus on a limited range of k.

cgram = Clustergram(range(1, 20))
cgram.fit(data)
cgram.plot(figsize=(12, 8))

cgram.plot(k_range=range(3, 10), figsize=(12, 8))

Additional clustering performance evaluation

Clustergram includes handy wrappers around a selection of clustering performance metrics offered by scikit-learn. Data originally computed on the GPU are converted to numpy on the fly.

Silhouette score

Compute the mean Silhouette Coefficient of all samples. See scikit-learn documentation for details.

>>> cgram.silhouette_score()
2    0.531540
3    0.447219
4    0.400154
5    0.377720
6    0.372128
7    0.331575
Name: silhouette_score, dtype: float64

Once computed, the resulting Series is available as cgram.silhouette. Calling the original method will recompute the score.
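For intuition, a wrapper of this kind can be approximated by calling scikit-learn's silhouette_score once per column of a labels table. The sketch below uses hypothetical data and labels and is an assumption about the general shape of the computation, not the library's actual code. k=1 is left out because the silhouette coefficient needs at least two clusters, which matches the output above starting at 2.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical data: two well separated blobs of 20 points each.
data = np.vstack(
    [rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))]
)

# Hypothetical labels: k=2 follows the blobs, k=3 splits the second blob
# arbitrarily in half.
labels = pd.DataFrame(
    {
        2: [0] * 20 + [1] * 20,
        3: [0] * 20 + [1] * 10 + [2] * 10,
    }
)

# One score per tested k, indexed by k.
scores = pd.Series(
    {k: silhouette_score(data, labels[k]) for k in labels.columns},
    name="silhouette_score",
)

print(scores)
```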

Calinski and Harabasz score

Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See scikit-learn documentation for details.

>>> cgram.calinski_harabasz_score()
2    482.191469
3    441.677075
4    400.392131
5    411.175066
6    382.731416
7    352.447569
Name: calinski_harabasz_score, dtype: float64

Once computed, the resulting Series is available as cgram.calinski_harabasz. Calling the original method will recompute the score.

Davies-Bouldin score

Compute the Davies-Bouldin score. See scikit-learn documentation for details.

>>> cgram.davies_bouldin_score()
2    0.714064
3    0.943553
4    0.943320
5    0.973248
6    0.950910
7    1.074937
Name: davies_bouldin_score, dtype: float64

Once computed, the resulting Series is available as cgram.davies_bouldin. Calling the original method will recompute the score.
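The three metrics point in different directions: higher is better for the Silhouette and Calinski and Harabasz scores, lower is better for Davies-Bouldin. As a small sketch, the values printed in this section can be placed into pandas Series by hand and the preferred k read off each one; the variable names are illustrative.

```python
import pandas as pd

ks = [2, 3, 4, 5, 6, 7]

# Values copied from the example outputs in this section.
silhouette = pd.Series(
    [0.531540, 0.447219, 0.400154, 0.377720, 0.372128, 0.331575], index=ks
)
calinski_harabasz = pd.Series(
    [482.191469, 441.677075, 400.392131, 411.175066, 382.731416, 352.447569],
    index=ks,
)
davies_bouldin = pd.Series(
    [0.714064, 0.943553, 0.943320, 0.973248, 0.950910, 1.074937], index=ks
)

best_k = {
    "silhouette": int(silhouette.idxmax()),                # higher is better
    "calinski_harabasz": int(calinski_harabasz.idxmax()),  # higher is better
    "davies_bouldin": int(davies_bouldin.idxmin()),        # lower is better
}

print(best_k)
```

For these example values all three metrics favour k=2. Such scores are best read alongside the clustergram itself rather than as a replacement for it.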

Accessing labels

Clustergram stores resulting labels for each of the tested options, which can be accessed as:

>>> cgram.labels
     1  2  3  4  5  6  7
0    0  0  2  2  3  2  1
1    0  0  2  2  3  2  1
2    0  0  2  2  3  2  1
3    0  0  2  2  3  2  1
4    0  0  2  2  0  0  3
..  .. .. .. .. .. .. ..
337  0  1  1  3  2  5  0
338  0  1  1  3  2  5  0
339  0  1  1  1  1  1  4
340  0  1  1  3  2  5  5
341  0  1  1  1  1  1  5
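Since the labels are a regular pandas DataFrame with one column per k, the usual pandas tools apply. The sketch below uses a small hypothetical labels table (not the penguins result above) to count cluster sizes for one k and to cross-tabulate how observations move between k=2 and k=3, which is the same information a clustergram draws as lines.

```python
import pandas as pd

# Hypothetical labels table with the same layout as cgram.labels.
labels = pd.DataFrame(
    {
        1: [0, 0, 0, 0, 0, 0],
        2: [0, 0, 0, 1, 1, 1],
        3: [2, 2, 0, 1, 1, 1],
    }
)

# Number of observations in each cluster for k=3.
sizes_k3 = labels[3].value_counts().sort_index()

# Rows are k=2 clusters, columns are k=3 clusters, cells are counts.
transition = pd.crosstab(labels[2], labels[3])

print(sizes_k3)
print(transition)
```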

Saving clustergram

You can save both the plot and the clustergram.Clustergram object to disk.

Saving plot

Clustergram.plot() returns a matplotlib axis object, so the plot can be saved like any other:

import matplotlib.pyplot as plt

cgram.plot()
plt.savefig('clustergram.svg')

Saving object

If you want to save your computed clustergram.Clustergram object to disk, you can use the pickle library:

import pickle

with open('clustergram.pickle','wb') as f:
    pickle.dump(cgram, f)

Then loading is equally simple:

with open('clustergram.pickle','rb') as f:
    loaded = pickle.load(f)

References

Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical cluster analyses. The Stata Journal, 2002; 2(4):391-402.

Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with Clustergrams. Computational Statistics, 2004; 19(1):95-111.

https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/

Comments

ENH: support interactive bokeh plots
Adds a Clustergram.bokeh() method, which generates the clustergram as an interactive bokeh plot. On top of the ability to zoom to specific sections, it shows the count of observations and the cluster label (linked to Clustergram.labels).

To-do:

[ ] documentation

[x] check RAPIDS compatibility

I think I'll need to split docs into multiple pages at this point.
opened by martinfleis 1
ENH: from_data and from_centers methods

Adding the ability to create a clustergram using custom data, without the need to run any cluster algorithm within clustergram itself.

from_data gets labels and data and creates cluster centers as mean or median values.

from_centers utilises custom centers when mean/median is not the optimal solution (like in case of GMM for example).

Closes #10

opened by martinfleis 1
skip k=1 for K-Means

k=1 does not need to be modelled; the cluster centre is a pure mean of the input array. All the other options require k=1, e.g. to fit a Gaussian.

Skip k=1 in all k-means implementations to avoid unnecessary computation.

opened by martinfleis 0
ENH: add bokeh plotting backend

With some larger clustergrams it may be quite useful to have the ability to zoom to certain places interactively. I think that bokeh plotting backend would be good for that.

opened by martinfleis 0
ENH: expose labels, refactor plot computation internals, add additional metrics

Closes #7

This refactors internals a bit, which in turn allows exposing the actual clustering labels for each tested iteration.

Also adding a few additional methods to assess clustering performance on top of clustergram.

opened by martinfleis 0
Support multiple PCAs

The current way of weighting by PCA is hard-coded to use the first one. But it could be useful to see clustergrams weighted by other PCAs as well.

And it would be super cool to get a 3d version with the first component on one axis and a second one on the other (not sure how useful though :D).

opened by martinfleis 0
Can this work with cluster made by top2vec ?

Thanks for your interesting package.

Do you think Clustergram could work with top2vec ? https://github.com/ddangelov/Top2Vec

I saw that there is the option to create a clustergram from a DataFrame.

In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.

So I could indeed generate a data frame, like this:

| x0 | x1 | ... | x255 | topic |
| --- | --- | --- | --- | --- |
| 0.5 | 0.2 | ... | -0.2 | 2 |
| 0.7 | 0.2 | ... | -0.1 | 2 |
| 0.5 | 0.2 | ... | -0.2 | 3 |

Does Clustergram assume anything about the rows of this data frame? I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.

With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?

top2vec calculates as well the "topic vectors" as a mean of the "document vectors", I believe.

opened by behrica 17

Releases(v0.6.0)

v0.6.0(Nov 12, 2021)
Enhancements:

ENH: optionally measure BIC during GMM (#21)

Bug fixes:

BUG: cuML non-weighted plot fix (#25)

Full Changelog: https://github.com/martinfleis/clustergram/compare/v0.5.1...v0.6.0
v0.5.1(May 24, 2021)
Bugfix for from_data method with non-default indices.

Bugs:

BUG: cluster centers empty due to index mismatch (#19)

v0.5.0(May 11, 2021)
Clustergram now supports interactive plotting using a new .bokeh() method based on BokehJS. It can be handy for the exploration of larger and more complex clustergrams or those with significant outliers.

Enhancements:

ENH: support interactive bokeh plots (#14)

ENH: skip k=1 in K-Means implementations (#18)

documentation restructuring

v0.4.0(Apr 27, 2021)
Spring comes with native hierarchical clustering and the ability to create a clustergram from manual input.

Enhancements:

ENH: support hierarchical clustering using scipy (#11)

ENH: from_data and from_centers methods (#12)

v0.3.0(Jan 31, 2021)
API changes:

pca_weighted is now a keyword of Clustergram.plot(), not of the initializer.

Enhancements:

Support MiniBatchKMeans (sklearn)

Custom __repr__

Expose cluster labels obtained during the loop

Expose cluster centers

Silhouette score

Calinski and Harabasz score

Davies-Bouldin score

v0.2.1(Dec 23, 2020)

Minor release updating documentation to acknowledge that clustergram can be installed from conda-forge.
v0.2.0(Dec 21, 2020)
Version 0.2.0 brings support for Gaussian Mixture Models (using scikit-learn) and a few minor changes.

Enhancements:

Gaussian Mixture Model support (#4)

Verbosity - Clustergram now indicates the progress

Additional arguments can be passed to the PCA object

Bug fixes:

BUG: avoid LinAlgError: singular matrix

v0.1.2(Oct 14, 2020)

The first official release of clustergram.

See https://clustergram.readthedocs.io/.

Owner

Martin Fleischmann

Researcher in geographic data science. Member of @geopandas and @pysal development teams.

GitHub Repository https://clustergram.readthedocs.io

