Clustering is a popular approach to detect patterns in unlabeled data

Last update: Nov 11, 2022

Related tags

Overview

Visual Clustering

Clustering is a popular approach to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a dataset as points in a metric space and compute distances to group together similar points. Visual Clustering a different way of clustering points in 2-dimensional space, inspired by how humans "visually" cluster data. The algorithm is based on trained neural networks that perform instance segmentation on plotted data.

For more details, see the accompanying paper: "Clustering Plotted Data by Image Segmentation", arXiv preprint, and please use the citation below.

@article{naous2021clustering,
  title={Clustering Plotted Data by Image Segmentation},
  author={Naous, Tarek and Sarkar, Srinjay and Abid, Abubakar and Zou, James},
  journal={arXiv preprint arXiv:2110.05187},
  year={2021}
}

Installation

pip install visual-clustering

Usage

The algorithm can be used the same way as the classical clustering algorithms in scikit-learn:
You first import the class VisualClustering and create an instance of it.

from visual_clustering import VisualClustering

model = VisualClustering(median_filter_size = 1, max_filter_size= 1)

The parameters median_filter_size and max_filter_size are set to 1 by default.
You can experiment with different values to see what works best for your dataset !

Let's create a simple synthetic dataset of blobs.

from sklearn import datasets

data = datasets.make_blobs(n_samples=50000, centers=6, random_state=23,center_box=(-30, 30))
plt.scatter(data[0][:, 0], data[0][:, 1], s=1, c='black')

To cluster the dataset, use the fit function of the model:

predictions = model.fit(data[0])

Visualizing the results

You can visualize the results using matplotlib as you would normally do with classical clustering algorithms:

import matplotlib.pyplot as plt
from itertools import cycle, islice
import numpy as np

colors = np.array(list(islice(cycle(["#000000", '#377eb8', '#ff7f00', '#4daf4a', '#f781bf', '#a65628', '#984ea3']), int(max(predictions) + 1))))
#Black color for outliers (if any)
colors = np.append(colors, ["#000000"])
plt.scatter(data[0][:, 0], data[0][:, 1], s=10, color=colors[predictions.astype('int8')])

Run this code inside a colab notebook:
https://colab.research.google.com/drive/1DcZXhKnUpz1GDoGaJmpS6VVNXVuaRmE5?usp=sharing

Dependencies

Make sure that you have the following libraries installed:

transformers 4.15.0
scipy 1.4.1
tensorflow 2.7.0
keras 2.7.0
numpy 1.19.5
cv2 4.1.2
skimage 0.18.3

Contact

Clustering is a popular approach to detect patterns in unlabeled data

Related tags

Overview

Visual Clustering

Installation

Usage

Visualizing the results

Dependencies

Contact

Owner

Tarek Naous

Official code for "Mean Shift for Self-Supervised Learning"

Multimodal Co-Attention Transformer (MCAT) for Survival Prediction in Gigapixel Whole Slide Images

Faster RCNN pytorch windows

A setup script to generate ITK Python Wheels

We will release the code of "ConTNet: Why not use convolution and transformer at the same time?" in this repo

A Tensorflow implementation of BicycleGAN.

An efficient and easy-to-use deep learning model compression framework

Leveraging OpenAI's Codex to solve cornerstone problems in Music

Acute ischemic stroke dataset

A LiDAR point cloud cluster for panoptic segmentation

网络协议2天集训

A tight inclusion function for continuous collision detection

PyTorch implementation of our paper: Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition

MTCNN face detection implementation for TensorFlow, as a PIP package.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

Python script for performing depth completion from sparse depth and rgb images using the msg_chn_wacv20. model in ONNX

TCube generates rich and fluent narratives that describes the characteristics, trends, and anomalies of any time-series data (domain-agnostic) using the transfer learning capabilities of PLMs.

A Python package to process & model ChEMBL data.

A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

Official Pytorch implementation of "Unbiased Classification Through Bias-Contrastive and Bias-Balanced Learning (NeurIPS 2021)