[v1 (ISBI'21) + v2] MedMNIST: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification

Overview

MedMNIST

Project (Website) | Dataset (Zenodo) | Paper (arXiv) | MedMNIST v1 (ISBI'21)

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni

We introduce MedMNIST v2, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required of users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed for classification on lightweight 2D and 3D images with various dataset scales (from 100 to 100,000 samples) and diverse tasks (binary/multi-class, ordinal regression, and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, can support numerous research and educational purposes in biomedical image analysis, computer vision, and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D / 3D neural networks and open-source / commercial AutoML tools.

MedMNISTv2_overview

For more details, please refer to our paper:

MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification (arXiv)

Key Features

  • Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000 samples), and tasks (binary/multi-class, multi-label, and ordinal regression). Like VDD and MSD, it is diverse enough to fairly evaluate how well machine learning algorithms generalize across different settings, while additionally providing both 2D and 3D biomedical images.
  • Standardized: Each sub-dataset is pre-processed into the same format, so no background knowledge is required of users. As an MNIST-like dataset collection for classification tasks on small images, it focuses primarily on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST v2, so algorithms can be compared easily.
  • Lightweight: The small size of 28×28 (2D) or 28×28×28 (3D) makes it cheap and fast to evaluate machine learning algorithms.
  • Educational: As an interdisciplinary research area, biomedical image analysis is difficult for researchers from other communities to get started with, as it requires background knowledge in computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under Creative Commons (CC) licenses, is easy to use for educational purposes.

Please note that this dataset is NOT intended for clinical use.

Code Structure

  • medmnist/:
    • dataset.py: PyTorch datasets and dataloaders of MedMNIST.
    • evaluator.py: Standardized evaluation functions.
    • info.py: Dataset information dict for each subset of MedMNIST.
  • examples/:
    • getting_started.ipynb: Explore the MedMNIST dataset with a Jupyter notebook. It is ONLY intended for quick exploration, i.e., it does not provide full training and evaluation functionality.
    • getting_started_without_PyTorch.ipynb: This notebook provides snippets showing how to use the MedMNIST data (the .npz files) without PyTorch.
  • setup.py: To install medmnist as a module.
  • [EXTERNAL] MedMNIST/experiments: training and evaluation scripts to reproduce both 2D and 3D experiments in our paper, including PyTorch, auto-sklearn, AutoKeras and Google AutoML Vision together with their weights ;)

Installation and Requirements

Set up the required environment and install medmnist as a standard Python package:

pip install --upgrade git+https://github.com/MedMNIST/MedMNIST.git

Check whether you have installed the latest version:

>>> import medmnist
>>> print(medmnist.__version__)

The code requires only a common Python environment for machine learning. It was tested with:

  • Python 3 (Anaconda 3.6.3 specifically)
  • PyTorch==1.3.1
  • numpy==1.18.5, pandas==0.25.3, scikit-learn==0.22.2, Pillow==8.0.1, fire

Higher (or lower) versions should also work (perhaps with minor modifications).

If you use PyTorch

  • Great! Our code is designed to work with PyTorch.

  • Explore the MedMNIST dataset with the Jupyter notebook (getting_started.ipynb) and train basic neural networks in PyTorch; a minimal loading sketch follows this list.
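
For a quick start, the snippet below is a minimal sketch of loading a 2D subset through the medmnist package and a standard PyTorch DataLoader. The subset flag, transform, and batch size are illustrative choices, not requirements; see getting_started.ipynb for the full walkthrough.

    # A minimal loading sketch (illustrative flag and hyper-parameters).
    import medmnist
    from medmnist import INFO
    import torch.utils.data as data
    import torchvision.transforms as transforms

    data_flag = 'pathmnist'  # any flag listed by `python -m medmnist available`
    info = INFO[data_flag]
    DataClass = getattr(medmnist, info['python_class'])

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[.5], std=[.5]),
    ])

    # download=True fetches the corresponding .npz from Zenodo automatically
    train_dataset = DataClass(split='train', transform=transform, download=True)
    train_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)

    images, labels = next(iter(train_loader))
    print(images.shape, labels.shape)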

If you do not use PyTorch

  • Although our code is tested with PyTorch, you are free to parse the data files with your own code (without PyTorch, or even without Python!), as they are only standard NumPy serialization (.npz) files. It is simple to create a dataset without PyTorch.
  • Go to getting_started_without_PyTorch.ipynb, which provides snippets showing how to use the MedMNIST data (the .npz files) without PyTorch.
  • Simply change the super class of MedMNIST from torch.utils.data.Dataset to collections.Sequence, and you will get a standard dataset without PyTorch. Check dataset_without_pytorch.py for more details; a minimal sketch follows this list.
  • You still get most of the functionality of our MedMNIST code ;)
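
To make the last two bullets concrete, here is a minimal sketch of such a PyTorch-free dataset, using collections.abc.Sequence (the current location of Sequence) and reading a subset's .npz file directly. The class name and file name are illustrative; dataset_without_pytorch.py is the reference implementation.

    # A minimal PyTorch-free dataset sketch (illustrative names).
    from collections.abc import Sequence

    import numpy as np
    from PIL import Image


    class SimpleMedMNIST(Sequence):
        def __init__(self, npz_path, split="train"):
            npz = np.load(npz_path)
            self.images = npz[f"{split}_images"]  # N x 28 x 28 (x 3 for RGB subsets)
            self.labels = npz[f"{split}_labels"]  # N x L

        def __len__(self):
            return self.images.shape[0]

        def __getitem__(self, index):
            img, target = self.images[index], self.labels[index]
            return Image.fromarray(img), target


    # train_set = SimpleMedMNIST("pathmnist.npz", split="train")
    # img, target = train_set[0]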

Dataset

Please download the dataset(s) via Zenodo. You can also let our code download them automatically by setting download=True in dataset.py.

The MedMNIST dataset contains several subsets. Each subset (e.g., pathmnist.npz) comprises six keys: train_images, train_labels, val_images, val_labels, test_images, and test_labels.

  • train_images / val_images / test_images: N × 28 × 28 for 2D gray-scale datasets, N × 28 × 28 × 3 for 2D RGB datasets, N × 28 × 28 × 28 for 3D datasets. N denotes the number of samples.
  • train_labels / val_labels / test_labels: N × L. N denotes the number of samples and L the number of task labels. For single-label (binary/multi-class) classification, L=1 and the labels take values in {0, 1, ..., C} (C=1 for binary); for multi-label classification, L>1, e.g., L=14 for chestmnist.npz.
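
As a quick sanity check of this layout, the .npz files can be inspected directly with NumPy; the file name and the commented shapes below are illustrative (PathMNIST is a 2D RGB, single-label subset).

    # Inspect the six arrays stored in a MedMNIST .npz file (illustrative file name).
    import numpy as np

    npz = np.load("pathmnist.npz")
    print(npz.files)                  # the six keys listed above
    print(npz["train_images"].shape)  # (N, 28, 28, 3) for a 2D RGB subset
    print(npz["train_labels"].shape)  # (N, 1) for single-label classification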

Command Line Tools

  • List all available datasets:

      python -m medmnist available
    
  • Download all available datasets:

      python -m medmnist download
    
  • Delete all downloaded .npz files from the root:

      python -m medmnist clean
    
  • Print the dataset details given a subset flag:

      python -m medmnist info --flag=xxxmnist
    
  • Save the dataset as standard figure and CSV files, which can be used by AutoML tools, e.g., Google AutoML Vision:

      python -m medmnist save --flag=xxxmnist --folder=tmp/
    
  • Parse and evaluate a standard result file; refer to Evaluator.parse_and_evaluate for details:

      python -m medmnist evaluate --path=folder/{flag}_{split}@{run}.csv
    
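The same evaluation can also be run from Python. The following is a minimal sketch, assuming the PathMNIST test set (7,180 samples, 9 classes), random scores standing in for real model outputs, and that pathmnist.npz has already been downloaded:

    # A minimal evaluation sketch (illustrative flag, split, and scores).
    import numpy as np
    from medmnist import Evaluator

    evaluator = Evaluator('pathmnist', 'test')  # subset flag and split
    y_score = np.random.rand(7180, 9)           # (num_samples, num_classes) scores
    metrics = evaluator.evaluate(y_score)       # AUC and ACC
    print(metrics)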

Citation

If you find this project useful, please cite both the v1 and v2 papers:

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.

Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

or use the BibTeX entries:

@article{medmnistv2,
    title={MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification},
    author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
    journal={arXiv preprint arXiv:2110.14795},
    year={2021}
}
 
@inproceedings{medmnistv1,
    title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
    author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
    booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
    pages={191--195},
    year={2021}
}

Please also cite the paper of the corresponding source dataset if you use any subset of MedMNIST, as indicated on the project page.

LICENSE

The code is under Apache-2.0 License.

The datasets are under Creative Commons (CC) Licenses in general. Each subset keeps the same license as that of the source dataset.
