Data Augmentation with Variational Autoencoders

Overview



Documentation 	Status Downloads 	Status

Documentation

Pyraug

This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging contexts such as high dimensional and low sample size data.

Installation

To install the library from pypi.org run the following using pip

$ pip install pyraug

or alternatively you can clone the github repo to access to tests, tutorials and scripts.

$ git clone https://github.com/clementchadebec/pyraug.git

and install the library

$ cd pyraug
$ pip install .

Augmenting your Data

In Pyraug, a typical augmentation process is divided into 2 distinct parts:

  1. Train a model using the Pyraug's TrainingPipeline or using the provided scripts/training.py script
  2. Generate new data from a trained model using Pyraug's GenerationPipeline or using the provided scripts/generation.py script

There exist two ways to augment your data pretty straightforwardly using Pyraug's built-in functions.

Using Pyraug's Pipelines

Pyraug provides two pipelines that may be used to either train a model on your own data or generate new data with a pretrained model.

note: These pipelines are independent of the choice of the model and sampler. Hence, they can be used even if you want to access to more advanced features such as defining your own autoencoding architecture.

Launching a model training

To launch a model training, you only need to call a TrainingPipeline instance. In its most basic version the TrainingPipeline can be built without any arguments. This will by default train a RHVAE model with default autoencoding architecture and parameters.

>>> from pyraug.pipelines import TrainingPipeline
>>> pipeline = TrainingPipeline()
>>> pipeline(train_data=dataset_to_augment)

where dataset_to_augment is either a numpy.ndarray, torch.Tensor or a path to a folder where each file is a data (handled data formats are .pt, .nii, .nii.gz, .bmp, .jpg, .jpeg, .png).

More generally, you can instantiate your own model and train it with the TrainingPipeline. For instance, if you want to instantiate a basic RHVAE run:

>>> from pyraug.models import RHVAE
>>> from pyraug.models.rhvae import RHVAEConfig
>>> model_config = RHVAEConfig(
...    input_dim=int(intput_dim)
... ) # input_dim is the shape of a flatten input data
...   # needed if you did not provide your own architectures
>>> model = RHVAE(model_config)

In case you instantiate yourself a model as shown above and you did not provide all the network architectures (encoder, decoder & metric if applicable), the ModelConfig instance will expect you to provide the input dimension of your data which equals to n_channels x height x width x .... Pyraug's VAE models' networks indeed default to Multi Layer Perceptron neural networks which automatically adapt to the input data shape.

note: In case you have different size of data, Pyraug will reshape it to the minimum size min_n_channels x min_height x min_width x ...

Then the TrainingPipeline can be launched by running:

>>> from pyraug.pipelines import TrainingPipeline
>>> pipe = TrainingPipeline(model=model)
>>> pipe(train_data=dataset_to_augment)

At the end of training, the model weights models.pt and model config model_config.json file will be saved in a folder outputs/my_model/training_YYYY-MM-DD_hh-mm-ss/final_model.

Important: For high dimensional data we advice you to provide you own network architectures and potentially adapt the training and model parameters see documentation for more details.

Launching data generation

To launch the data generation process from a trained model, run the following.

>>> from pyraug.pipelines import GenerationPipeline
>>> from pyraug.models import RHVAE
>>> model = RHVAE.load_from_folder('path/to/your/trained/model') # reload the model
>>> pipe = GenerationPipeline(model=model) # define pipeline
>>> pipe(samples_number=10) # This will generate 10 data points

The generated data is in .pt files in dummy_output_dir/generation_YYYY-MM-DD_hh-mm-ss. By default, it stores batch data of a maximum of 500 samples.

Retrieve generated data

Generated data can then be loaded pretty easily by running

>>> import torch
>>> data = torch.load('path/to/generated_data.pt')

Using the provided scripts

Pyraug provides two scripts allowing you to augment your data directly with commandlines.

note: To access to the predefined scripts you should first clone the Pyraug's repository. The following scripts are located in scripts folder. For the time being, only RHVAE model training and generation is handled by the provided scripts. Models will be added as they are implemented in pyraug.models

Launching a model training:

To launch a model training, run

$ python scripts/training.py --path_to_train_data "path/to/your/data/folder" 

The data must be located in path/to/your/data/folder where each input data is a file. Handled image types are .pt, .nii, .nii.gz, .bmp, .jpg, .jpeg, .png. Depending on the usage, other types will be progressively added.

At the end of training, the model weights models.pt and model config model_config.json file will be saved in a folder outputs/my_model_from_script/training_YYYY-MM-DD_hh-mm-ss/final_model.

Launching data generation

Then, to launch the data generation process from a trained model, you only need to run

$ python scripts/generation.py --num_samples 10 --path_to_model_folder 'path/to/your/trained/model/folder' 

The generated data is stored in several .pt files in outputs/my_generated_data_from_script/generation_YYYY-MM-DD_hh_mm_ss. By default, it stores batch data of 500 samples.

Important: In the simplest configuration, default configurations are used in the scripts. You can easily override as explained in documentation. See tutorials for a more in depth example.

Retrieve generated data

Generated data can then be loaded pretty easily by running

>>> import torch
>>> data = torch.load('path/to/generated_data.pt')

Getting your hands on the code

To help you to understand the way Pyraug works and how you can augment your data with this library we also provide tutorials that can be found in examples folder:

Dealing with issues

If you are experiencing any issues while running the code or request new features please open an issue on github

Citing

If you use this library please consider citing us:

@article{chadebec_data_2021,
	title = {Data {Augmentation} in {High} {Dimensional} {Low} {Sample} {Size} {Setting} {Using} a {Geometry}-{Based} {Variational} {Autoencoder}},
	copyright = {All rights reserved},
	journal = {arXiv preprint arXiv:2105.00026},
  	arxiv = {2105.00026},
	author = {Chadebec, Clément and Thibeau-Sutre, Elina and Burgos, Ninon and Allassonnière, Stéphanie},
	year = {2021}
}

Credits

Logo: SaulLu

You might also like...
Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners

Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners This repository is built upon BEiT, thanks very much! Now, we on

PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

MAE for Self-supervised ViT Introduction This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-sup

 An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners
An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners

An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners This is a coarse version for MAE, only make the pretrain model, the fine

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks without the use of any outside machine learning libraries - all from scratch.

Autoencoders pretraining using clustering

Autoencoders pretraining using clustering

Re-implememtation of MAE (Masked Autoencoders Are Scalable Vision Learners) using PyTorch.

mae-repo PyTorch re-implememtation of "masked autoencoders are scalable vision learners". In this repo, it heavily borrows codes from codebase https:/

ConvMAE: Masked Convolution Meets Masked Autoencoders
ConvMAE: Masked Convolution Meets Masked Autoencoders

ConvMAE ConvMAE: Masked Convolution Meets Masked Autoencoders Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1, 1 Shanghai AI Laboratory, 2 M

Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders
Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

MultiMAE: Multi-modal Multi-task Masked Autoencoders Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir Website | arXiv | BibTeX Official PyTo

This is the official Pytorch implementation of
This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et al. 2020

README This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et a

Comments
  • It takes a long time to train the model

    It takes a long time to train the model

    I am trying to train a RHVAE model for data augmentation and the model starts training but it takes a long time training and do not see any results. I do not know if is an error from my dataset, computer or from the library. Could you help me?

    opened by mikel-hernandezj 2
  • Geodesics computation

    Geodesics computation

    It would be great to have a function to compute geodesics, given a trained model and two points in the latent space.

    The goal would be to allow the exploration of the latent space via geodesics, as visualised in Figure 2 of (Chadebec et al., 2021):

    Screenshot 2021-09-28 at 10 06 34 enhancement 
    opened by Virgiliok 2
  • riemann_tools

    riemann_tools

    Hi,

    In on of your example notebooks (geodesic_computation_example), you import the function Geodesic_autodiff from the package riemann_tools. I cannot find any mention of this package however. Could you perhaps provide some documentation on how to install/import the riemann_tools? Thank you in advance!

    Edit: removing the import solved the problem

    opened by VivienvV 0
Releases(v0.0.6)
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators [Project Website] [Replicate.ai Project] StyleGAN-NADA: CLIP-Guided Domain Adaptation

992 Dec 30, 2022
Libraries, tools and tasks created and used at DeepMind Robotics.

Libraries, tools and tasks created and used at DeepMind Robotics.

DeepMind 270 Nov 30, 2022
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

OCTIS : Optimizing and Comparing Topic Models is Simple! OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and compa

MIND 478 Jan 01, 2023
Code for "ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on", accepted at WACV 2021 Generation of Human Behavior Workshop.

ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on [ Paper ] [ Project Page ] This repository contains the code fo

Andrew Jong 97 Dec 13, 2022
Image Restoration Using Swin Transformer for VapourSynth

SwinIR SwinIR function for VapourSynth, based on https://github.com/JingyunLiang/SwinIR. Dependencies NumPy PyTorch, preferably with CUDA. Note that t

Holy Wu 11 Jun 19, 2022
Official implementation of "Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets" (CVPR2021)

Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets This is the official implementation of "Towards Good Pract

Sanja Fidler's Lab 52 Nov 22, 2022
An open-source outlier detection package by Getcontact Data Team

pyfbad The pyfbad library supports anomaly detection projects. An end-to-end anomaly detection application can be written using the source codes of th

Teknasyon Tech 41 Dec 27, 2022
Human Pose Detection on EdgeTPU

Coral PoseNet Pose estimation refers to computer vision techniques that detect human figures in images and video, so that one could determine, for exa

google-coral 476 Dec 31, 2022
Using Convolutional Neural Networks (CNN) for Semantic Segmentation of Breast Cancer Lesions (BRCA)

Using Convolutional Neural Networks (CNN) for Semantic Segmentation of Breast Cancer Lesions (BRCA). Master's thesis documents. Bibliography, experiments and reports.

Erick Cobos 73 Dec 04, 2022
Restricted Boltzmann Machines in Python.

How to Use First, initialize an RBM with the desired number of visible and hidden units. rbm = RBM(num_visible = 6, num_hidden = 2) Next, train the m

Edwin Chen 928 Dec 30, 2022
Official and maintained implementation of the paper "OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data" [BMVC 2021].

OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data Christoph Reich, Tim Prangemeier, Özdemir Cetin & Heinz Koeppl | Pr

Christoph Reich 23 Sep 21, 2022
Magisk module to enable hidden features on Android 12 Developer Preview 1.

Android 12 Extensions This is a Magisk module that enables hidden features on Android 12 Developer Preview 1. Features Scrolling screenshots Wallpaper

Danny Lin 384 Jan 06, 2023
Fast Axiomatic Attribution for Neural Networks (NeurIPS*2021)

Fast Axiomatic Attribution for Neural Networks This is the official repository accompanying the NeurIPS 2021 paper: R. Hesse, S. Schaub-Meyer, and S.

Visual Inference Lab @TU Darmstadt 11 Nov 21, 2022
A self-supervised learning framework for audio-visual speech

AV-HuBERT (Audio-Visual Hidden Unit BERT) Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction Robust Self-Supervised A

Meta Research 431 Jan 07, 2023
Spectrum is an AI that uses machine learning to generate Rap song lyrics

Spectrum Spectrum is an AI that uses deep learning to generate rap song lyrics. View Demo Report Bug Request Feature Open In Colab About The Project S

39 Dec 16, 2022
This git repo contains the implementation of my ML project on Heart Disease Prediction

Introduction This git repo contains the implementation of my ML project on Heart Disease Prediction. This is a real-world machine learning model/proje

Aryan Dutta 1 Feb 02, 2022
This is a simple framework to make object detection dataset very quickly

FastAnnotation Table of contents General info Requirements Setup General info This is a simple framework to make object detection dataset very quickly

Serena Tetart 1 Jan 24, 2022
This project generates news headlines using a Long Short-Term Memory (LSTM) neural network.

News Headlines Generator bunnysaini/Generate-Headlines Goal This project aims to generate news headlines using a Long Short-Term Memory (LSTM) neural

Bunny Saini 1 Jan 24, 2022
An 16kHz implementation of HiFi-GAN for soft-vc.

HiFi-GAN An 16kHz implementation of HiFi-GAN for soft-vc. Relevant links: Official HiFi-GAN repo HiFi-GAN paper Soft-VC repo Soft-VC paper Example Usa

Benjamin van Niekerk 42 Dec 27, 2022