The official implementation of the paper, "SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning"

Overview

SubTab:

Author: Talip Ucar ([email protected])

The official implementation of the paper,

SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

PWC

Table of Contents:

  1. Model
  2. Environment
  3. Data
  4. Configuration
  5. Training and Evaluation
  6. Adding New Datasets
  7. Results
  8. Experiment tracking
  9. Citing the paper
  10. Citing this repo

Model

SubTab

Click for a slower version of the animation

SubTab

Environment

We used Python 3.7 for our experiments. The environment can be set up by following three steps:

pip install pipenv             # To install pipenv if you don't have it already
pipenv install --skip-lock     # To install required packages. 
pipenv shell                   # To activate virtual env

If the second step results in issues, you can install packages in Pipfile individually by using pip i.e. "pip install package_name".

Data

MNIST dataset is already provided to demo the framework. For your own dataset, follow the instructions in Adding New Datasets.

Configuration

There are two types of configuration files:

1. runtime.yaml
2. mnist.yaml
  1. runtime.yaml is a high-level configuration file used by all datasets to:

    • define the random seed
    • turn on/off mlflow (Default: False)
    • turn on/off python profiler (Default: False)
    • set data directory
    • set results directory
  2. Second configuration file is dataset-specific and is used to configure the architecture of the model, loss functions, and so on.

    • For example, we set up a configuration file for MNIST dataset with the same name. Please note that the name of the configuration file should be same as name of the dataset with all letters in lowercase.
    • We can have configuration files for other datasets such as tcga.yaml and income.yaml for tcga and income datasets respectively.

Training and Evaluation

You can train and evaluate the model by using:

python train.py # For training
python eval.py  # For evaluation
  • train.py will also run evaluation at the end of the training.
  • You can also run evaluation separately by using eval.py.

Adding New Datasets

For each new dataset, you can use the following steps:

  1. Provide a _load_dataset_name() function, similar to MNIST load function

    • For example, you can add _load_tcga() for tcga dataset, or _load_income() for income dataset.
    • The function should return (x_train, y_train, x_test, y_test)
  2. Add a separate elif condition in this section within _load_data() method of TabularDataset() class in utils/load_data.py

  3. Create a new config file with the same name as dataset name.

    • For example, tcga.yaml for tcga dataset, or income.yaml for income dataset.

    • You can also duplicate one of the existing configuration files (e.g. mnist.yaml), and re-name it.

    • Make sure that the new config file is under config/ directory.

  4. Provide data folder with pre-processed training and test set, and place it under ./data/ directory. You can also do train-test split and pre-processing within your custom _load_dataset_name() function.

  5. (Optional) If you want to place the new dataset under a different directory than the local "./data/", then:

    • Place the dataset folder anywhere, and define the root directory to it in this line of /config/runtime.yaml.

    • For example, if the path to tcga dataset is /home/.../data/tcga/, you only need to include /home/.../data/ in runtime.yaml. The code will fill in tcga folder name from the name given in the command line argument (e.g. -d dataset_name. In this case, dataset_name would be tcga).

Structure of the repo

- train.py
- eval.py

- src
    |-model.py
    
- config
    |-runtime.yaml
    |-mnist.yaml
    
- utils
    |-load_data.py
    |-arguments.py
    |-model_utils.py
    |-loss_functions.py
    ...
    
- data
    |-mnist
    ...
    
- results
    |
    ...

Results

Results at the end of training is saved under ./results directory. Results directory structure is as following:

- results
    |-dataset name
            |-evaluation
                |-clusters (for plotting t-SNE and PCA plots of embeddings)
                |-reconstructions (not used)
            |-training
                |-model_mode (e.g. ae for autoencoder)   
                     |-model
                     |-plots
                     |-loss

You can save results of evaluations under "evaluation" folder.

Experiment tracking

MLFlow is used to track experiments. It is turned off by default, but can be turned on by changing option on this line in runtime config file in ./config/runtime.yaml

Citing the paper

@article{ucar2021subtab,
  title={SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning},
  author={Ucar, Talip and Hajiramezanali, Ehsan and Edwards, Lindsay},
  journal={arXiv preprint arXiv:2110.04361},
  year={2021}
}

Citing this repo

If you use SubTab framework in your own studies, and work, please cite it by using the following:

@Misc{talip_ucar_2021_SubTab,
  author =   {Talip Ucar},
  title =    {{SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning}},
  howpublished = {\url{https://github.com/AstraZeneca/SubTab}},
  month        = June,
  year = {since 2021}
}
Owner
AstraZeneca
Data and AI: Unlocking new science insights
AstraZeneca
Turn based roguelike in python

pyTB Turn based roguelike in python Documentation can be found here: http://mcgillij.github.io/pyTB/index.html Screenshot Dependencies Written in Pyth

Jason McGillivray 4 Sep 29, 2022
MEND: Model Editing Networks using Gradient Decomposition

MEND: Model Editing Networks using Gradient Decomposition Setup Environment This codebase uses Python 3.7.9. Other versions may work as well. Create a

Eric Mitchell 141 Dec 02, 2022
No Code AI/ML platform

NoCodeAIML No Code AI/ML platform - Community Edition Video credits: Uday Kiran Typical No Code AI/ML Platform will have features like drag and drop,

Bhagvan Kommadi 5 Jan 28, 2022
[AAAI22] Reliable Propagation-Correction Modulation for Video Object Segmentation

Reliable Propagation-Correction Modulation for Video Object Segmentation (AAAI22) Preview version paper of this work is available at: https://arxiv.or

Xiaohao Xu 70 Dec 04, 2022
Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at [email protected]

TableParser Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at DS3 Lab 11 Dec 13, 2022

ISNAS-DIP: Image Specific Neural Architecture Search for Deep Image Prior [CVPR 2022]

ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior (CVPR 2022) Metin Ersin Arican*, Ozgur Kara*, Gustav Bredell, Ender Konukogl

Özgür Kara 24 Dec 18, 2022
Progressive Coordinate Transforms for Monocular 3D Object Detection

Progressive Coordinate Transforms for Monocular 3D Object Detection This repository is the official implementation of PCT. Introduction In this paper,

58 Nov 06, 2022
image scene graph generation benchmark

Scene Graph Benchmark in PyTorch 1.7 This project is based on maskrcnn-benchmark Highlights Upgrad to pytorch 1.7 Multi-GPU training and inference Bat

Microsoft 303 Dec 27, 2022
Use unsupervised and supervised learning to predict stocks

AIAlpha: Multilayer neural network architecture for stock return prediction This project is meant to be an advanced implementation of stacked neural n

Vivek Palaniappan 1.5k Dec 26, 2022
Real-Time Semantic Segmentation in Mobile device

Real-Time Semantic Segmentation in Mobile device This project is an example project of semantic segmentation for mobile real-time app. The architectur

708 Jan 01, 2023
NuPIC Studio is an all­-in-­one tool that allows users create a HTM neural network from scratch

NuPIC Studio is an all­-in-­one tool that allows users create a HTM neural network from scratch, train it, collect statistics, and share it among the members of the community. It is not just a visual

HTM Community 93 Sep 30, 2022
Code for ACM MM2021 paper "Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection"

CTDNet The PyTorch code for ACM MM2021 paper "Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection" Requirements Python 3.6

CVTEAM 28 Oct 20, 2022
Code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2021

The repo provides the code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2

Yuning Mao 18 May 24, 2022
Official implementation of FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 @ ICASSP 2021

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech synthesis (ICASSP 2021) Paper | Demo Block diagram of FCL-taco2, where the decode

Disong Wang 39 Sep 28, 2022
The Few-Shot Bot: Prompt-Based Learning for Dialogue Systems

Few-Shot Bot: Prompt-Based Learning for Dialogue Systems This repository includes the dataset, experiments results, and code for the paper: Few-Shot B

Andrea Madotto 103 Dec 28, 2022
SeqTR: A Simple yet Universal Network for Visual Grounding

SeqTR This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling fo

seanZhuh 76 Dec 24, 2022
nnFormer: Interleaved Transformer for Volumetric Segmentation

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation ". Please

jsguo 610 Dec 28, 2022
Semi-supervised Learning for Sentiment Analysis

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining Code, models and Datasets for《Neural Semi-supervised Learning fo

47 Jan 01, 2023
Segmentation Training Pipeline

Segmentation Training Pipeline This package is a part of Musket ML framework. Reasons to use Segmentation Pipeline Segmentation Pipeline was developed

Musket ML 52 Dec 12, 2022
Make differentially private training of transformers easy for everyone

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022