Repository for Multimodal AutoML Benchmark

Overview

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Repository for the NeurIPS 2021 Dataset Track Submission "Benchmarking Multimodal AutoML for Tabular Data with Text Fields" (Link, Full Paper with Appendix). An earlier version of the paper, called "Multimodal AutoML on Structured Tables with Text Fields" (Link) has been accepted by ICML 2021 AutoML workshop as Oral. As we have since updated the benchmark with more datasets, the version used in the AutoML workshop paper has been archived at the icml_workshop branch.

This benchmark contains a diverse collection of tabular datasets. Each dataset contains numeric/categorical as well as text columns. The goal is to evaluate the performance of (automated) ML systems for supervised learning (classification and regression) with such multimodal data. The folder multimodal_text_benchmark/scripts/benchmark/ provides Python scripts to run different variants of the AutoGluon and H2O AutoML tools on the benchmark.

Datasets used in the Benchmark

Here's a brief summary of the datasets in our benchmark. Each dataset is described in greater detail in the multimodal_text_benchmark/ folder.

ID key #Train #Test Task Metric Prediction Target
prod product_sentiment_machine_hack 5,091 1,273 multiclass accuracy sentiment related to product
salary data_scientist_salary 15,84 3961 multiclass accuracy salary range in data scientist job listings
airbnb melbourne_airbnb 18,316 4,579 multiclass accuracy price of Airbnb listing
channel news_channel 20,284 5,071 multiclass accuracy category of news article
wine wine_reviews 84,123 21,031 multiclass accuracy variety of wine
imdb imdb_genre_prediction 800 200 binary roc_auc whether film is a drama
fake fake_job_postings2 12,725 3,182 binary roc_auc whether job postings are fake
kick kick_starter_funding 86,052 21,626 binary roc_auc will Kickstarter get funding
jigsaw jigsaw_unintended_bias100K 100,000 25,000 binary roc_auc whether comments are toxic
qaa google_qa_answer_type_reason_explanation 4,863 1,216 regression r2 type of answer
qaq google_qa_question_type_reason_explanation 4,863 1,216 regression r2 type of question
book bookprice_prediction 4,989 1,248 regression r2 price of books
jc jc_penney_products 10,860 2,715 regression r2 price of JC Penney products
cloth women_clothing_review 18,788 4,698 regression r2 review score
ae ae_price_prediction 22,662 5,666 regression r2 American-Eagle item prices
pop news_popularity2 24,007 6,002 regression r2 news article popularity online
house california_house_price 24,007 6,002 regression r2 sale price of houses in California
mercari mercari_price_suggestion100K 100,000 25,000 regression r2 price of Mercari products

License

The versions of datasets in this benchmark are released under the CC BY-NC-SA license. Note that the datasets in this benchmark are modified versions of previously publicly-available original copies and we do not own any of the datasets in the benchmark. Any data from this benchmark which has previously been published elsewhere falls under the original license from which the data originated. Please refer to the licenses of each original source linked in the multimodal_text_benchmark/README.md.

Install the Benchmark Suite

cd multimodal_text_benchmark
# Install the benchmarking suite
python3 -m pip install -U -e .

You can do a quick test of the installation by going to the test folder

cd multimodal_text_benchmark/tests
python3 -m pytest test_datasets.py

To work with one of the datasets, use the following code:

from auto_mm_bench.datasets import dataset_registry

print(dataset_registry.list_keys())  # list of all dataset names
dataset_name = 'product_sentiment_machine_hack'

train_dataset = dataset_registry.create(dataset_name, 'train')
test_dataset = dataset_registry.create(dataset_name, 'test')
print(train_dataset.data)
print(test_dataset.data)

To access all datasets that comprise the benchmark:

from auto_mm_bench.datasets import create_dataset, TEXT_BENCHMARK_ALIAS_MAPPING

for dataset_name in list(TEXT_BENCHMARK_ALIAS_MAPPING.values()):
    print(dataset_name)
    dataset = create_dataset(dataset_name)

Run Experiments

Go to multimodal_text_benchmark/scripts/benchmark to see how to run some baseline ML methods over the benchmark.

References

BibTeX entry of the ICML Workshop Version:

@article{agmultimodaltext,
  title={Multimodal AutoML on Structured Tables with Text Fields},
  author={Shi, Xingjian and Mueller, Jonas and Erickson, Nick and Li, Mu and Smola, Alexander},
  journal={8th ICML Workshop on Automated Machine Learning (AutoML)},
  year={2021}
}
Owner
Xingjian Shi
Xingjian Shi
This is the first released system towards complex meters` detection and recognition, which is implemented by computer vision techniques.

A three-stage detection and recognition pipeline of complex meters in wild This is the first released system towards detection and recognition of comp

Yan Shu 19 Nov 28, 2022
dyld_shared_cache processing / Single-Image loading for BinaryNinja

Dyld Shared Cache Parser Author: cynder (kat) Dyld Shared Cache Support for BinaryNinja Without any of the fuss of requiring manually loading several

cynder 76 Dec 28, 2022
Official pytorch implementation of "DSPoint: Dual-scale Point Cloud Recognition with High-frequency Fusion"

DSPoint Official pytorch implementation of "DSPoint: Dual-scale Point Cloud Recognition with High-frequency Fusion" Coming soon, as soon as I finish a

Ziyao Zeng 14 Feb 26, 2022
Awesome-AI-books - Some awesome AI related books and pdfs for learning and downloading

Awesome AI books Some awesome AI related books and pdfs for downloading and learning. Preface This repo only used for learning, do not use in business

luckyzhou 1k Jan 01, 2023
[CVPR 2021] Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach This is the repo to host the dataset TextSeg and code for TexRNe

SHI Lab 174 Dec 19, 2022
An Implicit Function Theorem (IFT) optimizer for bi-level optimizations

iftopt An Implicit Function Theorem (IFT) optimizer for bi-level optimizations. Requirements Python 3.7+ PyTorch 1.x Installation $ pip install git+ht

The Money Shredder Lab 2 Dec 02, 2021
A project for developing transformer-based models for clinical relation extraction

Clinical Relation Extration with Transformers Aim This package is developed for researchers easily to use state-of-the-art transformers models for ext

uf-hobi-informatics-lab 101 Dec 19, 2022
CLIP+FFT text-to-image

Aphantasia This is a text-to-image tool, part of the artwork of the same name. Based on CLIP model, with FFT parameterizer from Lucent library as a ge

vadim epstein 690 Jan 02, 2023
SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data (AAAI 2021)

SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data (AAAI 2021) PyTorch implementation of SnapMix | paper Method Overview Cite

DavidHuang 126 Dec 30, 2022
A knowledge base construction engine for richly formatted data

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data. Note that Fonduer is

HazyResearch 386 Dec 05, 2022
Semantic Segmentation with SegFormer on Drone Dataset.

SegFormer_Segmentation Semantic Segmentation with SegFormer on Drone Dataset. You can check out the blog on Medium You can also try out the model with

Praneet 8 Oct 20, 2022
Official code repository for "Exploring Neural Models for Query-Focused Summarization"

Query-Focused Summarization Official code repository for "Exploring Neural Models for Query-Focused Summarization" This is a work in progress. Expect

Salesforce 29 Dec 18, 2022
BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting

BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting Updated on December 10, 2021 (Release all dataset(2021 videos)) Updated o

weijiawu 47 Dec 26, 2022
Powerful unsupervised domain adaptation method for dense retrieval.

Powerful unsupervised domain adaptation method for dense retrieval

Ubiquitous Knowledge Processing Lab 191 Dec 28, 2022
NeROIC: Neural Object Capture and Rendering from Online Image Collections

NeROIC: Neural Object Capture and Rendering from Online Image Collections This repository is for the source code for the paper NeROIC: Neural Object C

Snap Research 647 Dec 27, 2022
A custom-designed Spider Robot trained to walk using Deep RL in a PyBullet Simulation

SpiderBot_DeepRL Title: Implementation of Single and Multi-Agent Deep Reinforcement Learning Algorithms for a Walking Spider Robot Authors(s): Arijit

Arijit Dasgupta 9 Jul 28, 2022
Toolbox to analyze temporal context invariance of deep neural networks

PyTCI A toolbox that estimates the integration window of a sensory response using the "Temporal Context Invariance" paradigm (TCI). The TCI method Int

4 Oct 23, 2022
Dataset used in "PlantDoc: A Dataset for Visual Plant Disease Detection" accepted in CODS-COMAD 2020

PlantDoc: A Dataset for Visual Plant Disease Detection This repository contains the Cropped-PlantDoc dataset used for benchmarking classification mode

Pratik Kayal 109 Dec 29, 2022
😇A pyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc

------ Update September 2018 ------ It's been a year since TorchMoji and DeepMoji were released. We're trying to understand how it's being used such t

Hugging Face 865 Dec 24, 2022
Python 3 module to print out long strings of text with intervals of time inbetween

Python-Fastprint Python 3 module to print out long strings of text with intervals of time inbetween Install: pip install fastprint Sync Usage: from fa

Kainoa Kanter 2 Jun 27, 2022