Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

Last update: Jan 04, 2023

Related tags

Overview

Foodi-ML dataset

This is the GitHub repository for the Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names, descriptions and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from over 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprehends 33 languages, including 870k samples of languages of countries from Eastern Europe and West Asia such as Ukrainian and Kazakh, which have been so far underrepresented in publicly available visio-linguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

License

The FooDI-ML dataset is offered under the BY-NC-SA license.

1. Download the dataset

The FooDI-ML dataset is hosted in a S3 bucket in AWS. Therefore AWS CLI is needed to download it. Our dataset is composed of:

One DataFrame (glovo-foodi-ml-dataset) stored as a csv file containing all text information + image paths in S3. The size of this CSV file is 540 MB.
Set of images listed in the DataFrame. The disk space required to store all images is 316.1 GB.

1.1. Download AWS CLI

If you do not have AWS CLI already installed, please download the latest version of AWS CLI for your operating system.

1.2. Download FooDI-ML

Run the following command to download the DataFrame in ENTER_DESTINATION_PATH directory. We provide an example as if we were going to download the dataset in the directory /mnt/data/foodi-ml/.

aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv ENTER_DESTINATION_PATH --no-sign-request

Example: aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv /mnt/data/foodi-ml/ --no-sign-request
Run the following command to download the images in ENTER_DESTINATION_PATH/dataset directory (please note the appending of /dataset). This command will download the images in ENTER_DESTINATION_PATHdirectory.

aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset ENTER_DESTINATION_PATH/dataset --no-sign-request --quiet

Example: aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset /mnt/data/foodi-ml/dataset --no-sign-request --quiet
Run the script rename_images.py. This script modifies the DataFrame column to include the paths of the images in the location you specified with ENTER_DESTINATION_PATH/dataset.
```
pip install pandas
python scripts/rename_images.py --output-dir ENTER_DESTINATION_PATH
```

Getting started

Our dataset is managed by the DataFrame glovo-foodi-ml-dataset.csv. This dataset contains the following columns:

country_code: This column comprehends 37 unique country codes as explained in our paper. These codes are:

'ES', 'PL', 'CI', 'PT', 'MA', 'IT', 'AR', 'BG', 'KZ', 'BR', 'ME', 'TR', 'PE', 'SI', 'GE', 'EG', 'RS', 'RO', 'HR', 'UA', 'DO', 'KG', 'CR', 'UY', 'EC', 'HN', 'GH', 'KE', 'GT', 'CL', 'FR', 'BA', 'PA', 'UG', 'MD', 'NG', 'PR'
city_code: Name of the city where the store is located.
store_name: Name of the store selling that product. If store_name is equal to AS_XYZ, it represents an auxiliary store. This means that while the samples contained are for the most part valid, the store name can't be used in learning tasks
product_name: Name of the product. All products have product_name, so this column does not contain any NaN value.
collection_section: Name of the section of the product, used for organizing the store menu. Common values are "drinks", "our pizzas", "desserts". All products have collection_section associated to it, so this column does not have any NaN value in it.
product_description: A detailed description of the product, describing ingredients and components of it. Not all products of our data have description, so this column contains NaN values that must be removed by the researchers as a preprocessing step.
subset: Categorical variable indicating if the sample belongs to the Training, Validation or Test set. The respective values in the DataFrame are ["train", "val", "test"].
HIER: Boolean variable indicating if the store name can be used to retrieve product information (indicating if the store_name is not an auxiliary store (with code AS_XYZ)).
s3_path: Path of the image of the product in the disk location you chose.

Dataset Statistics

A notebook analyzing several dataset statistics is provided in notebooks/FooDI-ML Dataset Stats Analytics.ipynb.

Benchmark

To run the benchmark included in the original paper one must follow the procedure listed in the following link.

The hyperparameters of the model are included here link

Citation

This paper is under review. In the meanwhile you can cite it in arxiv: https://arxiv.org/abs/2110.02035

Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

Related tags

Overview

Foodi-ML dataset

License

1. Download the dataset

1.1. Download AWS CLI

1.2. Download FooDI-ML

Getting started

Dataset Statistics

Benchmark

Citation

Owner

Related resources for our EMNLP 2021 paper

Colour detection is necessary to recognize objects, it is also used as a tool in various image editing and drawing apps.

scikit-learn: machine learning in Python

Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

Official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

The source code of CVPR17 'Generative Face Completion'.

MiniSom is a minimalistic implementation of the Self Organizing Maps

Citation Intent Classification in scientific papers using the Scicite dataset an Pytorch

Code from the paper "High-Performance Brain-to-Text Communication via Handwriting"

A dead simple python wrapper for darknet that works with OpenCV 4.1, CUDA 10.1

Pytorch implementation of Decoupled Spatial-Temporal Transformer for Video Inpainting

x-transformers-paddle 2.x version

A python-image-classification web application project, written in Python and served through the Flask Microframework. This Project implements the VGG16 covolutional neural network, through Keras and Tensorflow wrappers, to make predictions on uploaded images.

Facilitating Database Tuning with Hyper-ParameterOptimization: A Comprehensive Experimental Evaluation

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

CaFM-pytorch ICCV ACCEPT Introduction of dataset VSD4K

A Kernel fuzzer focusing on race bugs

Repo for our ICML21 paper Unsupervised Learning of Visual 3D Keypoints for Control

Self-training for Few-shot Transfer Across Extreme Task Differences

TensorFlow implementation of Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently.