Command-line tool for downloading and extending the RedCaps dataset.

Overview

RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly download images of officially released annotations as well as download more image-text data from any subreddit over an arbitrary time-span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── 
   
    _
    
     .json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    |   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    |   └── ...
    │
    ├── 
     
      /
    │   ├── 
      
       .jpg
    │   ├── ...
    └── ...

      
     
    
   
  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
    wget https://www.dropbox.com/s/twvv541sbg0qqux/redcaps_v1.0_annotations.zip?dl=1
    unzip redcaps_v1.0_annotations.zip
  3. Download images by using redcaps download-imgs command (for a single annotation file).

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a $ann_file --save-to path/to/images --resize 512 -j 4
        # Set --resize -1 to turn off resizing shorter edge (saves disk space).
    done

    Parallelize download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each have their own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. Follow these steps for any subreddit and years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    : " }, "imgur": { "client_id": "Your client ID here", "client_secret": "Your client secret here" } } ">
    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "
          
           : 
           "
          
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide a Client ID and Client Secret tokens - fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything and keep it unchanged (e.g. your name).

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from r/roses (2020)

  1. download-anns: Dowload annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020:
    for ((month = 1; month <= 12; month += 1)); do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in official release as some of them may have disappeared (deleted) over time. After this step, the dataset directory would contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is same as (3) in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: All the above steps create a single annotation file (and downloads images) similar to official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
Owner
RedCaps dataset
RedCaps dataset
Command Line Manager + Interactive Shell for Python Projects

Manage Command Line Manager + Interactive Shell for Python Projects

Python Manage 123 Aug 28, 2022
💻 Physics2Calculator - A simple and powerful calculator for Physics 2

💻 Physics2Calculator A simple and powerful calculator for Physics 2 🔌 Predefined constants pi = 3.14159... k = 8988000000 (coulomb constant) e0 = 8.

Dylan Tintenfich 4 Dec 01, 2021
Multifunctional library for creating progress bars.

👋 Content Installation Using github Using pypi Quickstart Flags Useful links Documentation Pypi Changelog TODO Contributing FAQ Bar structure ⚙️ Inst

DenyS 27 Jan 01, 2023
adds flavor of interactive filtering to the traditional pipe concept of UNIX shell

percol __ ____ ___ ______________ / / / __ \/ _ \/ ___/ ___/ __ \/ / / /_/ / __/ / / /__/ /_/ / / / .__

Masafumi Oyamada 3.2k Jan 07, 2023
split-manga-pages: a command line utility written in Python that converts your double-page layout manga to single-page layout.

split-manga-pages split-manga-pages is a command line utility written in Python that converts your double-page layout manga (or any images in double p

Christoffer Aakre 3 May 24, 2022
Dark powered asynchronous completion framework for neovim/Vim8

deoplete.nvim Dark powered asynchronous completion framework for neovim/Vim8 Note: The development of this plugin is finished. Accepts minor patches a

Shougo 5.9k Dec 30, 2022
GoogleFormSpammer - A simple CLI script to spam Google Forms used by Crypto Wallet scammers to collect stolen data

GoogleFormSpammer - A simple CLI script to spam Google Forms used by Crypto Wallet scammers to collect stolen data

14 Dec 17, 2022
A supercharged Git/GitHub command line interface (CLI)

A supercharged Git/GitHub command line interface (CLI).

Donne Martin 7.4k Jan 07, 2023
Create animated ASCII-art for the command line almost instantly!

clippy Create and play colored 🟥 🟩 🟦 or colorless ⬛️ ⬜️ animated, or static, ASCII-art in the command line! clippy can help if you are wanting to;

Connor 10 Jun 26, 2022
PyArmor is a command line tool used to obfuscate python scripts

PyArmor is a command line tool used to obfuscate python scripts, bind obfuscated scripts to fixed machine or expire obfuscated scripts.

Dashingsoft 2k Jan 07, 2023
Python CLI utility and library for manipulating SQLite databases

sqlite-utils Python CLI utility and library for manipulating SQLite databases. Some feature highlights Pipe JSON (or CSV or TSV) directly into a new S

Simon Willison 1.1k Jan 04, 2023
The easiest way to create beautiful CLI for your programs.

The Yandere is a program written in Python3, allowing you to create your own beautiful CLI tool.

Billy 31 Dec 20, 2022
Bonjour Software pypahe is a Python Package Helper command-line tool.

pypahe Bonjour Software pypahe is a Python Package Helper command-line tool. Requirements Docker runtime Usage print the latest available version of a

Bonjour Software 0 Aug 10, 2021
Ralph is a command-line tool to fetch, extract, convert and push your tracking logs from various storage backends to your LRS or any other compatible storage or database backend.

Ralph is a command-line tool to fetch, extract, convert and push your tracking logs (aka learning events) from various storage backends to your

France Université Numérique 18 Jan 05, 2023
A python library for parsing multiple types of config files, envvars & command line arguments that takes the headache out of setting app configurations.

parse_it A python library for parsing multiple types of config files, envvars and command line arguments that takes the headache out of setting app co

Naor Livne 97 Oct 22, 2022
CLI Utility to encode and recursively recreate directories with ffmpeg.

FFenmass CLI Utility to encode and recursively recreate directories with ffmpeg. Report Bug · Request Feature Table of Contents Getting Started Prereq

George Av. 8 May 06, 2022
A basic molecule viewer written in Python, using curses; Thus, meant for linux terminals

asciiMOL A basic molecule viewer written in Python, using curses; Thus, meant for linux terminals. This is an alpha version, featuring: Opening defaul

Dominik Behrens 328 Dec 11, 2022
A Python package for a basic CLI and GUI user interface

Organizer CLI Organizer CLI is a python command line tool that goes through a given directory and organizes all un-folder bound files into folders by

Caltech Library 12 Mar 25, 2022
Tmux Based Dropdown Dashboard For Python

sextans It's a private configuration and an ongoing experiment while I use Archlinux. A simple drop down dashboard based on tmux. It includes followin

秋葉 4 Dec 22, 2021
Themes for Windows Terminal

Windows Terminal Themes Preview and copy themes for the new Windows Terminal. Use the project at windowsterminalthemes.dev How to use the themes This

Tom 1.1k Jan 03, 2023