Command-line tool for downloading and extending the RedCaps dataset.

Overview

RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly download images of officially released annotations as well as download more image-text data from any subreddit over an arbitrary time-span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── 
   
    _
    
     .json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    |   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    |   └── ...
    │
    ├── 
     
      /
    │   ├── 
      
       .jpg
    │   ├── ...
    └── ...

      
     
    
   
  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
    wget https://www.dropbox.com/s/twvv541sbg0qqux/redcaps_v1.0_annotations.zip?dl=1
    unzip redcaps_v1.0_annotations.zip
  3. Download images by using redcaps download-imgs command (for a single annotation file).

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a $ann_file --save-to path/to/images --resize 512 -j 4
        # Set --resize -1 to turn off resizing shorter edge (saves disk space).
    done

    Parallelize download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each have their own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. Follow these steps for any subreddit and years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    : " }, "imgur": { "client_id": "Your client ID here", "client_secret": "Your client secret here" } } ">
    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "
          
           : 
           "
          
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide a Client ID and Client Secret tokens - fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything and keep it unchanged (e.g. your name).

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from r/roses (2020)

  1. download-anns: Dowload annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020:
    for ((month = 1; month <= 12; month += 1)); do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in official release as some of them may have disappeared (deleted) over time. After this step, the dataset directory would contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is same as (3) in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: All the above steps create a single annotation file (and downloads images) similar to official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
Owner
RedCaps dataset
RedCaps dataset
A cd command that learns - easily navigate directories from the command line

NAME autojump - a faster way to navigate your filesystem DESCRIPTION autojump is a faster way to navigate your filesystem. It works by maintaining a d

William Ting 14.5k Jan 03, 2023
A python Ethereum utilities command-line tool.

peth-cli A python Ethereum utilities command-line tool. After wasting the all day trying to install seth and failed, I took another day to write this.

Moon 55 Nov 15, 2022
A simple CLI based any Download Tool, that find files and let you stream or download thorugh WebTorrent CLI or Aria or any command tool

Privateer A simple CLI based any Download Tool, that find files and let you stream or download thorugh WebTorrent CLI or Aria or any command tool How

Shreyash Chavan 2 Apr 04, 2022
A very simple and lightweight ToDo app using python that can be used from the command line

A very simple and lightweight ToDo app using python that can be used from the command line

Nilesh Sengupta 2 Jul 20, 2022
Chat In Terminal - Chat-App in python

Chat In Terminal Hello all. 😉 Sockets and servers are vey important for connection and importantly chatting with others. 😂 😁 I have thought of maki

Shreejan Dolai 5 Nov 17, 2022
A **CLI** folder organizer written in Python.

Fsorter Introduction A CLI folder organizer written in Python. Dependencies Before installing, install the following dependencies: Ubuntu/Debain Based

1 Nov 17, 2021
A command line utility for tracking a stock market portfolio. Primarily featuring high resolution braille graphs.

A command line stock market / portfolio tracker originally insipred by Ericm's Stonks program, featuring unicode for incredibly high detailed graphs even in a terminal.

Conrad Selig 51 Nov 29, 2022
A very simple python script to encode and decode PowerShell one-liners.

PowerShell Encoder A very simple python script to encode and decode PowerShell one-liners. I used Raikia's PowerShell encoder ALOT, but one day it wen

John Tear 5 Jul 29, 2022
CLI tool to fix linked references for dates.

Fix Logseq dates This is a CLI tool to fix the date references following a change in date format since the current version (0.4.4) of Logseq does not

Isaac Dadzie 5 May 18, 2022
A terminal UI dashboard to monitor requests for code review across Github and Gitlab repositories.

A terminal UI dashboard to monitor requests for code review across Github and Gitlab repositories.

Kyle Harrison 150 Dec 14, 2022
An question and answer shell environment based on xonsh using ansible for setup

An question and answer shell environment based on xonsh using ansible for setup

Steven Hollingsworth 2 Jan 11, 2022
Python API and CLI for the ikea IDÅSEN desk.

idasen This is a heavily modified fork of rhyst/idasen-controller. The IDÅSEN is an electric sitting standing desk with a Linak controller sold by ike

Alex 79 Dec 14, 2022
Apple Silicon 'top' CLI

asitop pip install asitop What A nvtop/htop style/inspired command line tool for Apple Silicon (aka M1) Macs. Note that it requires sudo to run due to

Timothy Liu 1.2k Dec 31, 2022
Container images for portable development environments

Docker Dev Spin up a container to develop from anywhere! To run, just: docker run -ti aghost7/nodejs-dev:boron tmux new Alternatively, if on Linux: p

Jonathan Boudreau 163 Dec 22, 2022
Analysis of a daily word game "Wordle"

Wordle Analysis of a daily word game "Wordle" https://www.powerlanguage.co.uk/wordle/ Description Worlde is a daily word game in which a player attemp

Bartek 1 Feb 07, 2022
A Telegram Bot Written In Python To Upload Medias To telegra.ph

Telegraph-Uploader A Telegram Bot Written In Python To Upload Medias To telegra.ph DEPLOY YOU CAN SIMPLY DEPLOY ON HEROKU BY CLICKING THE BUTTON BELOW

Rithunand 31 Dec 03, 2022
Management commands to help backup and restore your project database and media files

Django Database Backup This Django application provides management commands to help backup and restore your project database and media files with vari

687 Jan 04, 2023
A simple python implementation of a reverse shell

llehs A python implementation of a reverse shell. Note for contributors The project is open for contributions and is hacktoberfest registered! llehs u

Archisman Ghosh 2 Jul 05, 2022
A command-line tool to upload local files and pastebin pastes to your mega account, without signing in anywhere

A command-line tool to upload local files and pastebin pastes to your mega account, without signing in anywhere

ADI 4 Nov 17, 2022
spade is the next-generation networking command line tool.

spade is the next-generation networking command line tool. Say goodbye to the likes of dig, ping and traceroute with more accessible, more informative and prettier output.

Vivaan Verma 5 Jan 28, 2022