CLIPort: What and Where Pathways for Robotic Manipulation

Overview

Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2021

CLIPort is an end-to-end imitation-learning agent that can learn a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations.

For the latest updates, see: cliport.github.io

Guides

Installation

Clone Repo:

git clone https://github.com/cliport/cliport.git

Setup virtualenv and install requirements:

# setup virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages cliport_env  
source cliport_env/bin/activate
pip install --upgrade pip

cd cliport
pip install -r requirements.txt

export CLIPORT_ROOT=$(pwd)
python setup.py develop

Note: You might need builds of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA version and hardware.
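For example, a possible install for CUDA 11.0 (this is only a sketch; adjust the +cu tag to match your setup):

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html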

Quickstart

A quick tutorial on evaluating a pre-trained multi-task model.

Download a pre-trained multi-language-conditioned checkpoint trained with 1000 demos:

python scripts/quickstart_download.py

Generate a small test set of 10 instances for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:

python cliport/demos.py n=10 \
                        task=stack-block-pyramid-seq-seen-colors \
                        mode=test 

This will take a few minutes to finish.

Evaluate the best validation checkpoint for stack-block-pyramid-seq-seen-colors on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=10 \
                       train_demos=1000 \
                       exp_folder=cliport_quickstart \
                       checkpoint_type=test_best \
                       update_results=True \
                       disp=True

If you are on a headless machine, turn off the visualization with disp=False.

You can evaluate the same multi-language-conditioned model on other tasks. First generate a val set for the task, then specify eval_task=<task_name> with mode=val and checkpoint_type=val_missing (the quickstart doesn't include validation results for all tasks; download all task results from here).
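For example, a rough sketch for towers-of-hanoi-seq-seen-colors (generate a small val set first, then run validation with the multi-task model; adjust n_demos to match the set you generated):

python cliport/demos.py n=10 \
                        task=towers-of-hanoi-seq-seen-colors \
                        mode=val

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=towers-of-hanoi-seq-seen-colors \
                       agent=cliport \
                       mode=val \
                       n_demos=10 \
                       train_demos=1000 \
                       exp_folder=cliport_quickstart \
                       checkpoint_type=val_missing \
                       update_results=True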

Download

Google Scanned Objects

Download center-of-mass (COM) corrected Google Scanned Objects:

python scripts/google_objects_download.py

Credit: Google.

Pre-trained Checkpoints and Result JSONs

This Google Drive Folder contains pre-trained multi-language-conditioned checkpoints for n=1,10,100,1000 and validation/test result JSONs for all tasks. The *val-results.json files contain the name of the best checkpoint (from validation) to be evaluated on the test set.

Note: Google Drive might complain about bandwidth restrictions. I recommend using rclone with API access enabled.

Evaluate the best validation checkpoint on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=10 \
                       train_demos=100 \
                       exp_folder=cliport_exps \
                       checkpoint_type=test_best \
                       update_results=True \
                       disp=True

Training and Evaluation

The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:

  1. Generate train, val, test datasets with demos.py
  2. Train agents with train.py
  3. Run validation with eval.py to find the best checkpoint on val tasks and save *val-results.json
  4. Evaluate the best checkpoint in *val-results.json on test tasks with eval.py

Dataset Generation

Single Task

Generate a train set of 1000 demonstrations for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:

python cliport/demos.py n=1000 \
                        task=stack-block-pyramid-seq-seen-colors \
                        mode=train 

You can also do a sequential sweep with the -m flag and comma-separated params, e.g. task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors, as shown below. Use disp=True to visualize the data generation.
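For example, a sequential sweep that generates train sets for two tasks in one run:

python cliport/demos.py -m n=1000 \
                        task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors \
                        mode=train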

Full Dataset

Run generate_dataset.sh to generate the full dataset and save it to $CLIPORT_ROOT/data:

sh scripts/generate_dataset.sh data

Note: This script is not parallelized and will take a long time (possibly days) to finish. The full dataset requires ~1.6TB of storage, which includes both language-conditioned and demo-conditioned (original TransporterNets) tasks. It's recommended that you start with single-task training if you don't have enough storage space.

Single-Task Training & Evaluation

Make sure you have a train (n demos) and val (100 demos) set for the task you want to train on.

Training

Train a cliport agent with 1000 demonstrations on the stack-block-pyramid-seq-seen-colors task for 200K iterations:

python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
                        train.agent=cliport \
                        train.attn_stream_fusion_type=add \
                        train.trans_stream_fusion_type=conv \
                        train.lang_fusion_type=mult \
                        train.n_demos=1000 \
                        train.n_steps=201000 \
                        train.exp_folder=exps \
                        dataset.cache=False 

Validation

Iteratively evaluate all the checkpoints on val and save the results in exps/<task>-train/checkpoints/<task>-val-results.json:

python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=val \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=val_missing \
                       exp_folder=exps 

Test

Choose the best checkpoint from validation to run on the test set and save the results in exps/<task>-train/checkpoints/<task>-test-results.json:

python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=test_best \
                       exp_folder=exps 

Multi-Task Training & Evaluation

Training

Train multi-task models by specifying task=multi-language-conditioned, task=multi-loo-packing-box-pairs-unseen-colors (loo stands for leave-one-out, i.e. the multi-attr tasks), etc.

python cliport/train.py train.task=multi-language-conditioned \
                        train.agent=cliport \
                        train.attn_stream_fusion_type=add \
                        train.trans_stream_fusion_type=conv \
                        train.lang_fusion_type=mult \
                        train.n_demos=1000 \
                        train.n_steps=601000 \
                        dataset.cache=False \
                        train.exp_folder=exps \
                        dataset.type=multi 

Important: You need to generate the full dataset of tasks specified in dataset.py before multi-task training, or modify the list of tasks here.

Validation

Run validation with a trained multi-language-conditioned multi-task model on stack-block-pyramid-seq-seen-colors:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=val \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=val_missing \
                       type=single \
                       exp_folder=exps 

Test

Evaluate the best checkpoint on the test set:

python cliport/eval.py model_task=multi-language-conditioned \
                       eval_task=stack-block-pyramid-seq-seen-colors \
                       agent=cliport \
                       mode=test \
                       n_demos=100 \
                       train_demos=1000 \
                       checkpoint_type=test_best \
                       type=single \
                       exp_folder=exps 

Disclaimers

  • Code Quality Level: Tired grad student.
  • Scaling: The code only works for batch size 1. See #issue1 for reference. In theory, there is nothing preventing larger batch sizes other than GPU memory constraints.
  • Memory and Storage: There are lots of places where memory usage can be reduced. You don't need 3 copies of the same CLIP ResNet50 and you don't need to save its weights in checkpoints since it's frozen anyway. Dataset sizes could be dramatically reduced with better storage formats and compression.
  • Frameworks: There are lots of leftover NumPy bits from when I was trying to reproduce the TransporterNets results. I'll try to clean these up when I get some time.
  • Rotation Augmentation: All tasks use the same distribution for sampling SE(2) rotation perturbations. This obviously leads to issues with tasks that involve spatial relationships like 'left' or 'forward'.
  • Evaluation Runs: In an ideal setting, the evaluation metrics should be averaged over 3 or more repetitions with different seeds. This might be feasible if you are working just with multi-task models.
  • Duplicate Training Sets: The train sets of some *seen and *unseen tasks are identical, and only the val and test sets differ for purposes of evaluating generalization performance. So you might not need two duplicate train sets or train two separate models.
  • Other Limitations: Check out Appendix I in the paper.

Notebooks

Check out Kevin Zakka's Colab for zero-shot detection with CLIP. This notebook might be a good way of gauging what sorts of visual attributes CLIP can ground with language. But note that CLIPort does NOT do "object detection"; it directly "detects actions".

Other Todos

  • Dataset Visualizer
  • Affordance Heatmap Visualizer
  • Evaluation Results Plot

Docker Guide

Install Docker and NVIDIA Docker.

Modify docker_build.py and docker_run.py to suit your needs.

Build

Build the image:

python scripts/docker_build.py 

Run

Start the container and change into the repo directory inside it:

python scripts/docker_run.py --nvidia_docker

cd ~/cliport

Use scripts/docker_run.py --headless if you are on a headless machine like a remote server or cloud instance.

Real-Robot Training FAQ

How much training data do I need?

It depends on the complexity of the task. With 5-10 demonstrations, the agent should start to do something useful, but it will often make mistakes, such as picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 1, 10, 100, and 1000 demonstrations.
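A rough sketch of that workflow, reusing the commands from the Training and Evaluation guide; my-real-task-sim is a hypothetical task name that you would need to implement and register first:

# generate train and val sets for the simulated version of your task
python cliport/demos.py n=1000 task=my-real-task-sim mode=train
python cliport/demos.py n=100  task=my-real-task-sim mode=val

# train agents with increasing amounts of data (repeat with train.n_demos=1,10,100,1000)
python cliport/train.py train.task=my-real-task-sim \
                        train.agent=cliport \
                        train.n_demos=10 \
                        train.exp_folder=exps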

Why doesn't the agent follow my language instruction?

This means that either there is some sort of bias in the dataset that the agent is exploiting, or you don't have enough training data. Also make sure the task is doable: if a referred attribute is barely legible in the input, it's going to be hard for the agent to figure out what you mean.

Does CLIPort predict height (z-values) of the end-effector?

CLIPort does not predict height values. You can either: (1) come up with a heuristic based on the heightmap to determine the height position, or (2) train a simple MLP like in TransportNets-6DOF to predict z-values.
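A minimal PyTorch sketch of option (2), assuming you extract a feature vector at the predicted pick/place pixel and have ground-truth z-values from your demonstrations (all names and dimensions here are hypothetical, not part of the CLIPort codebase):

import torch
import torch.nn as nn

class ZRegressor(nn.Module):
    """Tiny MLP: feature vector at the predicted pixel -> end-effector height z."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)

# toy training loop with random stand-in data; replace with your own dataset
model = ZRegressor()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
feats = torch.randn(64, 256)       # features at predicted pick/place pixels
z_target = torch.rand(64) * 0.3    # ground-truth heights (meters) from demos
for _ in range(100):
    loss = loss_fn(model(feats), z_target)
    optim.zero_grad()
    loss.backward()
    optim.step()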

Shouldn't CLIP help with zero-shot detection of things? Why do I need to collect more data?

Note that CLIPort is not doing "object detection". CLIPort fine-tunes CLIP's representations to "detect actions" in SE(2). CLIP by itself has no understanding of actions or affordances; recognizing and localizing objects (e.g. detecting a hammer) does not tell you anything about how to manipulate them (e.g. grasping the hammer by the handle).

What are the best hyperparams for real-robot training?

The default settings should work well. Recently, though, I have been playing around with using FiLM (Perez et al., 2017) to fuse language features, inspired by BC-Z (Jang et al., 2021). Qualitatively, FiLM seems better for reading text etc., but I haven't conducted a full quantitative analysis. Try it out yourself with train.agent=two_stream_clip_film_lingunet_lat_transporter (non-residual FiLM).
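For example, a sketch reusing the single-task training command from above with the FiLM agent swapped in (other settings unchanged; treat this as a starting point, not a tuned recipe):

python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
                        train.agent=two_stream_clip_film_lingunet_lat_transporter \
                        train.n_demos=1000 \
                        train.n_steps=201000 \
                        train.exp_folder=exps \
                        dataset.cache=False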

How to pick the best checkpoint for real-robot tasks?

Ideally, you should create a validation set with heldout instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals.
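A rough sketch of that selection step, assuming you can collect predicted and ground-truth SE(2) poses (x, y, theta) per checkpoint on the held-out set (all names here are hypothetical):

import numpy as np

def pose_errors(pred, gt):
    """Mean translation (m) and rotation (rad) error between SE(2) poses."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    trans_err = np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1).mean()
    dtheta = pred[:, 2] - gt[:, 2]
    rot_err = np.abs(np.arctan2(np.sin(dtheta), np.cos(dtheta))).mean()  # wrap to [-pi, pi]
    return trans_err, rot_err

def best_checkpoint(results, rot_weight=0.1):
    """results: {checkpoint_name: (pred_poses, gt_poses)}; lowest combined error wins."""
    scores = {}
    for name, (pred, gt) in results.items():
        t_err, r_err = pose_errors(pred, gt)
        scores[name] = t_err + rot_weight * r_err
    return min(scores, key=scores.get)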

Why is the agent confusing directions like 'forward' and 'left'?

By default, training samples are augmented with SE(2) rotations sampled from N(0, 60 deg). For tasks with rotational symmetries (like moving pieces on a chessboard), you need to be careful with this rotation augmentation parameter.
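A small illustrative sketch of why a wide rotation distribution scrambles direction words (the actual augmentation lives in the data pipeline; this is not the repo's code):

import numpy as np

rng = np.random.default_rng(0)
std_rad = np.deg2rad(60.0)                 # default std mentioned in this FAQ
thetas = rng.normal(0.0, std_rad, size=5)  # sampled SE(2) rotation perturbations

def rotate_xy(xy, theta):
    """Rotate a table-plane displacement by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ xy

left = np.array([0.0, 1.0])                # hypothetical 'left' direction in the table frame
for theta in thetas:
    print(f"{np.rad2deg(theta):6.1f} deg ->", np.round(rotate_xy(left, theta), 2))
# large perturbations can rotate a 'left' motion until it looks like 'forward'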

Acknowledgements

This work uses code from the following open-source projects and datasets:

Google Ravens (TransporterNets)

Original: https://github.com/google-research/ravens
License: Apache 2.0
Changes: All PyBullet tasks are directly adapted from the Ravens codebase. The original TransporterNets models were reimplemented in PyTorch.

OpenAI CLIP

Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to CLIP-ResNet50 to save intermediate features for skip connections.

Google Scanned Objects

Original: Dataset
License: Creative Commons BY 4.0
Changes: Fixed center-of-mass (COM) to be geometric-center for selected objects.

U-Net

Original: https://github.com/milesial/Pytorch-UNet/
License: GPL 3.0
Changes: Used as is in unet.py. Note: This part of the code is GPL 3.0.

Citations

CLIPort

@inproceedings{shridhar2021cliport,
  title     = {CLIPort: What and Where Pathways for Robotic Manipulation},
  author    = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
  booktitle = {Proceedings of the 5th Conference on Robot Learning (CoRL)},
  year      = {2021},
}

CLIP

@article{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  journal={arXiv preprint arXiv:2103.00020},
  year={2021}
}

TransporterNets

@inproceedings{zeng2020transporter,
  title={Transporter networks: Rearranging the visual world for robotic manipulation},
  author={Zeng, Andy and Florence, Pete and Tompson, Jonathan and Welker, Stefan and Chien, Jonathan and Attarian, Maria and Armstrong, Travis and Krasin, Ivan and Duong, Dan and Sindhwani, Vikas and others},
  booktitle={Proceedings of the 4th Conference on Robot Learning (CoRL)},
  year= {2020},
}

Questions or Issues?

Please file an issue with the issue tracker.
