Public implementation of "Learning from Suboptimal Demonstration via Self-Supervised Reward Regression" from CoRL'21

Related tags

Deep LearningSSRR
Overview

Self-Supervised Reward Regression (SSRR)

Codebase for CoRL 2021 paper "Learning from Suboptimal Demonstration via Self-Supervised Reward Regression " Authors: Letian "Zac" Chen, Rohan Paleja, Matthew Gombolay

Usage

Quick overview

The pipeline of SSRR includes

  1. Initial IRL: Noisy-AIRL or AIRL.
  2. Noisy Dataset Generation: use initial policy learned in step 1 to generate trajectories with different noise levels and criticize trajectories with initial reward.
  3. Sigmoid Fitting: fit a sigmoid function for the noise-performance relationship using the data obtained in step 2.
  4. Reward Learning: learn a reward function by regressing to the sigmoid relationship obtained in step 3.
  5. Policy Learning: learn a policy by optimizing the reward learned in step 4.

I know this is a long README, but please make sure you read the entirety before trying out our code. Trust me, that will save your time!

Dependencies and Environment Preparations

Code is tested with Python 3.6 with Anaconda.

Required packages:

pip install scipy path.py joblib==0.12.3 flask h5py matplotlib scikit-learn pandas pillow pyprind tqdm nose2 mujoco-py cached_property cloudpickle git+https://github.com/Theano/[email protected]#egg=Theano git+https://github.com/neocxi/[email protected]#egg=Lasagne plotly==2.0.0 gym[all]==0.14.0 progressbar2 tensorflow-gpu==1.15 imgcat

Test sets of trajectories could be downloaded at Google Drive because Github could not hold files that are larger than 100MB! After downloading, please put full_demos/ under demos/.

If you are directly running python scripts, you will need to add the project root and the rllab_archive folder into your PYTHONPATH:

export PYTHONPATH=/path/to/this/repo/:/path/to/this/repo/rllab_archive/

If you are using the bash scripts provided (for example, noisy_airl_ssrr_drex_comparison_halfcheetah.sh), make sure to replace the first line to be

export PYTHONPATH=/path/to/this/repo/:/path/to/this/repo/rllab_archive/

Initial IRL

We provide code for AIRL and Noisy-AIRL implementation.

Running

Examples of running command would be

python script_experiment/halfcheetah_airl.py --output_dir=./data/halfcheetah_airl_test_1
python script_experiment/hopper_noisy_airl.py --output_dir=./data/hopper_noisy_airl_test_1 --noisy

Please note for Noisy-AIRL, you have to include the --noisy flag to make it actually sample trajectories with noise, otherwise it only changes the loss function according to Equation 6 in the paper.

Results

The result will be available in the output dir specified, and we recommend using rllab viskit to visualize it.

We also provide our run results available in data/{halfcheetah/hopper/ant}_{airl/noisy_airl}_test_1 if you want to skip this step!

Code Structure

The AIRL and Noisy-AIRL codes reside in inverse_rl/ with rllab dependencies in rllab_archive. The AIRL code is adjusted from the original AIRL codebase https://github.com/justinjfu/inverse_rl. The rllab archive was adjusted from the original rllab codebase https://github.com/rll/rllab.

Noisy Dataset Generation & Sigmoid Fitting

We implemented noisy dataset generation and sigmoid fitting together in code.

Running

Examples of running command would be

python script_experiment/noisy_dataset.py \
   --log_dir=./results/halfcheetah/temp/noisy_dataset/ \
   --env_id=HalfCheetah-v3 \
   --bc_agent=./results/halfcheetah/temp/bc/model.ckpt \
   --demo_trajs=./demos/suboptimal_demos/ant/dataset.pkl \
   --airl_path=./data/halfcheetah_airl_test_1/itr_999.pkl \
   --airl \
   --seed="${loop}"

Note that flag --airl determines whether we utilize the --airl_path or --bc_agent policy to generate the trajectory. Therefore, --bc_agent is optional when --airl present. For behavior cloning policy, please refer to https://github.com/dsbrown1331/CoRL2019-DREX.

The --airl_path always provide the initial reward to criticize the generated trajectories no matter whether --airl present.

Results

The result will be available in the log dir specified.

We also provide our run results available in results/{halfcheetah/hopper/ant}/{airl/noisy_airl}_data_ssrr_{1/2/3/4/5}/noisy_dataset/ if you want to skip this step!

Code Structure

Noisy dataset generation and Sigmoid fitting are implemented in script_experiment/noisy_dataset.py.

Reward Learning

We provide SSRR and D-REX implementation.

Running

Examples of running command would be

  python script_experiment/drex.py \
   --log_dir=./results/halfcheetah/temp/drex \
   --env_id=HalfCheetah-v3 \
   --bc_trajs=./demos/suboptimal_demos/halfcheetah/dataset.pkl \
   --unseen_trajs=./demos/full_demos/halfcheetah/unseen_trajs.pkl \
   --noise_injected_trajs=./results/halfcheetah/temp/noisy_dataset/prebuilt.pkl \
   --seed="${loop}"
  python script_experiment/ssrr.py \
   --log_dir=./results/halfcheetah/temp/ssrr \
   --env_id=HalfCheetah-v3 \
   --mode=train_reward \
   --noise_injected_trajs=./results/halfcheetah/temp/noisy_dataset/prebuilt.pkl \
   --bc_trajs=demos/suboptimal_demos/halfcheetah/dataset.pkl \
   --unseen_trajs=demos/full_demos/halfcheetah/unseen_trajs.pkl \
   --min_steps=50 --max_steps=500 --l2_reg=0.1 \
   --sigmoid_params_path=./results/halfcheetah/temp/noisy_dataset/fitted_sigmoid_param.pkl \
   --seed="${loop}"

The bash script also helps combining running of noisy dataset generation, sigmoid fitting, and reward learning, and repeats several times:

./airl_ssrr_drex_comparison_halfcheetah.sh

Results

The result will be available in the log dir specified.

The correlation between the predicted reward and the ground-truth reward tested on the unseen_trajs is reported at the end of running on console, or, if you are using the bash script, at the end of the d_rex.log or ssrr.log.

We also provide our run results available in results/{halfcheetah/hopper/ant}/{airl/noisy_airl}_data_ssrr_{1/2/3/4/5}/{drex/ssrr}/.

Code Structure

SSRR is implemented in script_experiment/ssrr.py, Agents/SSRRAgent.py, Datasets/NoiseDataset.py.

D-REX is implemented in script_experiment/drex.py, scrip_experiment/drex_utils.py, and script_experiment/tf_commons/ops.

Both implementations are adapted from https://github.com/dsbrown1331/CoRL2019-DREX.

Policy Learning

We utilize stable-baselines to optimize policy over the reward we learned.

Running

Before running, you should edit script_experiment/rl_utils/sac.yml to change the learned reward model directory, for example:

  env_wrapper: {"script_experiment.rl_utils.wrappers.CustomNormalizedReward": {"model_dir": "/home/zac/Programming/Zac-SSRR/results/halfcheetah/noisy_airl_data_ssrr_4/ssrr/", "ctrl_coeff": 0.1, "alive_bonus": 0.0}}

Examples of running command would be

python script_experiment/train_rl_with_learned_reward.py \
 --algo=sac \
 --env=HalfCheetah-v3 \
 --tensorboard-log=./results/HalfCheetah_custom_reward/ \
 --log-folder=./results/HalfCheetah_custom_reward/ \
 --save-freq=10000

Please note the flag --env-kwargs=terminate_when_unhealthy:False is necessary for Hopper and Ant as discussed in our paper Supplementary D.1.

Examples of running evaluation the learned policy's ground-truth reward would be

python script_experiment/test_rl_with_ground_truth_reward.py \
 --algo=sac \
 --env=HalfCheetah-v3 \
 -f=./results/HalfCheetah_custom_reward/ \
 --exp-id=1 \
 -e=5 \
 --no-render \
 --env-kwargs=terminate_when_unhealthy:False

Results

The result will be available in the log folder specified.

We also provide our run results in results/.

Code Structure

The code script_experiment/train_rl_with_learned_reward.py and utils/ call stable-baselines library to learn a policy with the learned reward function. Note that utils could not be renamed because of the rl-baselines-zoo constraint.

The codes are adjusted from https://github.com/araffin/rl-baselines-zoo.

Random Seeds

Because of the inherent stochasticity of GPU reduction operations such as mean and sum (https://github.com/tensorflow/tensorflow/issues/3103), even if we set the random seed, we cannot reproduce the exact result every time. Therefore, we encourage you to run multiple times to reduce the random effect.

If you have a nice way to get the same result each time, please let us know!

Ending Thoughts

We welcome discussions or extensions of our paper and code in Issues!

Feel free to leave a star if you like this repo!

For more exciting work our lab (CORE Robotics Lab in Georgia Institute of Technology led by Professor Matthew Gombolay), check out our website!

Code for the ICME 2021 paper "Exploring Driving-Aware Salient Object Detection via Knowledge Transfer"

TSOD Code for the ICME 2021 paper "Exploring Driving-Aware Salient Object Detection via Knowledge Transfer" Usage For training, open train_test, run p

Jinming Su 2 Dec 23, 2021
An official implementation of the Anchor DETR.

Anchor DETR: Query Design for Transformer-Based Detector Introduction This repository is an official implementation of the Anchor DETR. We encode the

MEGVII Research 276 Dec 28, 2022
SLAMP: Stochastic Latent Appearance and Motion Prediction

SLAMP: Stochastic Latent Appearance and Motion Prediction Official implementation of the paper SLAMP: Stochastic Latent Appearance and Motion Predicti

Kaan Akan 34 Dec 08, 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand Introduction We propose a generalization of leaderboards, bidimensional leader

4 Dec 03, 2022
Pytorch implementation of the paper "Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization"

Pytorch implementation of the paper "Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization"

Dongkyu Lee 4 Sep 18, 2022
This code is for eCaReNet: explainable Cancer Relapse Prediction Network.

eCaReNet This code is for eCaReNet: explainable Cancer Relapse Prediction Network. (Towards Explainable End-to-End Prostate Cancer Relapse Prediction

Institute of Medical Systems Biology 2 Jul 28, 2022
Generative Autoregressive, Normalized Flows, VAEs, Score-based models (GANVAS)

GANVAS-models This is an implementation of various generative models. It contains implementations of the following: Autoregressive Models: PixelCNN, G

MRSAIL (Mini Robotics, Software & AI Lab) 6 Nov 26, 2022
Code for NAACL 2021 full paper "Efficient Attentions for Long Document Summarization"

LongDocSum Code for NAACL 2021 paper "Efficient Attentions for Long Document Summarization" This repository contains data and models needed to reprodu

56 Jan 02, 2023
A set of examples around hub for creating and processing datasets

Examples for Hub - Dataset Format for AI A repository showcasing examples of using Hub Uploading Dataset Places365 Colab Tutorials Notebook Link Getti

Activeloop 11 Dec 14, 2022
thundernet ncnn

MMDetection_Lite 基于mmdetection 实现一些轻量级检测模型,安装方式和mmdeteciton相同 voc0712 voc 0712训练 voc2007测试 coco预训练 thundernet_voc_shufflenetv2_1.5 input shape mAP 320

DayBreak 39 Dec 05, 2022
Continuum Learning with GEM: Gradient Episodic Memory

Gradient Episodic Memory for Continual Learning Source code for the paper: @inproceedings{GradientEpisodicMemory, title={Gradient Episodic Memory

Facebook Research 360 Dec 27, 2022
Complete U-net Implementation with keras

U Net Lowered with Keras Complete U-net Implementation with keras Original Paper Link : https://arxiv.org/abs/1505.04597 Special Implementations : The

Sagnik Roy 14 Oct 10, 2022
Code for the paper "Functional Regularization for Reinforcement Learning via Learned Fourier Features"

Reinforcement Learning with Learned Fourier Features State-space Soft Actor-Critic Experiments Move to the state-SAC-LFF repository. cd state-SAC-LFF

Alex Li 10 Nov 11, 2022
CCCL: Contrastive Cascade Graph Learning.

CCGL: Contrastive Cascade Graph Learning This repo provides a reference implementation of Contrastive Cascade Graph Learning (CCGL) framework as descr

Xovee Xu 19 Dec 05, 2022
Omniscient Video Super-Resolution

Omniscient Video Super-Resolution This is the official code of OVSR (Omniscient Video Super-Resolution, ICCV 2021). This work is based on PFNL. Datase

36 Oct 27, 2022
Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency[ECCV 2020]

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency(ECCV 2020) This is an official python implementati

304 Jan 03, 2023
Official repository of DeMFI (arXiv.)

DeMFI This is the official repository of DeMFI (Deep Joint Deblurring and Multi-Frame Interpolation). [ArXiv_ver.] Coming Soon. Reference Jihyong Oh a

Jihyong Oh 56 Dec 14, 2022
SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data

SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data Au

14 Nov 28, 2022
Anomaly Localization in Model Gradients Under Backdoor Attacks Against Federated Learning

Federated_Learning This repo provides a federated learning framework that allows to carry out backdoor attacks under varying conditions. This is a ker

Arçelik ARGE Açık Kaynak Yazılım Organizasyonu 0 Nov 30, 2021
Informal Persian Universal Dependency Treebank

Informal Persian Universal Dependency Treebank (iPerUDT) Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 token

Roya Kabiri 0 Jan 05, 2022