Rainbow: Combining Improvements in Deep Reinforcement Learning

Overview

Rainbow

MIT License

Rainbow: Combining Improvements in Deep Reinforcement Learning [1].

Results and pretrained models can be found in the releases.

  • DQN [2]
  • Double DQN [3]
  • Prioritised Experience Replay [4]
  • Dueling Network Architecture [5]
  • Multi-step Returns [6]
  • Distributional RL [7]
  • Noisy Nets [8]

Run the original Rainbow with the default arguments:

python main.py

Data-efficient Rainbow [9] can be run using the following options (note that the "unbounded" memory is implemented here in practice by manually setting the memory capacity to be the same as the maximum number of timesteps):

python main.py --target-update 2000 \
               --T-max 100000 \
               --learn-start 1600 \
               --memory-capacity 100000 \
               --replay-frequency 1 \
               --multi-step 20 \
               --architecture data-efficient \
               --hidden-size 256 \
               --learning-rate 0.0001 \
               --evaluation-interval 10000

Note that pretrained models from the 1.3 release used a (slightly) incorrect network architecture. To use these, change the padding in the first convolutional layer from 0 to 1 (DeepMind uses "valid" (no) padding).

Requirements

To install all dependencies with Anaconda run conda env create -f environment.yml and use source activate rainbow to activate the environment.

Available Atari games can be found in the atari-py ROMs folder.

Acknowledgements

References

[1] Rainbow: Combining Improvements in Deep Reinforcement Learning
[2] Playing Atari with Deep Reinforcement Learning
[3] Deep Reinforcement Learning with Double Q-learning
[4] Prioritized Experience Replay
[5] Dueling Network Architectures for Deep Reinforcement Learning
[6] Reinforcement Learning: An Introduction
[7] A Distributional Perspective on Reinforcement Learning
[8] Noisy Networks for Exploration
[9] When to Use Parametric Models in Reinforcement Learning?

Comments
  • Prioritised Experience Replay

    Prioritised Experience Replay

    I am interested in implementing Rainbow too. I didn't go deep in code for the moment, but I just saw on the Readme.md that Prioritised Experience Replay is not checked. Will this feature be implemented or it is maybe already working? On their paper, Deepmind are actually showing that Prioritized Experience Replay is the most important feature, that means the "no priority" got the bigger performance gap with the full Rainbow.

    bug help wanted 
    opened by marintoro 28
  • Replicating DeepMind results

    Replicating DeepMind results

    As of 5c252ea, this repo has been checked over several times for discrepancies, but is still unable to replicate DeepMind's results. This issue is to discuss any further points that may need fixing.

    • [x] Should the loss be averaged or summed over the minibatch?
    • [x] Should noisy network updating use independent noise per transition in the batch [v1] or the same noise but another noise sample for action selection [v2]?
    • [x] Is the max priority over all time, or just from the current buffer (may be the former)? Results and paper indicate former.
    • [x] Are priorities added as δ, or δ + ε (ε may not be needed with a KL loss)? One single ablation run indicates adding ε causes performance to drop more at end of training. δ + ε shouldn't be needed with a KL loss,
    • [x] Most people implement PER by adding priorities already multiplied by α, but the maths indicates that the raw values should be stored and sampling should be done with respect to everything to the power of α? α isn't changed here - so not an issue.

    Space Invaders (averaged losses): newplot

    Space Invaders (summed losses): newplot

    help wanted 
    opened by Kaixhin 24
  • Resume support

    Resume support

    Added preliminary support for resuming. Initial testing looks like it works, but I'd appreciate if anyone else gets a chance to play with it in their setup.

    I didn't add an explicit resume flag, although we could do that. Currently, the assumption is that if you provide the --memory-save-path argument, you want the memory saved there, by default after every testing round. If you provide the --model argument and do not provide the --evaluateflag, the assumption is that you want to resume, and that --memory-save-path exists.

    Another flag we could add is a --T_start flag, akin to --T_max, in order to specify where training is resuming from to better the logging of resumed models. What do you think?

    Choosing to compress at all, and choosing to use bz2 specifically, came after a quick benchmark I did with some pickled memories I had. It drops them from ~2GB to <100 MB, and bz2 took somewhere around 2-3 minutes, while pickling without it took around 40 seconds.

    opened by guydav 14
  • Performance of release v1.0 on Space Invaders

    Performance of release v1.0 on Space Invaders

    I just launched the release v1.0 (commit 952fcb4) on Space Invaders for the whole week-end (around 25M steps). I took the exact same code with the exact same random seed. I got really lower performance than the one you are showing. Here are the plots of rewards and Q-values q_values_v1 0 reward_v1 0

    Could you explain exactly how you got your results for this release? Did you try multiple experiments with different random seed and average them or just took the best one of them? Or maybe it's a pytorch, atari_py or any other library issue? Could you give all your library version?

    opened by marintoro 13
  • Testing should be not deterministic

    Testing should be not deterministic

    There is a parameter --evaluation-episodes but in the current implementation, like we are always acting greedly, all the episodes are going to be exactly the same. I think that to get a better testing evaluation, you should add a deterministic=False when you are testing (i.e. in stead of taking the action with the higher Q value, you can sample on all the action with each Q value as the probability) .

    I implemented that on my branch on the last commit [email protected] (it's really straightforward)

    Btw I launched a training last night, everything worked properly. But I don't have access to a powerfull computer yet so the agent was still pretty poor in performance (in the early stage of training). I just wanted to know if you already launched a big training, on which game and if you compared it to a standard DRL algo (like simple DQN for example)? Because there may still be some non-breaking errors in the implementation which could be sneaky to spot and debug (I mean if the agent is learning worse than simple DQN, there must be something wrong for example).

    opened by marintoro 8
  • TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not collections.deque

    TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not collections.deque

    Traceback (most recent call last): File "main.py", line 81, in state, done = env.reset(), False File "C:\Users\simon\Desktop\DQN\RL-AlphaGO\Rainbow-master\env.py", line 53, in reset return torch.stack(self.state_buffer, 0) TypeError: stack(): argument 'tensors' (position 1) must be tuple of Tensors, not collections.deque

    Could somebody give a hand?

    opened by forhonourlx 7
  • disable env.reset() after every episode

    disable env.reset() after every episode

    Hi, May I check if I would like to keep the environment as it is after each training episode, should I just comment line line 147 in main.py or should I also comment line 130? Besides what am I supposed to do if I just want to reset the agent's position but keep the environment as it is after each training episode?

    Thank you.

    question 
    opened by zyzhang1130 5
  • TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

    TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

    ...... self.actions.get(action): 4 self.actions.get(action): 4 self.actions.get(action): 4 self.actions.get(action): 4 self.actions.get(action): 1 self.actions.get(action): 1 self.actions.get(action): 1 self.actions.get(action): 1 self.actions.get(action): None

    Traceback (most recent call last): File "main.py", line 103, in next_state, reward, done = env.step(action) # Step File "C:\Users\simon\Desktop\DQN\RL-AlphaGO\Rainbow-master\env.py", line 63, in step reward += self.ale.act(self.actions.get(action)) File "C:\Program Files\Python35\lib\site-packages\atari_py\ale_python_interface.py", line 159, in act return ale_lib.act(self.obj, int(action)) TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

    opened by forhonourlx 5
  • Unit test Prioritised Experience Replay Memory

    Unit test Prioritised Experience Replay Memory

    PER was reported to cause issues (decreasing the performance of a DQN) when ported to another codebase. Although PER can cause performance to decrease, it is still likely that there exists a bug within it.

    bug 
    opened by Kaixhin 5
  • Policy and reward function

    Policy and reward function

    Hi, There is certain thing I would like to modify for policy and reward function. May I ask where is policy stored after each epoch of training? Is there some way to call/index/assign it with some flag? Thanks for answering.

    opened by zyzhang1130 4
  • Memory capacity for example data-efficient Rainbow?

    Memory capacity for example data-efficient Rainbow?

    Hi folks,

    I'm running the data-efficient Rainbow as a baseline for a project I'm starting, and one thing isn't making sense in my head. The original Rainbow paper uses a 1M transition buffer, and comparatively, the data-efficient paper (Appendix E) claims to use an unbounded memory.

    Do you have any sense of what does an unbounded memory even mean in practice? Is there any particular reason you chose to make it smaller than the default Rainbow's memory buffer, rather than larger?

    Thank you!

    question 
    opened by guydav 4
  • A problem about one game in ALE cannot be trained

    A problem about one game in ALE cannot be trained

    Hi, Kai! I find an issue which happens when I set the game "defender" as the environment. It only displays hyper-parameter setting "args", and however, any training results aren't output, not as the same as other games.

    Thanks!

    bug 
    opened by Hugh-Cai 1
  • Stuck in memory._retrieve when batch size  > 32

    Stuck in memory._retrieve when batch size > 32

    Hi,

    I notice that RAINBOW doesn't work when the batch size is greater than 32 (I tried for 64, 128, 256), where it is stuck in memory._retrieve the recursive call. Why does this happen? Is there something that I can do about this (to increase the batch size) or batch size needs to be small?

    Thanks

    question 
    opened by jiwoongim 1
  • Is the evluation procedure different?

    Is the evluation procedure different?

    Hi Kai,

    In the Rainbow paper, the evaluation procedure is described as

    The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play).

    However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper (I couldn't find a detailed description there).

    Thanks!

    enhancement question 
    opened by guydav 8
  • Human-expert normalized scores

    Human-expert normalized scores

    The Rainbow DQN paper uses human-expert normalized scores, so I am not sure how to evaluate the training results against the original paper. Do you know what values were used for human expert scores?

    I found snippets of the values used from papers here and there, but not sure if we can use the same number and how we can compute a single normalized value for all Atari games: image

    opened by ThisIsIsaac 4
  • Pinned memory experience replay

    Pinned memory experience replay

    A more efficient implementation would allocate a giant tensor in advance for each item (e.g. state, action) in a transition tuple, furthermore pin it (as long as the machine has enough RAM spare - should be at least 6GB?), and use asynchronous copies to GPU.

    enhancement 
    opened by Kaixhin 0
Releases(1.4)
Owner
Kai Arulkumaran
Researcher, programmer, DJ, transhumanist.
Kai Arulkumaran
The Agriculture Domain of ERPNext comes with features to record crops and land

Agriculture The Agriculture Domain of ERPNext comes with features to record crops and land, track plant, soil, water, weather analytics, and even trac

Frappe 21 Jan 02, 2023
Uni-Fold: Training your own deep protein-folding models.

Uni-Fold: Training your own deep protein-folding models. This package provides and implementation of a trainable, Transformer-based deep protein foldi

DeepModeling 88 Jan 03, 2023
Official implementation of "One-Shot Voice Conversion with Weight Adaptive Instance Normalization".

One-Shot Voice Conversion with Weight Adaptive Instance Normalization By Shengjie Huang, Yanyan Xu*, Dengfeng Ke*, Mingjie Chen, Thomas Hain. This rep

31 Dec 07, 2022
In Search of Probeable Generalization Measures

In Search of Probeable Generalization Measures Exciting News! In Search of Probeable Generalization Measures has been accepted to the International Co

Mahdi S. Hosseini 6 Sep 11, 2022
Code for "Multi-Time Attention Networks for Irregularly Sampled Time Series", ICLR 2021.

Multi-Time Attention Networks (mTANs) This repository contains the PyTorch implementation for the paper Multi-Time Attention Networks for Irregularly

The Laboratory for Robust and Efficient Machine Learning 68 Dec 17, 2022
In this project, we develop a face recognize platform based on MTCNN object-detection netcwork and FaceNet self-supervised network.

模式识别大作业——人脸检测与识别平台 本项目是一个简易的人脸检测识别平台,提供了人脸信息录入和人脸识别的功能。前端采用 html+css+js,后端采用 pytorch,

Xuhua Huang 5 Aug 02, 2022
toroidal - a lightweight transformer library for PyTorch

toroidal - a lightweight transformer library for PyTorch Toroidal transformers are of smaller size and lower weight than the more common E-I types. Th

MathInf GmbH 64 Jan 07, 2023
JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces

JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces JAXMAPP is a JAX-based library for multi-agent path planning (MAPP) in c

OMRON SINIC X 24 Dec 28, 2022
Fast, Attemptable Route Planner for Navigation in Known and Unknown Environments

FAR Planner uses a dynamically updated visibility graph for fast replanning. The planner models the environment with polygons and builds a global visi

Fan Yang 346 Dec 30, 2022
EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

EntityQuestions This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the the paper Simple Entity-ce

Princeton Natural Language Processing 119 Sep 28, 2022
Pytorch GUI(demo) for iVOS(interactive VOS) and GIS (Guided iVOS)

GUI for iVOS(interactive VOS) and GIS (Guided iVOS) GUI Implementation of CVPR2021 paper "Guided Interactive Video Object Segmentation Using Reliabili

Yuk Heo 13 Dec 09, 2022
Get the partition that a file belongs and the percentage of space that consumes

tinos_eisai_sy Get the partition that a file belongs and the percentage of space that consumes (works only with OSes that use the df command) tinos_ei

Konstantinos Patronas 6 Jan 24, 2022
Aspect-Sentiment-Multiple-Opinion Triplet Extraction (NLPCC 2021)

The code and data for the paper "Aspect-Sentiment-Multiple-Opinion Triplet Extraction" Requirements Python 3.6.8 torch==1.2.0 pytorch-transformers==1.

慢半拍 5 Jul 02, 2022
🎓Automatically Update CV Papers Daily using Github Actions (Update at 12:00 UTC Every Day)

🎓Automatically Update CV Papers Daily using Github Actions (Update at 12:00 UTC Every Day)

Realcat 270 Jan 07, 2023
Implementation supporting the ICCV 2017 paper "GANs for Biological Image Synthesis"

GANs for Biological Image Synthesis This codes implements the ICCV-2017 paper "GANs for Biological Image Synthesis". The paper and its supplementary m

Anton Osokin 95 Nov 25, 2022
Multi-Anchor Active Domain Adaptation for Semantic Segmentation (ICCV 2021 Oral)

Multi-Anchor Active Domain Adaptation for Semantic Segmentation Munan Ning*, Donghuan Lu*, Dong Wei†, Cheng Bian, Chenglang Yuan, Shuang Yu, Kai Ma, Y

Munan Ning 36 Dec 07, 2022
PyTorch implementation of MulMON

MulMON This repository contains a PyTorch implementation of the paper: Learning Object-Centric Representations of Multi-object Scenes from Multiple Vi

NanboLi 16 Nov 03, 2022
Official pytorch code for SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal

SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal This is the official pytorch code for SSAT: A Symmetric Semantic-

ForeverPupil 57 Dec 13, 2022
Fuzzy Overclustering (FOC)

Fuzzy Overclustering (FOC) In real-world datasets, we need consistent annotations between annotators to give a certain ground-truth label. However, in

2 Nov 08, 2022
Voice Conversion by CycleGAN (语音克隆/语音转换):CycleGAN-VC3

CycleGAN-VC3-PyTorch 中文说明 | English This code is a PyTorch implementation for paper: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectr

Kun Ma 110 Dec 24, 2022