rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learning benchmarks.

Last update: Jan 01, 2023

Overview

`rliable`

Desideratum	Current evaluation approach	Our recommendation
Uncertainty in aggregate performance	Point estimates: ignore statistical uncertainty; hinder reproducibility of results	Interval estimates using stratified bootstrap confidence intervals (CIs)
Performance variability across tasks and runs	Tables with task mean scores: overwhelming beyond a few tasks; standard deviations frequently omitted; incomplete picture for multimodal and heavy-tailed distributions	Score distributions (performance profiles): show tail distribution of scores on combined runs across tasks; allow qualitative comparisons; make it easy to read any score percentile
Aggregate metrics for summarizing benchmark performance	Mean: often dominated by performance on outlier tasks. Median: statistically inefficient (requires a large number of runs to claim improvements); poor indicator of overall performance (zero scores on nearly half the tasks do not change it)	Interquartile Mean (IQM) across all runs: performance on the middle 50% of combined runs; robust to outlier scores but more statistically efficient than the median. To show other aspects of performance gains, report probability of improvement and optimality gap

rliable provides support for:

Stratified Bootstrap Confidence Intervals (CIs)
Performance Profiles (with plotting functions)
Aggregate metrics
- Interquartile Mean (IQM) across all runs
- Optimality Gap
- Probability of Improvement
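
To make two of the aggregate metrics listed above concrete, here is a minimal NumPy sketch. The function names `iqm` and `optimality_gap` and the default threshold `gamma=1.0` are illustrative assumptions for this example, not rliable's API; the library's own implementations live in `rliable.metrics`.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all scores combined."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of scores dropped from each tail
    return float(flat[cut:flat.size - cut].mean())


def optimality_gap(scores, gamma=1.0):
    """Average amount by which scores fall short of the threshold `gamma`."""
    clipped = np.minimum(np.asarray(scores, dtype=float), gamma)
    return float(gamma - clipped.mean())


# Toy score matrix of shape (num_runs x num_tasks) with one outlier score.
toy_scores = np.array([[0.0, 1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0, 100.0]])
print(np.mean(toy_scores))  # 15.125, dominated by the outlier
print(iqm(toy_scores))      # 3.5, the outlier is trimmed away
print(optimality_gap(np.array([0.5, 1.5, 0.0, 2.0])))  # 0.375
```

In this toy example the single outlier score of 100 pulls the mean far above every other score, while the IQM is unaffected by it.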

Interactive colab

We provide a colab at bit.ly/statistical_precipice_colab, which shows how to use the library with examples of published algorithms on widely used benchmarks including Atari 100k, ALE, DM Control and Procgen.

Paper

For more details, refer to the accompanying NeurIPS 2021 paper (Oral): Deep Reinforcement Learning at the Edge of the Statistical Precipice.

Installation

To install rliable, run:

pip install -U rliable

To install the latest version of rliable directly from the GitHub repository, run:

pip install git+https://github.com/google-research/rliable

To import rliable, we suggest:

from rliable import library as rly
from rliable import metrics
from rliable import plot_utils

Aggregate metrics with 95% Stratified Bootstrap CIs

IQM, Optimality Gap, Median, Mean

import numpy as np

algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size `(num_runs x num_games)`.
atari_200m_normalized_score_dict = ...
aggregate_func = lambda x: np.array([
  metrics.aggregate_median(x),
  metrics.aggregate_iqm(x),
  metrics.aggregate_mean(x),
  metrics.aggregate_optimality_gap(x)])
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
  atari_200m_normalized_score_dict, aggregate_func, reps=50000)
fig, axes = plot_utils.plot_interval_estimates(
  aggregate_scores, aggregate_score_cis,
  metric_names=['Median', 'IQM', 'Mean', 'Optimality Gap'],
  algorithms=algorithms, xlabel='Human Normalized Score')
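
For intuition about what a stratified bootstrap CI is, the following is a rough NumPy sketch of a percentile bootstrap that resamples runs with replacement independently within each task (each task is one stratum). It is not the library's implementation, and the function name `stratified_bootstrap_ci`, its parameters, and the synthetic scores are assumptions made for this example only.

```python
import numpy as np


def stratified_bootstrap_ci(scores, statistic, reps=2000, alpha=0.05, seed=0):
    """Percentile CI for `statistic` on a (num_runs x num_tasks) score matrix."""
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    task_idx = np.arange(num_tasks)
    estimates = np.empty(reps)
    for i in range(reps):
        # Draw run indices with replacement, separately for every task.
        run_idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        # Entry [r, t] of the resample is scores[run_idx[r, t], t].
        estimates[i] = statistic(scores[run_idx, task_idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return statistic(scores), (lower, upper)


# Synthetic scores: 5 runs on 10 tasks (made-up data, for illustration only).
synthetic_scores = np.random.default_rng(1).normal(1.0, 0.3, size=(5, 10))
point, (lower, upper) = stratified_bootstrap_ci(synthetic_scores, np.mean)
print(f'mean = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]')
```

Resampling within each task keeps every task represented in every bootstrap sample, so the interval reflects run-to-run variability rather than variability in which tasks were drawn.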

Probability of Improvement

# Load ProcGen scores as a dictionary containing pairs of normalized score
# matrices for pairs of algorithms we want to compare
procgen_algorithm_pairs = {.. , 'x,y': (score_x, score_y), ..}
average_probabilities, average_prob_cis = rly.get_interval_estimates(
  procgen_algorithm_pairs, metrics.probability_of_improvement, reps=50000)
plot_utils.plot_probability_of_improvement(average_probabilities, average_prob_cis)
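
As a rough illustration of the quantity being estimated, the sketch below computes, for each task, the fraction of run pairs in which algorithm X scores higher than algorithm Y (counting ties as half), then averages over tasks. This is a hand-rolled NumPy version for intuition only; the function name `probability_of_improvement_sketch` and the toy scores are assumptions, and the library's own function is `metrics.probability_of_improvement`.

```python
import numpy as np


def probability_of_improvement_sketch(scores_x, scores_y):
    """Average over tasks of P(X > Y), with ties counted as one half.

    scores_x has shape (num_runs_x x num_tasks) and scores_y has shape
    (num_runs_y x num_tasks).
    """
    x = scores_x[:, None, :]  # shape (num_runs_x, 1, num_tasks)
    y = scores_y[None, :, :]  # shape (1, num_runs_y, num_tasks)
    wins = (x > y).mean(axis=(0, 1))   # per-task fraction of pairs X wins
    ties = (x == y).mean(axis=(0, 1))  # per-task fraction of tied pairs
    return float(np.mean(wins + 0.5 * ties))


# Toy example with a single task and three runs per algorithm.
toy_x = np.array([[1.0], [2.0], [3.0]])
toy_y = np.array([[0.0], [2.0], [4.0]])
# 4 of the 9 run pairs are wins for X and 1 is a tie: approximately 0.5.
print(probability_of_improvement_sketch(toy_x, toy_y))
```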

Sample Efficiency Curve

algorithms = ['DQN (Nature)', 'DQN (Adam)', 'C51', 'REM', 'Rainbow',
              'IQN', 'M-IQN', 'DreamerV2']
# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices across all 200 million frames, each of which is of size
# `(num_runs x num_games x 200)` where scores are recorded every million frames.
ale_all_frames_scores_dict = ...
frames = np.array([1, 10, 25, 50, 75, 100, 125, 150, 175, 200]) - 1
ale_frames_scores_dict = {algorithm: score[:, :, frames] for algorithm, score
                          in ale_all_frames_scores_dict.items()}
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores[..., frame])
                               for frame in range(scores.shape[-1])])
iqm_scores, iqm_cis = rly.get_interval_estimates(
  ale_frames_scores_dict, iqm, reps=50000)
plot_utils.plot_sample_efficiency_curve(
    frames+1, iqm_scores, iqm_cis, algorithms=algorithms,
    xlabel=r'Number of Frames (in millions)',
    ylabel='IQM Human Normalized Score')

Performance Profiles

import matplotlib.pyplot as plt
import seaborn as sns

# Load ALE scores as a dictionary mapping algorithms to their human normalized
# score matrices, each of which is of size `(num_runs x num_games)`.
atari_200m_normalized_score_dict = ...
# Human normalized score thresholds
atari_200m_thresholds = np.linspace(0.0, 8.0, 81)
score_distributions, score_distributions_cis = rly.create_performance_profile(
    atari_200m_normalized_score_dict, atari_200m_thresholds)
# Plot score distributions
fig, ax = plt.subplots(ncols=1, figsize=(7, 5))
plot_utils.plot_performance_profiles(
  score_distributions, atari_200m_thresholds,
  performance_profile_cis=score_distributions_cis,
  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
  xlabel=r'Human Normalized Score $(\tau)$',
  ax=ax)

The above profile can also be plotted with non-linear scaling as follows:

plot_utils.plot_performance_profiles(
  score_distributions, atari_200m_thresholds,
  performance_profile_cis=score_distributions_cis,
  use_non_linear_scaling=True,
  xticks=[0.0, 0.5, 1.0, 2.0, 4.0, 8.0],
  colors=dict(zip(algorithms, sns.color_palette('colorblind'))),
  xlabel=r'Human Normalized Score $(\tau)$',
  ax=ax)
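
For intuition about what a performance profile measures, here is a small NumPy sketch that computes, for each threshold tau, the fraction of all runs (combined across tasks) whose score exceeds tau. The function name `score_distribution_sketch` and the toy scores are assumptions for this example; it omits the confidence intervals that `rly.create_performance_profile` returns.

```python
import numpy as np


def score_distribution_sketch(scores, thresholds):
    """Fraction of (run, task) scores strictly above each threshold."""
    scores = np.asarray(scores, dtype=float)
    return np.array([np.mean(scores > tau) for tau in thresholds])


# Toy score matrix of shape (num_runs x num_tasks).
toy_profile_scores = np.array([[0.5, 1.5],
                               [2.5, 0.2]])
toy_thresholds = np.array([0.0, 1.0, 2.0])
profile = score_distribution_sketch(toy_profile_scores, toy_thresholds)
print(profile)  # fractions above 0.0, 1.0 and 2.0: 1.0, 0.5 and 0.25
```

Because the profile is a fraction of scores above each threshold, it is non-increasing in tau, and any score percentile can be read off from where the curve crosses the corresponding fraction.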

Dependencies

The code was tested under Python>=3.7 and uses these packages:

arch >= 4.19
scipy >= 1.7.0
numpy >= 0.9.0
absl-py >= 1.16.4

Citing

If you find this open source release useful, please reference in your paper:

@article{agarwal2021deep,
  title={Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author={Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
          and Courville, Aaron and Bellemare, Marc G},
  journal={Advances in Neural Information Processing Systems},
  year={2021}
}

Disclaimer: This is not an official Google product.

Comments

RAD results may be incorrect.

Hi @agarwl. I found that the 'step' in RAD's 'eval.log' refers to the policy step, but the 'step' in 'xxx--eval_scores.npy' refers to the environment step. We know that 'environment step = policy step * action_repeat'.

Here comes a problem: if you use the results at 100k steps in 'eval.log', then you actually evaluate the scores at 100k*action_repeat environment steps. This will lead to an overestimation of RAD. I wonder whether you made such incorrect evaluations, or took the results from 'xxx--eval_scores.npy', which are correct in terms of 'steps'. You may refer to a similar question in https://github.com/MishaLaskin/rad/issues/15.

I reproduced the results of RAD locally, and I found my results are much worse than the reported ones (in your paper). I list them in the following figure.

I compare the means of each task. Obviously, there is a huge gap, and my results are close to the ones reported by DrQ authors (see the Table in https://github.com/MishaLaskin/rad/issues/1). I guess you may evaluate scores at incorrect environment steps? So, could you please offer more details when evaluating RAD? Thanks :)

opened by TaoHuang13 19

Installation fails on MacBook Pro with M1 chip

The installation fails on my MacBook Pro with M1 chip.

I also tried on a MacBook Pro with an Intel chip (and the same OS version) and on a Linux system: the installation was successful on both configurations.

$ cd rliable
$ pip install -e .
Obtaining file:///Users/quentingallouedec/rliable
  Preparing metadata (setup.py) ... done
Collecting arch==5.0.1
  Using cached arch-5.0.1.tar.gz (937 kB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

... # Log too long for GitHub issue

error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

System info

Python version: 3.9
System Version: macOS 12.4 (21F79)
Kernel Version: Darwin 21.5.0

What I've tried

Install only arch 5.0.1

It seems to be related to the installation of arch. I've tried to pip install arch==5.0.1 and it also failed with the same logs.

Install the latest version of arch

I've tried to pip install arch (current version: 5.2.0), and it worked.

Use `rliable` with the latest version of `arch`

Since I can install arch==5.2.0, I've tried to make rliable work with arch 5.2.0 (by manually modifying setup.py). Pytest failed. Here are the logs for one of the failing unit tests:

_____________________________________________ LibraryTest.test_stratified_bootstrap_runs_and_tasks _____________________________________________

self = <library_test.LibraryTest testMethod=test_stratified_bootstrap_runs_and_tasks>, task_bootstrap = True

    @parameterized.named_parameters(
        dict(testcase_name="runs_only", task_bootstrap=False),
        dict(testcase_name="runs_and_tasks", task_bootstrap=True))
    def test_stratified_bootstrap(self, task_bootstrap):
      """Tests StratifiedBootstrap."""
      bs = rly.StratifiedBootstrap(
          self._x, y=self._y, z=self._z, task_bootstrap=task_bootstrap)
>     for data, kwdata in bs.bootstrap(5):

tests/rliable/library_test.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
env/lib/python3.9/site-packages/arch/bootstrap/base.py:694: in bootstrap
    yield self._resample()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Stratified Bootstrap(no. pos. inputs: 1, no. keyword inputs: 2, ID: 0x15b353a00)

    def _resample(self) -> Tuple[Tuple[ArrayLike, ...], Dict[str, ArrayLike]]:
        """
        Resample all data using the values in _index
        """
        indices = self._index
>       assert isinstance(indices, np.ndarray)
E       AssertionError

env/lib/python3.9/site-packages/arch/bootstrap/base.py:1294: AssertionError
_______________________________________________ LibraryTest.test_stratified_bootstrap_runs_only ________________________________________________

self = <library_test.LibraryTest testMethod=test_stratified_bootstrap_runs_only>, task_bootstrap = False

    @parameterized.named_parameters(
        dict(testcase_name="runs_only", task_bootstrap=False),
        dict(testcase_name="runs_and_tasks", task_bootstrap=True))
    def test_stratified_bootstrap(self, task_bootstrap):
      """Tests StratifiedBootstrap."""
      bs = rly.StratifiedBootstrap(
          self._x, y=self._y, z=self._z, task_bootstrap=task_bootstrap)
>     for data, kwdata in bs.bootstrap(5):

tests/rliable/library_test.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
env/lib/python3.9/site-packages/arch/bootstrap/base.py:694: in bootstrap
    yield self._resample()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Stratified Bootstrap(no. pos. inputs: 1, no. keyword inputs: 2, ID: 0x15b2ff1f0)

    def _resample(self) -> Tuple[Tuple[ArrayLike, ...], Dict[str, ArrayLike]]:
        """
        Resample all data using the values in _index
        """
        indices = self._index
>       assert isinstance(indices, np.ndarray)
E       AssertionError

env/lib/python3.9/site-packages/arch/bootstrap/base.py:1294: AssertionError

It seems like there are breaking changes between arch 5.0.1 and arch 5.2.0. Maybe this issue can be solved by updating this dependency to its current version.

opened by qgallouedec 10

Bug in plot_utils.py

Hi,

In plot_utils.py, I think this line ought to be algorithms = list(point_estimates.keys()) https://github.com/google-research/rliable/blob/72fc16c31c4021b72e7b21f3ba915e1b38cff481/rliable/plot_utils.py#L245 Otherwise, algorithms cannot be indexed in the next line.

opened by zhefan 2

Question about documentation in probability_of_improvement

Hi, I wonder if the documentation in probability_of_improvement function in metrics.py is wrong? Specifically,

scores_x: A matrix of size (num_runs_x x num_tasks) where scores_x[m][n] represent the score on run n of task m for algorithm X. https://github.com/google-research/rliable/blob/cc5eff51cab488b34cfeb5c5e37eae7a6b4a92b2/rliable/metrics.py#L77)

Should scores_x[n][m] be the score on run n of task m for algorithm X?

Thanks.

opened by zhefan 2

Downloading data set always stuck

Thanks for sharing the repo. Every time I download the dataset, it gets stuck somewhere at 9X%. Do you know what might cause this?

...
Copying gs://rl-benchmark-data/atari_100k/SimPLe.json...
Copying gs://rl-benchmark-data/atari_100k/OTRainbow.json...
[55/59 files][ 2.9 MiB/ 3.0 MiB] 98% Done

opened by HYDesmondLiu 2

Fix dict_keys object -> list

This fixes a downstream task where algorithms[0] in the following line fails because point_estimates.keys() returns a dict_keys object, not a subscriptable list.

opened by jjshoots 1

How can I access the data directly without using gsutil?

I haven't got gsutil set up on my M1 MacBook and I'm not sure the steps are super streamlined. Can I somehow access the data from my browser or download it another way?

opened by slerman12 1

Add installation of compatible arch version to notebook

Latest arch version raises an exception when calling create_performance_profile. Adding !pip install arch==5.0.1 to the notebook file resolves the issue. This change should be reflected in the hosted colab notebook.

opened by Aladoro 1

Customisable linestyles in performance profile plots

The primary reason for this PR is an added option for customising linestyles in performance profile plots. It works in exactly the same way as the colors parameter it already had; a map, None by default which means all methods are plotted as solid lines, but a map can be passed in to change the linestyles of every method's plot.

Here you can see, as an example, a plot I'm currently working on where I'm using this functionality to have some methods plotted as dotted lines instead of solid ones:

Additionally, I have added a .gitignore file to ignore some files that were automatically created when I installed rliable with pip from local source code in my own fork of the repo, and files created by working with rliable source code in the PyCharm IDE.

opened by DennisSoemers 1

README image link broken: ale_score_distributions_new.png

It seems that the file images/ale_score_distributions_new.png pointed to in the README (https://github.com/google-research/rliable#performance-profiles) was deleted in one of the recent commits.

opened by nirbhayjm 1

Urgent question about data aggregates

Hi, we compiled the Atari 100k results from DrQ, CURL, and DER, and the mean/median human-norm scores are well below those reported in prior works, including from co-authors of the rliable paper.

We have median human-norm scores all around 0.10 - 0.12.

Is this accurate? Of all of these, DER (the oldest of the algs) has the highest mean human-norm score.

opened by slerman12 1

Releases(v1.0)

Owner

Google Research

GitHub Repository https://agarwl.github.io/rliable
