Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption

Overview

pytorch-benchmark

Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption

Install

pip install pytorch-benchmark

Usage

import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark


model = efficientnet_b0()
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=100)

Sample results 💻

Macbook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7
device: cpu
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 6
      total: 12
    frequency: 2.60 GHz
    model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  gpus: null
  memory:
    available: 5.86 GB
    total: 16.00 GB
    used: 7.29 GB
  system:
    node: d40049
    release: 21.2.0
    system: Darwin
params: 5288548
timing:
  batch_size_1:
    on_device_inference:
      human_readable:
        batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]
        batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]
      metrics:
        batches_per_second_max: 15.478907181264278
        batches_per_second_mean: 13.528026359855625
        batches_per_second_min: 10.343281300091244
        batches_per_second_std: 1.0922382209314958
        seconds_per_batch_max: 0.09668111801147461
        seconds_per_batch_mean: 0.07443853378295899
        seconds_per_batch_min: 0.06460404396057129
        seconds_per_batch_std: 0.006458734193132054
  batch_size_8:
    on_device_inference:
      human_readable:
        batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]
        batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]
      metrics:
        batches_per_second_max: 2.4673319862230025
        batches_per_second_mean: 1.9696935126370148
        batches_per_second_min: 1.6083039834656554
        batches_per_second_std: 0.11341204895590185
        seconds_per_batch_max: 0.6217730045318604
        seconds_per_batch_mean: 0.509410228729248
        seconds_per_batch_min: 0.40529608726501465
        seconds_per_batch_std: 0.030031445467788704
Server with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU
device: cuda
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 16
      total: 32
    frequency: 3.00 GHz
    model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  gpus:
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  memory:
    available: 119.98 GB
    total: 125.78 GB
    used: 4.78 GB
  system:
    node: monster
    release: 4.15.0-167-generic
    system: Linux
max_inference_memory: 736250368
params: 5288548
post_inference_memory: 21402112
pre_inference_memory: 21402112
timing:
  batch_size_1:
    cpu_to_gpu:
      human_readable:
        batch_latency: "144.815 \xB5s +/- 16.103 \xB5s [136.614 \xB5s, 272.751 \xB5\
          s]"
        batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]
      metrics:
        batches_per_second_max: 7319.902268760908
        batches_per_second_mean: 6962.865857677197
        batches_per_second_min: 3666.3496503496503
        batches_per_second_std: 535.0581873859935
        seconds_per_batch_max: 0.0002727508544921875
        seconds_per_batch_mean: 0.00014481544494628906
        seconds_per_batch_min: 0.0001366138458251953
        seconds_per_batch_std: 1.6102982159292097e-05
    gpu_to_cpu:
      human_readable:
        batch_latency: "106.168 \xB5s +/- 17.829 \xB5s [53.167 \xB5s, 248.909 \xB5\
          s]"
        batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]
      metrics:
        batches_per_second_max: 18808.538116591928
        batches_per_second_mean: 9639.942102368092
        batches_per_second_min: 4017.532567049808
        batches_per_second_std: 1595.7983033708472
        seconds_per_batch_max: 0.00024890899658203125
        seconds_per_batch_mean: 0.00010616779327392578
        seconds_per_batch_min: 5.316734313964844e-05
        seconds_per_batch_std: 1.7829135190772566e-05
    on_device_inference:
      human_readable:
        batch_latency: "15.567 ms +/- 546.154 \xB5s [15.311 ms, 19.261 ms]"
        batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]
      metrics:
        batches_per_second_max: 65.31149174711928
        batches_per_second_mean: 64.30692850265713
        batches_per_second_min: 51.918698784442846
        batches_per_second_std: 1.9599322351815833
        seconds_per_batch_max: 0.019260883331298828
        seconds_per_batch_mean: 0.015567030906677246
        seconds_per_batch_min: 0.015311241149902344
        seconds_per_batch_std: 0.0005461537255227954
    total:
      human_readable:
        batch_latency: "15.818 ms +/- 549.873 \xB5s [15.561 ms, 19.461 ms]"
        batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]
      metrics:
        batches_per_second_max: 64.26476266356143
        batches_per_second_mean: 63.28565696640637
        batches_per_second_min: 51.38378232692614
        batches_per_second_std: 1.9198343850767468
        seconds_per_batch_max: 0.019461393356323242
        seconds_per_batch_mean: 0.01581801414489746
        seconds_per_batch_min: 0.015560626983642578
        seconds_per_batch_std: 0.0005498731526138171
  batch_size_8:
    cpu_to_gpu:
      human_readable:
        batch_latency: "805.674 \xB5s +/- 157.254 \xB5s [773.191 \xB5s, 2.303 ms]"
        batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]
      metrics:
        batches_per_second_max: 1293.3407338883749
        batches_per_second_mean: 1259.5653105357776
        batches_per_second_min: 434.23791282741485
        batches_per_second_std: 97.51424036939879
        seconds_per_batch_max: 0.002302885055541992
        seconds_per_batch_mean: 0.000805673599243164
        seconds_per_batch_min: 0.0007731914520263672
        seconds_per_batch_std: 0.0001572538140613121
    gpu_to_cpu:
      human_readable:
        batch_latency: "104.215 \xB5s +/- 12.658 \xB5s [59.605 \xB5s, 128.031 \xB5\
          s]"
        batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]
      metrics:
        batches_per_second_max: 16777.216
        batches_per_second_mean: 9806.840626578907
        batches_per_second_min: 7810.621973929236
        batches_per_second_std: 1761.6008872740726
        seconds_per_batch_max: 0.00012803077697753906
        seconds_per_batch_mean: 0.00010421514511108399
        seconds_per_batch_min: 5.9604644775390625e-05
        seconds_per_batch_std: 1.2658293070174213e-05
    on_device_inference:
      human_readable:
        batch_latency: "16.623 ms +/- 759.017 \xB5s [16.301 ms, 22.584 ms]"
        batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]
      metrics:
        batches_per_second_max: 61.346243290283894
        batches_per_second_mean: 60.25881046175457
        batches_per_second_min: 44.27827629162004
        batches_per_second_std: 2.2193085956672296
        seconds_per_batch_max: 0.02258443832397461
        seconds_per_batch_mean: 0.01662288188934326
        seconds_per_batch_min: 0.01630091667175293
        seconds_per_batch_std: 0.0007590167680596548
    total:
      human_readable:
        batch_latency: "17.533 ms +/- 836.015 \xB5s [17.193 ms, 23.896 ms]"
        batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]
      metrics:
        batches_per_second_max: 58.16374528511205
        batches_per_second_mean: 57.140338855126565
        batches_per_second_min: 41.84762740950632
        batches_per_second_std: 2.1985066663972677
        seconds_per_batch_max: 0.023896217346191406
        seconds_per_batch_mean: 0.01753277063369751
        seconds_per_batch_min: 0.017192840576171875
        seconds_per_batch_std: 0.0008360147274630088

Limitations

Usage assumptions:

  • The model has as a __call__ method that takes the sample, i.e. model(sample).
  • The Model also works if the sample had a batch size of 1 (first dimension).

Feature limitations:

  • Allocated memory uses torch.cuda.max_memory_allocated, which is only available if the model resides on a CUDA device.
  • Energy consumption can only be measured on NVIDIA Jetson platforms at the moment.

Citation

If you like the tool and use it in you research, please consider citing it:

@article{hedegaard2022torchbenchmark,
  title={PyTorch Benchmark},
  author={Lukas Hedegaard},
  journal={GitHub. Note: https://github.com/LukasHedegaard/pytorch-benchmark},
  year={2022}
}
You might also like...
SpeechNAS Better Trade off between Latency and Accuracy for Large Scale Speaker Verification
SpeechNAS Better Trade off between Latency and Accuracy for Large Scale Speaker Verification

SpeechNAS Better Trade off between Latency and Accuracy for Large Scale Speaker Verification

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects This repo contains the code of Segcache described in the followi

Demo for the paper
Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

Streaming speaker diarization Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation by Juan Manuel Coria, Hervé

Predict the latency time of the deep learning models

Deep Neural Network Prediction Step 1. Genernate random parameters and Run them sequentially : $ python3 collect_data.py -gp -ep -pp -pl pooling -num

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

This is the official repository for evaluation on the NoW Benchmark Dataset. The goal of the NoW benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D face reconstruction methods from a single image under variations in viewing angle, lighting, and common occlusions.
PyTorch implementation of Algorithm 1 of "On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models"

Code for On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models This repository will reproduce the main results from our pape

PyTorch code accompanying our paper on Maximum Entropy Generators for Energy-Based Models

Maximum Entropy Generators for Energy-Based Models All experiments have tensorboard visualizations for samples / density / train curves etc. To run th

In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.
In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.

Contrastive Learning of Object Representations Supervisor: Prof. Dr. Gemma Roig Institutions: Goethe University CVAI - Computational Vision & Artifici

Comments
  • torch cuda synchronize on GPUs?

    torch cuda synchronize on GPUs?

    Hello,

    Very happy to see your repo.

    I have tested the code and found that for the GPU tests, there may lack of torch synchronize when computing the device time. I am not sure how this may impact the results but I think it would make difference.

    What do you think?

    Best,

    opened by jizongFox 1
Releases(0.3.5)
Owner
Lukas Hedegaard
PhD Student | AI Researcher | Open Source Contributor
Lukas Hedegaard
YOLOV4运行在嵌入式设备上

在嵌入式设备上实现YOLO V4 tiny 在嵌入式设备上实现YOLO V4 tiny 目录结构 目录结构 |-- YOLO V4 tiny |-- .gitignore |-- LICENSE |-- README.md |-- test.txt |-- t

Liu-Wei 6 Sep 09, 2021
GrailQA: Strongly Generalizable Question Answering

GrailQA is a new large-scale, high-quality KBQA dataset with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.). It ca

OSU DKI Lab 76 Dec 21, 2022
Self-supervised learning on Graph Representation Learning (node-level task)

graph_SSL Self-supervised learning on Graph Representation Learning (node-level task) How to run the code To run GRACE, sh run_GRACE.sh To run GCA, sh

Namkyeong Lee 3 Dec 31, 2021
Easy-to-use library to boost AI inference leveraging state-of-the-art optimization techniques.

NEW RELEASE How Nebullvm Works • Tutorials • Benchmarks • Installation • Get Started • Optimization Examples Discord | Website | LinkedIn | Twitter Ne

Nebuly 1.7k Dec 31, 2022
Code for the CVPR 2021 paper: Understanding Failures of Deep Networks via Robust Feature Extraction

Welcome to Barlow Barlow is a tool for identifying the failure modes for a given neural network. To achieve this, Barlow first creates a group of imag

Sahil Singla 33 Dec 05, 2022
Milano is a tool for automating hyper-parameters search for your models on a backend of your choice.

Milano (This is a research project, not an official NVIDIA product.) Documentation https://nvidia.github.io/Milano Milano (Machine learning autotuner

NVIDIA Corporation 147 Dec 17, 2022
Generating Images with Recurrent Adversarial Networks

Generating Images with Recurrent Adversarial Networks Python (Theano) implementation of Generating Images with Recurrent Adversarial Networks code pro

Daniel Jiwoong Im 121 Sep 08, 2022
Constrained Logistic Regression - How to apply specific constraints to logistic regression's coefficients

Constrained Logistic Regression Sample implementation of constructing a logistic regression with given ranges on each of the feature's coefficients (v

1 Dec 29, 2021
We are More than Our JOints: Predicting How 3D Bodies Move

We are More than Our JOints: Predicting How 3D Bodies Move Citation This repo contains the official implementation of our paper MOJO: @inproceedings{Z

72 Oct 20, 2022
Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021 The code for training mCOLT/mRASP2, a multilingua

104 Jan 01, 2023
[NeurIPS 2021] Garment4D: Garment Reconstruction from Point Cloud Sequences

Garment4D [PDF] | [OpenReview] | [Project Page] Overview This is the codebase for our NeurIPS 2021 paper Garment4D: Garment Reconstruction from Point

Fangzhou Hong 112 Dec 23, 2022
Python/Rust implementations and notes from Proofs Arguments and Zero Knowledge

What is this? This is where I'll be collecting resources related to the Study Group on Dr. Justin Thaler's Proofs Arguments And Zero Knowledge Book. T

Thor 66 Jan 04, 2023
This is the workbook I created while I was studying for the Qiskit Associate Developer exam. I hope this becomes useful to others as it was for me :)

A Workbook for the Qiskit Developer Certification Exam Hello everyone! This is Bartu, a fellow Qiskitter. I have recently taken the Certification exam

Bartu Bisgin 66 Dec 10, 2022
Motion planning algorithms commonly used on autonomous vehicles. (path planning + path tracking)

Overview This repository implemented some common motion planners used on autonomous vehicles, including Hybrid A* Planner Frenet Optimal Trajectory Hi

Huiming Zhou 1k Jan 09, 2023
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge This is an implementation of the paper,

Mutian He 19 Oct 14, 2022
🥈78th place in Riiid Answer Correctness Prediction competition

Riiid Answer Correctness Prediction Introduction This repository is the code that placed 78th in Riiid Answer Correctness Prediction competition. Requ

Jungwoo Park 10 Jul 14, 2022
Official PyTorch implementation of StyleGAN3

Modified StyleGAN3 Repo Changes Made tied to python 3.7 syntax .jpgs instead of .pngs for training sample seeds to recreate the 1024 training grid wit

Derrick Schultz (he/him) 83 Dec 15, 2022
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).

Splitter ⠀⠀ A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019). Abstract Recent inte

Benedek Rozemberczki 201 Nov 09, 2022
Pytorch implementation of VAEs for heterogeneous likelihoods.

Heterogeneous VAEs Beware: This repository is under construction 🛠️ Pytorch implementation of different VAE models to model heterogeneous data. Here,

Adrián Javaloy 35 Nov 29, 2022
Robotics environments

Robotics environments Details and documentation on these robotics environments are available in OpenAI's blog post and the accompanying technical repo

Farama Foundation 121 Dec 28, 2022