Convert BART models to ONNX with quantization: a 3X reduction in model size and up to a 3X boost in inference speed.

Last update: Dec 09, 2022

Overview

fast-Bart

Reduces BART model size by 3X and boosts inference speed by up to 3X.

A BART implementation based on the fastT5 library (https://github.com/Ki6an/fastT5).

PyTorch model -> ONNX model -> quantized ONNX model
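
The last arrow in the pipeline is where the size reduction comes from: 8-bit quantization stores each float32 weight (4 bytes) as a single byte plus a small amount of shared metadata. The sketch below illustrates that idea in plain Python. It is not fastBart's code: quantize_weights and dequantize_weights are hypothetical helpers, and the affine (scale, zero point) scheme is just one common way to do 8-bit quantization.

```python
import struct

def quantize_weights(weights):
    """Map floats to unsigned 8-bit integers with an affine (scale, zero_point) scheme."""
    lo, hi = min(weights), max(weights)
    # Make sure 0.0 is representable, and avoid a zero scale for constant tensors.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return bytes(q), scale, zero_point

def dequantize_weights(q, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(v - zero_point) * scale for v in q]

weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 1.5, 2.0]
q, scale, zero_point = quantize_weights(weights)
restored = dequantize_weights(q, scale, zero_point)

float32_size = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per weight
int8_size = len(q)                                             # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"float32: {float32_size} bytes, int8: {int8_size} bytes, max error: {max_error:.4f}")
```

The quantized weights alone are 4X smaller, at the cost of a rounding error of roughly half a quantization step per weight. A whole-model figure can come out lower than 4X if some parts of the model stay in float32; that is an assumption here, not something this README states.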

Install

Install from the requirements.txt file:

git clone https://github.com/siddharth-sharma7/fast-Bart
cd fast-Bart
pip install -r requirements.txt

Usage

The export_and_get_onnx_model() method exports the given pretrained BART model to ONNX, quantizes it, and runs it on ONNX Runtime with default settings. The returned model supports the Hugging Face generate() method.

If you don't want to quantize the model, pass quantized=False to the method.

from fastBart import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'facebook/bart-base'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "This is a very long sentence and needs to be summarized."  # avoid shadowing the built-in input()
token = tokenizer(text, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=3)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().

You can also customize the whole pipeline, as shown in the code example below:

from fastBart import (OnnxBart, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 'facebook/bart-base'

# Step 1. Convert the Hugging Face BART model to ONNX
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (Recommended) Quantize the converted model for faster inference and a smaller model size.
# This is slow for the decoder and init-decoder ONNX files (can take up to 15 minutes).
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the ONNX Runtime sessions
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model
model = OnnxBart(model_or_model_path, model_sessions)

                      ...

Custom output paths

By default, fastBart creates a models-bart folder in the current directory and stores all the models there. You can provide a custom folder path to store the exported models instead. To run already exported models stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastBart import export_and_get_onnx_model, get_onnx_model

model_name = "facebook/bart-base"
custom_output_path = "/path/to/custom/folder/"

# 1. Export the model and store the files in custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. Run already exported models that are stored in the custom path
# model = get_onnx_model(model_name, custom_output_path)
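
To check what ended up in the output folder, a small standard-library helper can list the .onnx files and their sizes. This is a sketch, not part of fastBart: list_onnx_files and total_size_mb are hypothetical names, and the only assumption about the export is that it writes files with an .onnx extension directly into the folder.

```python
from pathlib import Path

def list_onnx_files(folder):
    """Return (file name, size in MB) for every .onnx file directly inside folder."""
    return sorted(
        (path.name, path.stat().st_size / 1e6)
        for path in Path(folder).glob("*.onnx")
    )

def total_size_mb(folder):
    """Sum the sizes of all .onnx files in folder, in MB."""
    return sum(size for _, size in list_onnx_files(folder))

# Example (assumes the default output folder mentioned above exists):
#   for name, size_mb in list_onnx_files("models-bart"):
#       print(f"{name}: {size_mb:.1f} MB")
```

Comparing total_size_mb() for a quantized and a non-quantized export is one way to see the size reduction on your own model.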

Functionalities

Export any pretrained BART model to ONNX easily.
The exported model supports beam search, greedy search, and more via the generate() method.
Reduce the model size by 3X using quantization.
Up to 3X speedup over PyTorch execution for greedy search, and 2-3X for beam search.
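
Speedup figures like these depend on hardware, sequence length, and generation settings, so it is worth measuring them on your own inputs. The sketch below is one way to do that with the standard library. time_call and speedup are hypothetical helpers, not part of fastBart, and the two stand-in workloads would be replaced by calls to the PyTorch model's generate() and the ONNX model's generate().

```python
import time

def time_call(fn, repeats=5, warmup=1):
    """Return the best wall-clock time in seconds over `repeats` calls to fn."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def speedup(baseline_fn, candidate_fn, **kwargs):
    """How many times faster candidate_fn is than baseline_fn."""
    return time_call(baseline_fn, **kwargs) / time_call(candidate_fn, **kwargs)

# Stand-in workloads. In practice, replace them with something like
#   lambda: pytorch_model.generate(**inputs)  and  lambda: onnx_model.generate(**inputs)
# where pytorch_model, onnx_model, and inputs are names you define yourself.
slow = lambda: sum(i * i for i in range(200_000))
fast = lambda: sum(i * i for i in range(20_000))
ratio = speedup(slow, fast)
print(f"speedup: {ratio:.1f}X")
```

Taking the minimum over several runs reduces noise from other processes. Use the same inputs and the same num_beams for both models so the comparison is like for like.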

Owner

Siddharth Sharma
