ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

Overview

What is ProteinBERT?

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

ProteinBERT's deep-learning architecture is inspired by BERT, but it contains several innovations such as its global-attention layers that grow only lineraly with sequence length (compared to self-attention's quadratic growth). As a result, the model can process protein sequences of almost any length, includng extremely long protein sequences (of over tens of thousands of amino acids).

The model takes protein sequences as inputs, and can also take protein GO annotations as additional inputs (to help the model infer about the function of the input protein and update its internal representations and outputs accordingly). This package provides seamless access to a pretrained state that has been produced by training the model for 28 days over ~670M records (i.e. ~6.4 iterations over the entire training dataset of ~106M records). For users interested in pretraining the model from scratch, the package also includes scripts for that.

Installation

Dependencies

ProteinBERT requires Python 3.

Below are the Python packages required by ProteinBERT, which are automatically installed with it (and the versions of these packages that were tested with ProteinBERT 1.0.0):

  • tensorflow (2.4.0)
  • tensorflow_addons (0.12.1)
  • numpy (1.20.1)
  • pandas (1.2.3)
  • h5py (3.2.1)
  • lxml (4.3.2)
  • pyfaidx (0.5.8)

Install ProteinBERT

Just run:

pip install protein-bert

Alternatively, clone this repository and run:

python setup.py install

Using ProteinBERT

Fine-tuning ProteinBERT is very easy. You can see some working examples in this notebook.

Pretraining ProteinBERT from scratch

If, instead of using the existing pretrained model weights, you would like to train it from scratch, then follow the steps below. We warn you however that this is a long process (we pretrained the current model for a whole month), and it also requires a lot of storage (>1TB).

Step 1: Create the UniRef dataset

ProteinBERT is pretrained on a dataset derived from UniRef90. Follow these steps to produce this dataset:

  1. First, choose a working directory with sufficient (>1TB) free storage.
cd /some/workdir
  1. Download the metadata of GO from CAFA and extract it.
wget https://www.biofunctionprediction.org/cafa-targets/cafa4ontologies.zip
mkdir cafa4ontologies
unzip cafa4ontologies.zip -d cafa4ontologies/
  1. Download UniRef90, as both XML and FASTA.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.xml.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
gunzip uniref90.fasta.gz
  1. Use the create_uniref_db script provided by ProteinBERT to extract the GO annotations associated with UniRef's records into an SQLite database (and a CSV file with the metadata of these GO annotations). Since this is a long process (which can take up to a few days), it is recommended to run this in the background (e.g. using nohup).
nohup create_uniref_db --uniref-xml-gz-file=./uniref90.xml.gz --go-annotations-meta-file=./cafa4ontologies/go.txt --output-sqlite-file=./uniref_proteins_and_annotations.db --output-go-annotations-meta-csv-file=./go_annotations.csv >&! ./log_create_uniref_db.txt &
  1. Create the final dataset (in the H5 format) by merging the database of GO annotations with the protein sequences using the create_uniref_h5_dataset script provided by ProteinBERT. This is also a long process that should be let to run in the background.
nohup create_uniref_h5_dataset --protein-annotations-sqlite-db-file=./uniref_proteins_and_annotations.db --protein-fasta-file=./uniref90.fasta --go-annotations-meta-csv-file=./go_annotations.csv --output-h5-dataset-file=./dataset.h5 --min-records-to-keep-annotation=100 >&! ./log_create_uniref_h5_dataset.txt &
  1. Finally, use ProteinBERT's set_h5_testset script to designate which of the dataset records will be considered part of the test set (so that their GO annotations are not used during pretraining). If you are planning to evaluate your model on certain downstream benchmarks, it is recommended that any UniRef record similar to a test-set protein in these benchmark will be considered part of the pretraining's test set. You can use BLAST to find all of these UniRef records and provide them to set_h5_testset through the flag --uniprot-ids-file=./uniref_90_seqs_matching_test_set_seqs.txt, where the provided text file contains the UniProt IDs of the relevant records, one per line (e.g. A0A009EXK6_ACIBA).
set_h5_testset --h5-dataset-file=./dataset.h5

Step 2: Pretrain ProteinBERT on the UniRef dataset

Once you have the dataset ready, the pretrain_proteinbert script will train a ProteinBERT model on that dataset.

Basic use of the pretraining script looks as follows:

mkdir -p ~/proteinbert_models/new
nohup pretrain_proteinbert --dataset-file=./dataset.h5 --autosave-dir=~/proteinbert_models/new >&! ~/proteinbert_models/log_new_pretraining.txt &

By running that, ProteinBERT will continue to train indefinitely. Therefore, make sure to run it in the background using nohup or other options. Every given number of epochs (determined as 100 batches) the model state will be automatically saved into the specified autosave directory. If this process is interrupted and you wish to resume pretraining from a given snapshot (e.g. the most up-to-date state file within the autosave dir) use the --resume-from flag (provide it the state file that you wish to resume from).

pretrain_proteinbert has MANY options and hyper-parameters that are worth checking out:

pretrain_proteinbert --help

Step 3: Use your pretrained model state when fine-tuning ProteinBERT

Normally the function load_pretrained_model is used to load the existing pretrained model state. If you wish to load your own pretrained model state instead, then use the load_pretrained_model_from_dump function instead.

License

ProteinBERT is a free open-source project available under the MIT License.

Cite us

If you use ProteinBERT as part of a work contributing to a scientific publication, we ask that you cite our paper: Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv (2021). https://doi.org/10.1101/2021.05.24.445464

Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022
Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022
Ceaser-Cipher - The Caesar Cipher technique is one of the earliest and simplest method of encryption technique

Ceaser-Cipher The Caesar Cipher technique is one of the earliest and simplest me

Lateefah Ajadi 2 May 12, 2022
🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

PAUSE: Positive and Annealed Unlabeled Sentence Embedding Sentence embedding refers to a set of effective and versatile techniques for converting raw

EQT 21 Dec 15, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
Journalism AI – Quotes extraction for modular journalism

Quote extraction for modular journalism (JournalismAI collab 2021)

Journalism AI collab 2021 207 Dec 25, 2022
this repository has datasets containing information of Uber pickups in NYC from April 2014 to September 2014 and January to June 2015. data Analysis , virtualization and some insights are gathered here

uber-pickups-analysis Data Source: https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city Information about data set The dataset contain

1 Nov 02, 2021
189 Jan 02, 2023
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.

Good news! Our new work exhibits state-of-the-art performances on DocUNet benchmark dataset: DocScanner: Robust Document Image Rectification with Prog

Hao Feng 231 Dec 26, 2022
Write Python in Urdu - اردو میں کوڈ لکھیں

UrduPython Write simple Python in Urdu. How to Use Write Urdu code in سامپل۔پے The mappings are as following: "۔": ".", "،":

Saad A. Bazaz 26 Nov 27, 2022
Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

Transformers-for-NLP-2nd-Edition @copyright 2022, Packt Publishing, Denis Rothman Contact me for any question you have on LinkedIn Get the book on Ama

Denis Rothman 150 Dec 23, 2022
SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning

SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning We propose a SASE mode

Tower 1 Nov 20, 2021
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediğiniz yayıncıların, Twitch'den sızdırılan 125 GB'lik veriye dayanarak, 2019-2021 arası aylık gelirlerini

4 Nov 11, 2021
Score-Based Point Cloud Denoising (ICCV'21)

Score-Based Point Cloud Denoising (ICCV'21) [Paper] https://arxiv.org/abs/2107.10981 Installation Recommended Environment The code has been tested in

Shitong Luo 79 Dec 26, 2022
[NeurIPS 2021] Code for Learning Signal-Agnostic Manifolds of Neural Fields

Learning Signal-Agnostic Manifolds of Neural Fields This is the uncleaned code for the paper Learning Signal-Agnostic Manifolds of Neural Fields. The

60 Dec 12, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Facebook Research 409 Oct 28, 2022
Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

Nateve 7 Jan 15, 2022