ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

Overview

What is ProteinBERT?

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

ProteinBERT's deep-learning architecture is inspired by BERT, but it contains several innovations such as its global-attention layers that grow only lineraly with sequence length (compared to self-attention's quadratic growth). As a result, the model can process protein sequences of almost any length, includng extremely long protein sequences (of over tens of thousands of amino acids).

The model takes protein sequences as inputs, and can also take protein GO annotations as additional inputs (to help the model infer about the function of the input protein and update its internal representations and outputs accordingly). This package provides seamless access to a pretrained state that has been produced by training the model for 28 days over ~670M records (i.e. ~6.4 iterations over the entire training dataset of ~106M records). For users interested in pretraining the model from scratch, the package also includes scripts for that.

Installation

Dependencies

ProteinBERT requires Python 3.

Below are the Python packages required by ProteinBERT, which are automatically installed with it (and the versions of these packages that were tested with ProteinBERT 1.0.0):

  • tensorflow (2.4.0)
  • tensorflow_addons (0.12.1)
  • numpy (1.20.1)
  • pandas (1.2.3)
  • h5py (3.2.1)
  • lxml (4.3.2)
  • pyfaidx (0.5.8)

Install ProteinBERT

Just run:

pip install protein-bert

Alternatively, clone this repository and run:

python setup.py install

Using ProteinBERT

Fine-tuning ProteinBERT is very easy. You can see some working examples in this notebook.

Pretraining ProteinBERT from scratch

If, instead of using the existing pretrained model weights, you would like to train it from scratch, then follow the steps below. We warn you however that this is a long process (we pretrained the current model for a whole month), and it also requires a lot of storage (>1TB).

Step 1: Create the UniRef dataset

ProteinBERT is pretrained on a dataset derived from UniRef90. Follow these steps to produce this dataset:

  1. First, choose a working directory with sufficient (>1TB) free storage.
cd /some/workdir
  1. Download the metadata of GO from CAFA and extract it.
wget https://www.biofunctionprediction.org/cafa-targets/cafa4ontologies.zip
mkdir cafa4ontologies
unzip cafa4ontologies.zip -d cafa4ontologies/
  1. Download UniRef90, as both XML and FASTA.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.xml.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
gunzip uniref90.fasta.gz
  1. Use the create_uniref_db script provided by ProteinBERT to extract the GO annotations associated with UniRef's records into an SQLite database (and a CSV file with the metadata of these GO annotations). Since this is a long process (which can take up to a few days), it is recommended to run this in the background (e.g. using nohup).
nohup create_uniref_db --uniref-xml-gz-file=./uniref90.xml.gz --go-annotations-meta-file=./cafa4ontologies/go.txt --output-sqlite-file=./uniref_proteins_and_annotations.db --output-go-annotations-meta-csv-file=./go_annotations.csv >&! ./log_create_uniref_db.txt &
  1. Create the final dataset (in the H5 format) by merging the database of GO annotations with the protein sequences using the create_uniref_h5_dataset script provided by ProteinBERT. This is also a long process that should be let to run in the background.
nohup create_uniref_h5_dataset --protein-annotations-sqlite-db-file=./uniref_proteins_and_annotations.db --protein-fasta-file=./uniref90.fasta --go-annotations-meta-csv-file=./go_annotations.csv --output-h5-dataset-file=./dataset.h5 --min-records-to-keep-annotation=100 >&! ./log_create_uniref_h5_dataset.txt &
  1. Finally, use ProteinBERT's set_h5_testset script to designate which of the dataset records will be considered part of the test set (so that their GO annotations are not used during pretraining). If you are planning to evaluate your model on certain downstream benchmarks, it is recommended that any UniRef record similar to a test-set protein in these benchmark will be considered part of the pretraining's test set. You can use BLAST to find all of these UniRef records and provide them to set_h5_testset through the flag --uniprot-ids-file=./uniref_90_seqs_matching_test_set_seqs.txt, where the provided text file contains the UniProt IDs of the relevant records, one per line (e.g. A0A009EXK6_ACIBA).
set_h5_testset --h5-dataset-file=./dataset.h5

Step 2: Pretrain ProteinBERT on the UniRef dataset

Once you have the dataset ready, the pretrain_proteinbert script will train a ProteinBERT model on that dataset.

Basic use of the pretraining script looks as follows:

mkdir -p ~/proteinbert_models/new
nohup pretrain_proteinbert --dataset-file=./dataset.h5 --autosave-dir=~/proteinbert_models/new >&! ~/proteinbert_models/log_new_pretraining.txt &

By running that, ProteinBERT will continue to train indefinitely. Therefore, make sure to run it in the background using nohup or other options. Every given number of epochs (determined as 100 batches) the model state will be automatically saved into the specified autosave directory. If this process is interrupted and you wish to resume pretraining from a given snapshot (e.g. the most up-to-date state file within the autosave dir) use the --resume-from flag (provide it the state file that you wish to resume from).

pretrain_proteinbert has MANY options and hyper-parameters that are worth checking out:

pretrain_proteinbert --help

Step 3: Use your pretrained model state when fine-tuning ProteinBERT

Normally the function load_pretrained_model is used to load the existing pretrained model state. If you wish to load your own pretrained model state instead, then use the load_pretrained_model_from_dump function instead.

License

ProteinBERT is a free open-source project available under the MIT License.

Cite us

If you use ProteinBERT as part of a work contributing to a scientific publication, we ask that you cite our paper: Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: A universal deep-learning model of protein sequence and function. bioRxiv (2021). https://doi.org/10.1101/2021.05.24.445464

Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context This Repository contains the code on AVA of our ACM MM 2021 paper: LSTC: Boosting

Tencent YouTu Research 9 Oct 11, 2022
NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles

NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles NewsMTSC is a dataset for target-dependent sentiment classification (TSC)

Felix Hamborg 79 Dec 30, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
This repo contains simple to use, pretrained/training-less models for speaker diarization.

PyDiar This repo contains simple to use, pretrained/training-less models for speaker diarization. Supported Models Binary Key Speaker Modeling Based o

12 Jan 20, 2022
Natural Language Processing Tasks and Examples.

Natural Language Processing Tasks and Examples With the advancement of A.I. technology in recent years, natural language processing technology has bee

Soohwan Kim 53 Dec 20, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
This simple Python program calculates a love score based on your and your crush's full names in English

This simple Python program calculates a love score based on your and your crush's full names in English. There is no logic or reason in the calculation behind the love score. The calculation could ha

p.katekomol 1 Jan 24, 2022
Ελληνικά νέα (Python script) / Greek News Feed (Python script)

Ελληνικά νέα (Python script) / Greek News Feed (Python script) Ελληνικά English Το 2017 είχα υλοποιήσει ένα Python script για να εμφανίζει τα τωρινά ν

Loren Kociko 1 Jun 14, 2022
Natural Language Processing at EDHEC, 2022

Natural Language Processing Here you will find the teaching materials for the "Natural Language Processing" course at EDHEC Business School, 2022 What

1 Feb 04, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
Natural language computational chemistry command line interface.

nlcc Install pip install nlcc Must have Open-AI Codex key: export OPENAI_API_KEY=your key here then nlcc key bindings ctrl-w copy to clipboard (Note

Andrew White 37 Dec 14, 2022
PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Poincaré Embeddings for Learning Hierarchical Representations PyTorch implementation of Poincaré Embeddings for Learning Hierarchical Representations

Facebook Research 1.6k Dec 29, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Quick insights from Zoom meeting transcripts using Graph + NLP

Transcript Analysis - Graph + NLP This program extracts insights from Zoom Meeting Transcripts (.vtt) using TigerGraph and NLTK. In order to run this

Advit Deepak 7 Sep 17, 2022
This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

EleutherAI 42 Dec 13, 2022
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022