SciBERT is a BERT model trained on scientific text.

Last update: Dec 24, 2022

Overview

`SciBERT`

SciBERT is a BERT model trained on scientific text.

SciBERT is trained on papers from the corpus of semanticscholar.org. Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.
SciBERT has its own vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison.
It results in state-of-the-art performance on a wide range of scientific domain nlp tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.

Downloading Trained Models

Update! SciBERT models now installable directly within Huggingface's framework under the allenai org:

from transformers import *

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')

We release the tensorflow and the pytorch version of the trained models. The tensorflow version is compatible with code that works with the model from Google Research. The pytorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models are available below. Our evaluation shows that scivocab-uncased usually gives the best results.

Using SciBERT in your own model

SciBERT models include all necessary files to be plugged in your own model and are in same format as BERT. If you are using Tensorflow, refer to Google's BERT repo and if you use PyTorch, refer to Hugging Face's repo where detailed instructions on using BERT models are provided.

Training new models using AllenNLP

To run experiments on different tasks and reproduce our results in the paper, you need to first setup the Python 3.6 environment:

pip install -r requirements.txt

which will install dependencies like AllenNLP.

Use the scibert/scripts/train_allennlp_local.sh script as an example of how to run an experiment (you'll need to modify paths and variable names like TASK and DATASET).

We include a broad set of scientific nlp datasets under the data/ directory across the following tasks. Each task has a sub-directory of available datasets.

├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction

For example to run the model on the Named Entity Recognition (NER) task and on the BC5CDR dataset (BioCreative V CDR), modify the scibert/train_allennlp_local.sh script according to:

DATASET='bc5cdr'
TASK='ner'
...

Decompress the PyTorch model that you downloaded using
tar -xvf scibert_scivocab_uncased.tar
The results will be in the scibert_scivocab_uncased directory containing two files: A vocabulary file (vocab.txt) and a weights file (weights.tar.gz). Copy the files to your desired location and then set correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:

export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz

Finally run the script:

./scibert/scripts/train_allennlp_local.sh [serialization-directory]

Where [serialization-directory] is the path to an output directory where the model files will be stored.

Citing

If you use SciBERT in your research, please cite SciBERT: Pretrained Language Model for Scientific Text.

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  Eprint={arXiv:1903.10676}
}

SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

SciBERT is a BERT model trained on scientific text.

Related tags

Overview

`SciBERT`

Downloading Trained Models

Tensorflow Models

PyTorch AllenNLP Models

PyTorch HuggingFace Models

Using SciBERT in your own model

Training new models using AllenNLP

Citing

Owner

AI2

用Resnet101+GPT搭建一个玩王者荣耀的AI

Text vectorization tool to outperform TFIDF for classification tasks

Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Crowd sourced training data for Rasa NLU models

Open-source offline translation library written in Python. Uses OpenNMT for translations

Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

A python package to fine-tune transformer-based models for named entity recognition (NER).

GPT-3: Language Models are Few-Shot Learners

This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Application to help find best train itinerary, uses speech to text, has a spam filter to segregate invalid inputs, NLP and Pathfinding algos.

ConferencingSpeech2022; Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge

运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。

NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021