AI-powered literature discovery and review engine for medical/scientific papers

Overview





paperai is an AI-powered literature discovery and review engine for medical/scientific papers. It helps automate tedious literature reviews, allowing researchers to focus on their core work. Queries are run to filter papers with specified criteria. Reports powered by extractive question-answering identify answers to key questions within sets of medical/scientific papers.

paperai was used to analyze the COVID-19 Open Research Dataset (CORD-19), winning multiple awards in the CORD-19 Kaggle challenge.

paperai and NeuML have been recognized in a number of published articles.

Installation

The easiest way to install is via pip and PyPI.

pip install paperai

You can also install paperai directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/paperai
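
For example, a minimal virtual environment setup (the environment name venv is arbitrary):

python -m venv venv
source venv/bin/activate
pip install git+https://github.com/neuml/paperai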

Python 3.6+ is supported.

See this link to help resolve environment-specific install issues.

Docker

A Dockerfile with commands to install paperai, all dependencies and scripts is available in this repository.

Clone this git repository and run the following to build and run the Docker image.

docker build -t paperai -f docker/Dockerfile .
docker run --name paperai --rm -it paperai

This will bring up a paperai command shell. Standard Docker commands can be used to copy files over or commands can be run directly in the shell to retrieve input content. All scripts in the following examples are available in this environment.
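
For example, a previously built article database can be copied into the running container with standard Docker commands. The destination path below is an assumption based on the default ~/.cord19 layout; substitute the path used in your environment.

docker cp cord19/models/articles.sqlite paperai:/root/.cord19/models/articles.sqlite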

paperetl's Dockerfile can be combined with this Dockerfile to build a single image that can both index and query content. The files from the paperetl project's scripts directory need to be placed in paperai's scripts directory. The paperetl Dockerfile also needs to be copied over (it's referenced as paperetl.Dockerfile here).

docker build -t base -f docker/Dockerfile .
docker build -t paperai --build-arg BASE_IMAGE=base -f docker/paperetl.Dockerfile .
docker run --name paperai --rm -it paperai

Examples

The following notebooks and applications demonstrate the capabilities provided by paperai.

Notebooks

Notebook | Description
CORD-19 Analysis with Sentence Embeddings | Builds paperai-based submissions for the CORD-19 Challenge
CORD-19 Report Builder | Template for building new reports

Applications

Application | Description
Search | Search a paperai index. Set query parameters, execute searches and display results.

Building a model

paperai indexes databases previously built with paperetl. paperai currently supports querying SQLite databases.

The following sections show how to build an index for a SQLite articles database.

This example assumes the database and model path is cord19/models. Substitute as appropriate.

  1. Download CORD-19 fastText vectors

    scripts/getvectors.sh cord19/vectors

    A full vector model build can optionally be run with the following command.

    python -m paperai.vectors cord19/models

    CORD-19 fastText vectors are also available on Kaggle.

  2. Build embeddings index

    python -m paperai.index cord19/models cord19/vectors/cord19-300d.magnitude

The paperai.index process takes two optional arguments: the model path and the vector file path. If no parameters are passed in, the default model location is ~/.cord19.
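
For example, once vectors have been downloaded to the default location, the index can be built with no arguments at all:

python -m paperai.index

This is equivalent to indexing ~/.cord19 with the vector file at ~/.cord19/vectors/cord19-300d.magnitude (the default path referenced in the issue reports below).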

Building a report file

Reports support generating output in multiple formats. An example report call, where the arguments are the task definition file, the number of top results to consider per query (topn), the output format and the model path:

python -m paperai.report tasks/risk-factors.yml 50 md cord19/models

The following report formats are supported:

  • Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
  • CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
  • Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.

In the example above, a file named tasks/risk-factors.md will be created. Example report configuration files can be found here.
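
Task files are YAML definitions. The sketch below illustrates the general shape of a task; the section name, queries and column names are illustrative assumptions, so consult the example configuration files referenced above for the exact schema:

name: risk-factors

RiskFactors:
    query: +covid-19 risk factors
    columns:
        - name: Date
        - name: Study
        - {name: Risk Factors, query: risk factors, question: What risk factors were identified}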

Running queries

The fastest way to run queries is to start a paperai shell:

paperai cord19/models

A prompt will come up. Queries can be typed directly into the console.
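
An example session (the query text is illustrative; the top matching sentences and their source articles print to the console):

paperai cord19/models
cardiovascular disease risk factors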

Tech Overview

The tech stack is built on Python and creates a sentence embeddings index with FastText + BM25. Background on this method can be found in this Medium article.

The model is a combination of a sentence embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. FastText vectors are built over the full corpus. The sentence embeddings index only uses tagged articles, which helps produce the most relevant results.
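
The snippet below is a rough, minimal sketch of sentence-embeddings similarity search using txtai, the project paperai's embeddings index logic was migrated to (see release v1.2.0). Note the transformers model here mirrors the txtai example quoted in the comments section and is not paperai's FastText + BM25 configuration.

from txtai.embeddings import Embeddings

# Sentence embeddings model (illustrative; paperai builds its index over FastText vectors with BM25 weighting)
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

sections = ["Hypertension is a risk factor for severe illness",
            "The trial enrolled 500 patients across 3 sites"]

# Score each section against the query; the highest-scoring section is the best match
print(embeddings.similarity("risk factors", sections))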

Multiple entry points exist to interact with the model.

  • paperai.report - Builds a markdown report for a series of queries. For each query, the best articles are shown, top matches from those articles and a highlights section which shows the most relevant sections from the embeddings search for the query.
  • paperai.query - Runs a single query from the terminal
  • paperai.shell - Allows running multiple queries from the terminal
Comments
  • Vector model file not found (cord19-300d.magnitude)


• (issue moved here from the wrong project)

    Hi,

    I get the following error when running python -m paperai.index

raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\Users\x\.cord19\vectors\cord19-300d.magnitude'

PS: I am quite new to all this, so apologies if the mistake is on my end.

    When trying to download cord19-300d.magnitude from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude, I get the error: "Too many requests"

    opened by fomar1994 30
  • Installation issues


The system reports "UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 12007: illegal multibyte sequence" when I execute the command "pip install paperai". I wonder if Windows cannot decompress tar.gz-type packages.

    opened by albertY-C 16
  • I'm not sure to have followed correctly the procedure for running paperai with pre-trained vectors


    After successfully installing paperai in Linux (Ubuntu 20.04.1 LTS), I tried to run it by using the pre-trained vectors option to build the model, as follows:

(1) I downloaded the vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
(2) My Downloads folder ended up with a Zip file containing the vectors.
(3) I created a directory ~/.cord19/vectors/ and moved the downloaded Zip file into this directory.
(4) I extracted the Zip file, which produced a folder containing the file cord19-300d.magnitude.
(5) I moved the cord19-300d.magnitude file out of that folder and into the ~/.cord19/vectors/ directory.


(6) I executed the following command to build the embeddings index with the above pre-trained vectors:

    python -m paperai.index

Upon performing the above, I got an error message (attached as a screenshot).

    Am I getting this error because the above steps are not the correct ones? If so, what would be the correct steps? Otherwise, what other things should I try to eliminate the issue?

    opened by DavidRivasPhD 10
  • Windows install issue


    It was reported that paperai can't be installed in a Windows environment due to the following error:

    ValueError: path 'src/python/' cannot end with '/'

    bug 
    opened by davidmezzetti 5
  • Added pdf output build option


Modified export.py to create a PDF output option. This is done by the new method in export, streampdf.

This edit was done for educational purposes as a participant in York University's software design course.

    Thank you for your time

    opened by will0710 3
  • Processing custom sqlite file


I want to create an index and vector file over a custom SQLite articles database. I have created an articles.sqlite database of medical papers using paperetl, but I did not find any instructions on how to process it. Can you please give instructions on this?

    opened by choudharya3 3
  • risk-factors.yml issues


When I run the command "python -m paperai.report tasks/risk-factors.yml 50 md cord19/models", I can't find the file risk-factors.yml, and I don't understand the argument "50".

    opened by Zhip-S 2
  • Integration: DeepSource


    I ran DeepSource analysis on my fork of this repository and found some code quality issues. Have a look at the issues caught in this repository by DeepSource here.

DeepSource is a code review automation tool that detects code quality issues and helps you automatically fix some of them. You can use DeepSource to track test coverage, detect problems in Dockerfiles, etc., in addition to detecting issues in code.

    The PR #24 fixed some of the issues caught by DeepSource.

All the features of DeepSource are mentioned here. I'd suggest you integrate DeepSource since it is free for open-source projects forever.

    Integrating DeepSource to continuously analyze your repository:

    • Install DeepSource on your repository here.
    • Create .deepsource.toml configuration specific to this repo or use the configuration mentioned below which I used to run the analysis on the fork of this repo.
    • Activate analysis here.
    version = 1
    
    test_patterns = ["/test/python/*.py"]
    
    [[analyzers]]
    name = "python"
    enabled = true
    
      [analyzers.meta]
      runtime_version = "3.x.x"
    
    opened by withshubh 2
  • RuntimeError: CUDA error: out of memory (NVidia V100, 32 GB DDRAM)


What are the minimum memory requirements for paperai? When running on an NVIDIA V100 (32 GB) I got: RuntimeError: CUDA error: out of memory. GPU memory seems to be completely free.

Is there a way to run it on the GPU, or can I run it exclusively on TPUs?

    from txtai.embeddings import Embeddings
    import torch
    
    torch.cuda.empty_cache()
    
    # MEMORY
    id = 1
    t = torch.cuda.get_device_properties(id).total_memory
    c = torch.cuda.memory_cached(id)
    a = torch.cuda.memory_allocated(id)
    f = c-a  # free inside cache
    
    print("TOTAL", t / 1024/1024/1024," GB")
    print("ALLOCATED", a)
    
    # Create embeddings model, backed by sentence-transformers & transformers
    embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
    
    import numpy as np
    
    sections = ["US tops 5 million confirmed virus cases",
                "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
                "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
                "The National Park Service warns against sacrificing slower friends in a bear attack",
                "Maine man wins $1M from $25 lottery ticket",
                "Make huge profits without work, earn up to $100,000 a day"]
    
    
    query = "health"
    uid = np.argmax(embeddings.similarity(query, sections))
    print("%-20s %s" % (query, sections[uid]))
    

TOTAL 31.74853515625 GB
ALLOCATED 0
Traceback (most recent call last):
  File "pokus2.py", line 32, in <module>
    uid = np.argmax(embeddings.similarity(query, sections))
  File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 228, in similarity
    query = self.transform((None, query, None)).reshape(1, -1)
  File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 179, in transform
    embedding = self.model.transform(document)
  File "/home/user/.local/lib/python3.8/site-packages/txtai/vectors.py", line 264, in transform
    return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]

    opened by burgetrm 2
  • Wrong annotation places


A fix is needed to correctly annotate PDF text when the query text spans different pages, columns, or other positions in the PDF. In the screenshots, the annotator tries to match text page by page rather than considering all possible positions. This causes it to annotate text that should not be annotated, because it only searches within its current scope. Annotations that cover the wrong text also lead to confusing annotation indicators.

Columns problem: query and annotation screenshots attached.

Pages problem: query and annotation screenshots attached.
    opened by muazhari 1
  • sqlite3.OperationalError: no such table: sections


When I run the command in Docker: python -m paperai.vectors cord19/models, the output error is "sqlite3.OperationalError: no such table: sections".

    opened by wspspring 1
  • paperai for beginners


First and foremost, thank you for offering such a great library. I was wondering if you could provide a simple guide to using the library in a new research project, from loading PDF files to querying topics. I went through the examples but could not grasp the overall idea. I believe a small effort on your part would really help beginners like me use this library in research work.

    opened by satishchaudhary382 1
Releases(v2.0.0)
  • v2.0.0(Mar 12, 2022)

    This release adds the following enhancements and bug fixes:

    • Allow setting report options within task yml files (#42)
    • Allow running reports against full databases (#43)
    • Batch extractor queries (#44)
    • Remove study design columns (#46)
    • Add option to specify extraction column context (#47)
    • Add report reference column (#48)
    • Add report column format parameter (#49)
    • Add pre-commit checks (#50)
    • Add check to report sections query to ensure text has tokens (#51)
    • Remove default home directory cord19 path defaults (#52)
    • Require Python 3.7+ (#54)
    • Update txtai to 4.3.1 (#56)
  • v1.10.0(Sep 10, 2021)

  • v1.9.0(Aug 18, 2021)

  • v1.8.0(Apr 23, 2021)

    This release adds the following enhancements and bug fixes:

    • Add ability to read index yml (#18)
    • Switch from mdv to mdv3 to support Python 3.9 (#21)
    • Add enhanced API for paperai (#30)
• Add configurable query threshold (#31)
    • Support query negation (#32)
    • Add search application (#33)
  • v1.7.0(Feb 24, 2021)

  • v1.6.0(Jan 13, 2021)

  • v1.5.0(Dec 11, 2020)

  • v1.4.0(Nov 6, 2020)

    This release adds the following enhancements and bug fixes:

    • Allow specifying vector output file (#10, #11, #13)
    • Build test suite (#12)
    • Add additional column parameters (#14)
    • Allow indexing partial datasources (#15)
    • Add GitHub actions build script (#16)
  • v1.3.0(Aug 18, 2020)

  • v1.2.1(Aug 12, 2020)

  • v1.2.0(Aug 11, 2020)

    Release addresses the following:

• Allow customizing the QA model used for QA extraction (#5)
    • Migrated embeddings index logic to txtai project (#7)
  • v1.1.0(Aug 5, 2020)

    Release addresses the following:

    • Add wildcard report queries (#1) - Add ability to run report against entire database. This is only practical for smaller datasets.
    • Fix Windows install issues (#2)
    • Embeddings index memory improvements (#3) - Various improvements to limit memory usage when building an embeddings index
    • Support must clauses for custom query columns (#4) - Add same logic already present in general queries to require a term to be present when deriving report query columns
  • v1.0.0(Jul 21, 2020)

Owner
NeuML - Applying machine learning to solve everyday problems