Labelling platform for text using distant supervision

Last update: Aug 05, 2022

Overview

Welcome to the DataQA platform

With DataQA, you can label unstructured text documents using rule-based distant supervision. You can use it to:

manually label all documents,
use a search engine to explore your data and label at the same time,
label a sample of some documents with an imbalanced class distribution,
create a baseline high-precision system for NER or for classification.
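
To make "rule-based distant supervision" concrete, here is a minimal sketch in plain Python. It does not use DataQA's API: the rule functions, the label names and the majority-vote combination are all assumptions made for illustration, and the platform itself may combine rules differently.

```python
from collections import Counter

ABSTAIN = None  # a rule returns ABSTAIN when it has no opinion about a document


def rule_refund(text):
    # Hypothetical keyword rule: a mention of "refund" suggests a complaint.
    return "complaint" if "refund" in text.lower() else ABSTAIN


def rule_thanks(text):
    # Hypothetical keyword rule: a mention of "thank" suggests praise.
    return "praise" if "thank" in text.lower() else ABSTAIN


def weak_label(text, rules):
    """Apply every rule and return the majority vote, or ABSTAIN if no rule fires."""
    votes = [label for label in (rule(text) for rule in rules) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


RULES = [rule_refund, rule_thanks]
```

Documents on which every rule abstains stay unlabelled, which is why a system built this way tends to favour precision over coverage.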

Documentation at: https://dataqa.ai/docs/.

Screenshots

Classify or extract named entities from your text:

Search and label your data:

Use rules & heuristics to automatically label your documents:

Installation

Pre-requisites:

Python 3.6, 3.7, 3.8 or 3.9
(Recommended) Start a new Python virtual environment
Update pip: pip install -U pip
Tested on backend: MacOSX, Ubuntu. Tested on browser: Chrome.

Installation

To install the package from PyPI:

Python versions 3.6, 3.7

pip install dataqa

Python versions 3.8, 3.9

With Python 3.8 or 3.9, first run pip install dataqa, then run pip install networkx==2.5. Pip will print an error message complaining about snorkel's dependencies, which you can ignore. The extra step is needed because of an error in snorkel's dependencies.
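
The version-dependent steps above can be summarised in a small helper. This is only a sketch: the function names are hypothetical, and it just returns the pip commands from this section as strings instead of running them.

```python
import sys


def needs_networkx_pin(version_info=sys.version_info):
    """Return True for Python 3.8 and 3.9, where networkx==2.5 must be installed afterwards."""
    return tuple(version_info[:2]) in {(3, 8), (3, 9)}


def install_commands(version_info=sys.version_info):
    """List the pip commands to run, in order, for the given interpreter version."""
    commands = ["pip install dataqa"]
    if needs_networkx_pin(version_info):
        commands.append("pip install networkx==2.5")
    return commands


if __name__ == "__main__":
    for command in install_commands():
        print(command)
```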

Usage

Start the application

In the terminal, type dataqa run. Wait a few minutes while everything starts up.

This runs a server locally and opens a browser window at port 5000. If the application does not open the browser automatically, open localhost:5000 in your browser. Keep the terminal open while you use the application.

To quit the application, press Ctrl-C in the terminal. To resume the application, type dataqa run again. Running the application creates a folder at $HOME/.dataqa_data.
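
If the browser window does not appear, one way to check whether the local server is accepting connections yet is a plain TCP probe. The sketch below uses only the Python standard library; the helper name is hypothetical and it is not part of DataQA.

```python
import socket


def is_port_open(host="localhost", port=5000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    print("server reachable" if is_port_open() else "server not reachable yet")
```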

Does this tool need an internet connection?

The tool needs an internet connection only the first time you run it, to download a language model. There is ongoing work to remove this constraint, so that it can run without any internet connection.

No data will ever leave your local machine.

Uploading data

The file must be a CSV file in UTF-8 encoding, up to 30MB in size, with a column named "text" that contains the main text. All other columns are ignored.

This step runs some analysis on your text and might take up to 5 minutes.
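
Before uploading, the format requirements above can be checked locally. The following is a minimal sketch using the Python standard library; the function name is hypothetical, and treating "30MB" as 30 * 1024 * 1024 bytes is an assumption, since the exact limit is not documented here.

```python
import csv
import os

MAX_BYTES = 30 * 1024 * 1024  # assumption: "30MB" is interpreted as 30 MiB


def check_upload_file(path):
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 30MB")
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if "text" not in (reader.fieldnames or []):
                problems.append('no column named "text"')
            for _ in reader:  # reading every row forces a full UTF-8 decode
                pass
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

Calling check_upload_file("my_data.csv") (a hypothetical file name) returns an empty list when none of these three checks fails.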

Uninstall

In the terminal:

dataqa uninstall: deletes your local application data, stored in the .dataqa_data folder in your home directory. It prompts for confirmation before deleting.
pip uninstall dataqa

Troubleshooting

Usage

If the project data does not load, try going to the homepage at http://localhost:5000 and navigating to the project from there.

Try running dataqa test to get more information about the error. Bug reports are very welcome!

Development

To test the application, you can upload a file that contains a column named "__LABEL__". The ground-truth labels will then be displayed during labelling, and the real performance will be shown in brackets in the performance table.
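
A test file with ground-truth labels could be generated as in the sketch below. The helper name and the label values are hypothetical; the expected format of the label values (for example for NER projects) is not specified here, so treat the plain class names as an assumption.

```python
import csv


def write_labelled_csv(path, rows):
    """Write (text, label) pairs to a UTF-8 CSV with "text" and "__LABEL__" columns."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "__LABEL__"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical classification examples with made-up class names.
    write_labelled_csv(
        "labelled_sample.csv",
        [("The delivery was late again.", "complaint"),
         ("Great service, thank you!", "praise")],
    )
```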

Packaging

Using setuptools

To create the wheel file:

Make sure there are no stale files: rm -rf src/dataqa.egg-info; rm -rf build/;
python setup.py sdist bdist_wheel

Contact

For any feedback, please contact us at [email protected].
