Natural language Understanding Toolkit

Related tags

Text Data & NLPnut
Overview

Natural language Understanding Toolkit

TOC

Requirements

To install nut you need:

  • Python 2.5 or 2.6
  • Numpy (>= 1.1)
  • Sparsesvd (>= 0.1.4) [1] (only CLSCL)

Installation

To clone the repository run,

git clone git://github.com/pprett/nut.git

To build the extension modules inplace run,

python setup.py build_ext --inplace

Add project to python path,

export PYTHONPATH=$PYTHONPATH:$HOME/workspace/nut

Documentation

CLSCL

An implementation of Cross-Language Structural Correspondence Learning (CLSCL). See [Prettenhofer2010] for a detailed description and [Prettenhofer2011] for more experiments and enhancements.

The data for cross-language sentiment classification that has been used in the above study can be found here [2].

clscl_train

Training script for CLSCL. See ./clscl_train --help for further details.

Usage:

$ ./clscl_train en de cls-acl10-processed/en/books/train.processed cls-acl10-processed/en/books/unlabeled.processed cls-acl10-processed/de/books/unlabeled.processed cls-acl10-processed/dict/en_de_dict.txt model.bz2 --phi 30 --max-unlabeled=50000 -k 100 -m 450 --strategy=parallel

|V_S| = 64682
|V_T| = 106024
|V| = 170706
|s_train| = 2000
|s_unlabeled| = 50000
|t_unlabeled| = 50000
debug: DictTranslator contains 5012 translations.
mutualinformation took 5.624 sec
select_pivots took 7.197 sec
|pivots| = 450
create_inverted_index took 59.353 sec
Run joblib.Parallel
[Parallel(n_jobs=-1)]: Done   1 out of 450 |elapsed:    9.1s remaining: 67.8min
[Parallel(n_jobs=-1)]: Done   5 out of 450 |elapsed:   15.2s remaining: 22.6min
[..]
[Parallel(n_jobs=-1)]: Done 449 out of 450 |elapsed: 14.5min remaining:    1.9s
train_aux_classifiers took 881.803 sec
density: 0.1154
Ut.shape = (100,170706)
learn took 903.588 sec
project took 175.483 sec

Note

If you have access to a hadoop cluster, you can use --strategy=hadoop to train the pivot classifiers even faster, however, make sure that the hadoop nodes have Bolt (feature-mask branch) [3] installed.

clscl_predict

Prediction script for CLSCL.

Usage:

$ ./clscl_predict cls-acl10-processed/en/books/train.processed model.bz2 cls-acl10-processed/de/books/test.processed 0.01
|V_S| = 64682
|V_T| = 106024
|V| = 170706
load took 0.681 sec
load took 0.659 sec
classes = {negative,positive}
project took 2.498 sec
project took 2.716 sec
project took 2.275 sec
project took 2.492 sec
ACC: 83.05

Named-Entity Recognition

A simple greedy left-to-right sequence labeling approach to named entity recognition (NER).

pre-trained models

We provide pre-trained named entity recognizers for place, person, and organization names in English and German. To tag a sentence simply use:

>>> from nut.io import compressed_load
>>> from nut.util import WordTokenizer

>>> tagger = compressed_load("model_demo_en.bz2")
>>> tokenizer = WordTokenizer()
>>> tokens = tokenizer.tokenize("Peter Prettenhofer lives in Austria .")

>>> # see tagger.tag.__doc__ for input format
>>> sent = [((token, "", ""), "") for token in tokens]
>>> g = tagger.tag(sent)  # returns a generator over tags
>>> print(" ".join(["/".join(tt) for tt in zip(tokens, g)]))
Peter/B-PER Prettenhofer/I-PER lives/O in/O Austria/B-LOC ./O

You can also use the convenience demo script ner_demo.py:

$ python ner_demo.py model_en_v1.bz2

The feature detector modules for the pre-trained models are en_best_v1.py and de_best_v1.py and can be found in the package nut.ner.features. In addition to baseline features (word presence, shape, pre-/suffixes) they use distributional features (brown clusters), non-local features (extended prediction history), and gazetteers (see [Ratinov2009]). The models have been trained on CoNLL03 [4]. Both models use neither syntactic features (e.g. part-of-speech tags, chunks) nor word lemmas, thus, minimizing the required pre-processing. Both models provide state-of-the-art performance on the CoNLL03 shared task benchmark for English [Ratinov2009]:

processed 46435 tokens with 4946 phrases; found: 4864 phrases; correct: 4455.
accuracy:  98.01%; precision:  91.59%; recall:  90.07%; FB1:  90.83
              LOC: precision:  91.69%; recall:  90.53%; FB1:  91.11  1648
              ORG: precision:  87.36%; recall:  85.73%; FB1:  86.54  1630
              PER: precision:  95.84%; recall:  94.06%; FB1:  94.94  1586

and German [Faruqui2010]:

processed 51943 tokens with 2845 phrases; found: 2438 phrases; correct: 2168.
accuracy:  97.92%; precision:  88.93%; recall:  76.20%; FB1:  82.07
              LOC: precision:  87.67%; recall:  79.83%; FB1:  83.57  957
              ORG: precision:  82.62%; recall:  65.92%; FB1:  73.33  466
              PER: precision:  93.00%; recall:  78.02%; FB1:  84.85  1015

To evaluate the German model on the out-domain data provided by [Faruqui2010] use the raw flag (-r) to write raw predictions (without B- and I- prefixes):

./ner_predict -r model_de_v1.bz2 clner/de/europarl/test.conll - | clner/scripts/conlleval -r
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 40.9214s sec.
processed 110405 tokens with 2112 phrases; found: 2930 phrases; correct: 1676.
accuracy:  98.50%; precision:  57.20%; recall:  79.36%; FB1:  66.48
              LOC: precision:  91.47%; recall:  71.13%; FB1:  80.03  563
              ORG: precision:  43.63%; recall:  83.52%; FB1:  57.32  1673
              PER: precision:  62.10%; recall:  83.85%; FB1:  71.36  694

Note that the above results cannot be compared directly to the resuls of [Faruqui2010] since they use a slighly different setting (incl. MISC entity).

ner_train

Training script for NER. See ./ner_train --help for further details.

To train a conditional markov model with a greedy left-to-right decoder, the feature templates of [Rationov2009]_ and extended prediction history (see [Ratinov2009]) use:

./ner_train clner/en/conll03/train.iob2 model_rr09.bz2 -f rr09 -r 0.00001 -E 100 --shuffle --eph
________________________________________________________________________________
Feature extraction

min count:  1
use eph:  True
build_vocabulary took 24.662 sec
feature_extraction took 25.626 sec
creating training examples... build_examples took 42.998 sec
[done]
________________________________________________________________________________
Training

num examples: 203621
num features: 553249
num classes: 9
classes:  ['I-LOC', 'B-ORG', 'O', 'B-PER', 'I-PER', 'I-MISC', 'B-MISC', 'I-ORG', 'B-LOC']
reg: 0.00001000
epochs: 100
9 models trained in 239.28 seconds.
train took 282.374 sec

ner_predict

You can use the prediction script to tag new sentences formatted in CoNLL format and write the output to a file or to stdout. You can pipe the output directly to conlleval to assess the model performance:

./ner_predict model_rr09.bz2 clner/en/conll03/test.iob2 - | clner/scripts/conlleval
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 11.2883s sec.
processed 46435 tokens with 5648 phrases; found: 5605 phrases; correct: 4799.
accuracy:  96.78%; precision:  85.62%; recall:  84.97%; FB1:  85.29
              LOC: precision:  87.29%; recall:  88.91%; FB1:  88.09  1699
             MISC: precision:  79.85%; recall:  75.64%; FB1:  77.69  665
              ORG: precision:  82.90%; recall:  78.81%; FB1:  80.80  1579
              PER: precision:  88.81%; recall:  91.28%; FB1:  90.03  1662

References

[1] http://pypi.python.org/pypi/sparsesvd/0.1.4
[2] http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz
[3] https://github.com/pprett/bolt/tree/feature-mask
[4] For German we use the updated version of CoNLL03 by Sven Hartrumpf.
[Prettenhofer2010] Prettenhofer, P. and Stein, B., Cross-language text classification using structural correspondence learning. In Proceedings of ACL '10.
[Prettenhofer2011] Prettenhofer, P. and Stein, B., Cross-lingual adaptation using structural correspondence learning. ACM TIST (to appear). [preprint]
[Ratinov2009] (1, 2, 3) Ratinov, L. and Roth, D., Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL '09.
[Faruqui2010] (1, 2, 3) Faruqui, M. and Padó S., Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In Proceedings of KONVENS '10

Developer Notes

  • If you copy a new version of bolt into the externals directory make sure to run cython on the *.pyx files. If you fail to do so you will get a PickleError in multiprocessing.
Owner
Peter Prettenhofer
Peter Prettenhofer
Binary LSTM model for text classification

Text Classification The purpose of this repository is to create a neural network model of NLP with deep learning for binary classification of texts re

Nikita Elenberger 1 Mar 11, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 02, 2023
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
GPT-3 command line interaction

Writer_unblock Straight-forward command line interfacing with GPT-3. Finding yourself stuck at a conceptual stage? Spinning your wheels needlessly on

Seth Nuzum 6 Feb 10, 2022
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

source code for NeurIPS21 paper robabilistic Margins for Instance Reweighting in Adversarial Training

9 Dec 20, 2022
This is a NLP based project to extract effective date of the contract from their text files.

Date-Extraction-from-Contracts This is a NLP based project to extract effective date of the contract from their text files. Problem statement This is

Sambhav Garg 1 Jan 26, 2022
An A-SOUL Text Generator Based on CPM-Distill.

ASOUL-Generator-Backend 本项目为 https://asoul.infedg.xyz/ 的后端。 模型为基于 CPM-Distill 的 transformers 转化版本 CPM-Generate-distill 训练而成。

infinityedge 46 Dec 11, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classifi

186 Dec 24, 2022
DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

BuyWithCrypto 3 Apr 19, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
Mycroft Core, the Mycroft Artificial Intelligence platform.

Mycroft Mycroft is a hackable open source voice assistant. Table of Contents Getting Started Running Mycroft Using Mycroft Home Device and Account Man

Mycroft 6.1k Jan 09, 2023
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Cambridge Language Technology Lab 61 Dec 10, 2022
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk.

Simple-Vosk A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk. Check out the official Vosk G

2 Jun 19, 2022