Contact Extraction with Question Answering.

Overview

contactsQA

Extraction of contact entities from address blocks and imprints with Extractive Question Answering.

Goal

Input:

Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg

Output:

entities = {
  "city" : "Würzburg",
  "email" : "",
  "fax" : "",
  "firstName" : "Max",
  "lastName" : "Mustermann",
  "mobile" : "",
  "organization" : "",
  "phone" : "",
  "position" : "",
  "street" : "Hauptstraße 123",
  "title" : "Dr.",
  "website" : "",
  "zip" : "97070"
}

Getting started

Creating a dataset

Due to data protection reasons, no dataset is included in this repository. You need to create a dataset in the SQuAD format, see https://huggingface.co/datasets/squad. Create the dataset in the jsonl-format where one line looks like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'firstName',
        'answers': {
            'answer_start': [4],
            'text': ['Max']
        }
    }

Questions with no answers should look like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'phone',
        'answers': {
            'answer_start': [-1],
            'text': ['EMPTY']
        }
    }

Split the dataset into a train-, validation- and test-dataset and save them in a directory with the name crawl, email or expected, like this:

├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl

If you allow unanswerable questions like in SQuAD v2.0, add a -na behind the directory name, like this:

├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl

Training a model

Example command for training and evaluating a dataset inside the crawl-na directory:

python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;

Virtual Environment Setup

Create and activate the environment (the python version and the environment name can vary at will):

$ python3.9 -m venv .env
$ source .env/bin/activate

To install the project's dependencies, activate the virtual environment and simply run (requires poetry):

$ poetry install

Alternatively, use the following:

$ pip install -r requirements.txt

Deactivate the environment:

$ deactivate

Troubleshooting

Common error:

ModuleNotFoundError: No module named 'setuptools'

The solution is to upgrade setuptools and then run poetry install or poetry update afterwards:

pip install --upgrade setuptools
Owner
Jan
Data Scientist (Working student) @snapADDY & Master student at Digital Humanities at Julius-Maximilians-University Würzburg.
Jan
Beautiful visualizations of how language differs among document types.

Scattertext 0.1.0.0 A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding t

Jason S. Kessler 2k Dec 27, 2022
Exploring dimension-reduced embeddings

sleepwalk Exploring dimension-reduced embeddings This is the code repository. See here for the Sleepwalk web page. License and disclaimer This program

S. Anders's research group at ZMBH 91 Nov 29, 2022
Text classification on IMDB dataset using Keras and Bi-LSTM network

Text classification on IMDB dataset using Keras and Bi-LSTM Text classification on IMDB dataset using Keras and Bi-LSTM network. Usage python3 main.py

Hamza Rashid 2 Sep 27, 2022
Rich Prosody Diversity Modelling with Phone-level Mixture Density Network

Phone Level Mixture Density Network for TTS This repo contains pytorch implementation of paper Rich Prosody Diversity Modelling with Phone-level Mixtu

Rishikesh (ऋषिकेश) 42 Dec 13, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
Mysticbbs-rjam - rJAM splitscreen message reader for MysticBBS A46+

rJAM splitscreen message reader for MysticBBS A46+

Robbert Langezaal 4 Nov 22, 2022
تولید اسم های رندوم فینگیلیش

karafs کرفس تولید اسم های رندوم فینگیلیش installation ➜ pip install karafs usage دو زبانه ➜ karafs -n 10 توت فرنگی بی ناموس toot farangi-ye bi_namoos

Vaheed NÆINI (9E) 36 Nov 24, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

Digital Phonetics at the University of Stuttgart 247 Jan 05, 2023
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
Weird Sort-and-Compress Thing

Weird Sort-and-Compress Thing A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by

Douglas 1 Jan 03, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

THUNLP-MT 46 Dec 15, 2022
A library for finding knowledge neurons in pretrained transformer models.

knowledge-neurons An open source repository replicating the 2021 paper Knowledge Neurons in Pretrained Transformers by Dai et al., and extending the t

EleutherAI 96 Dec 21, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022