Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

Overview

New State-of-the-Art in Preposition Sense Disambiguation

Supervisor:

Institutions:

Project Description

The disambiguation of words is a central part of NLP tasks. In particular, there is the ambiguity of prepositions, which has been a problem in NLP for over a decade and still is. For example the preposition 'in' can have a temporal (e.g. in 2021) or a spatial (e.g. in Frankuft) meaning. A strong motivation behind the learning of these meanings are current research attempts to transfer text to artifical scenes. A good understanding of the real meaning of prepositions is crucial in order for the machine to create matching scenes.

With the birth of the transformer models in 2017 [1], attention based models have been pushing boundries in many NLP disciplines. In particular, bert, a transformer model by google and pre-trained on more than 3,000 M words, obtained state-of-the-art results on many NLP tasks and Corpus.

The goal of this project is to use modern transformer models to tackle the problem of preposition sense disambiguation. Therefore, we trained a simple bert model on the SemEval 2007 dataset [2], a central benchmark dataset for this task. To the best of our knowledge, the best purposed model for disambiguating the meanings of prepositions on the SemEval achives an accuracy of up to 88% [3]. Neither more recent approaches surpass this frontier[4][5] . Our model achives an accuracy of 90.84%, out-performing the current state-of-the-art.

How to train

To meet our goals, we cleand the SemEval 2007 dataset to only contain the needed information. We have added it to the repository and can be found in ./data/training-data.tsv.

Train a bert model:
First, install the requirements.txt. Afterwards, you can train the bert-model by:

python3 trainer.py --batch-size 16 --learning-rate 1e-4 --epochs 4 --data-path "./data/training_data.tsv"

The chosen hyper-parameters in the above example are tuned and already set by default. After training, this will save the weights and config to a new folder ./model_save/. Feel free to omit this training-step and use our trained weights directly.

Examples

We attach an example tagger, which can be used in an interactive manner. python3 -i tagger.py

Sourrond the preposition, for which you like to know the meaning of, with <head>...</head> and feed it to the tagger:

>>> tagger.tag("I am <head>in</head> big trouble")
Predicted Meaning: Indicating a state/condition/form, often a mental/emotional one that is being experienced 

>>> tagger.tag("I am speaking <head>in</head> portuguese.")
Predicted Meaning: Indicating the language, medium, or means of encoding (e.g., spoke in German)

>>> tagger.tag("He is swimming <head>with</head> his hands.")
Predicted Meaning: Indicating the means or material used to perform an action or acting as the complement of similar participle adjectives (e.g., crammed with, coated with, covered with)

>>> tagger.tag("She blinked <head>with</head> confusion.")
Predicted Meaning: Because of / due to (the physical/mental presence of) (e.g., boiling with anger, shining with dew)

References

[1] Vaswani, Ashish et al. (2017). Attention is all you need. Advances in neural information processing systems. P. 5998--6008.

[2] Litkowski, Kenneth C and Hargraves, Orin (2007). SemEval-2007 Task 06: Word-sense disambiguation of prepositions. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). P. 24--29

[3] Litkowski, Ken. (2013). Preposition disambiguation: Still a problem. CL Research, Damascus, MD.

[4] Gonen, Hila and Goldberg, Yoav. (2016). Semi supervised preposition-sense disambiguation using multilingual data. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. P. 2718--2729

[5] Gong, Hongyu and Mu, Jiaqi and Bhat, Suma and Viswanath, Pramod (2018). Preposition Sense Disambiguation and Representation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. P. 1510--1521

Owner
Dirk Neuhäuser
Dirk Neuhäuser
Repository for the paper: VoiceMe: Personalized voice generation in TTS

🗣 VoiceMe: Personalized voice generation in TTS Abstract Novel text-to-speech systems can generate entirely new voices that were not seen during trai

Pol van Rijn 80 Dec 29, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 1.2k Jan 08, 2023
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
ADCS - Automatic Defect Classification System (ADCS) for SSMC

Table of Contents Table of Contents ADCS Overview Summary Operator's Guide Demo System Design System Logic Training Mode Production System Flow Folder

Tam Zher Min 2 Jun 24, 2022
Utility for Google Text-To-Speech batch audio files generator. Ideal for prompt files creation with Google voices for application in offline IVRs

Google Text-To-Speech Batch Prompt File Maker Are you in the need of IVR prompts, but you have no voice actors? Let Google talk your prompts like a pr

Ponchotitlán 1 Aug 19, 2021
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

Works Applications 48 Dec 14, 2022
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
🎐 a python library for doing approximate and phonetic matching of strings.

jellyfish Jellyfish is a python library for doing approximate and phonetic matching of strings. Written by James Turk James Turk 1.8k Dec 21, 2022

Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
CYGNUS, the Cynical AI, combines snarky responses with uncanny aggression.

New & (hopefully) Improved CYGNUS with several API updates, user updates, and online/offline operations added!!!

Simran Farrukh 0 Mar 28, 2022
Automatically search Stack Overflow for the command you want to run

stackshell Automatically search Stack Overflow (and other Stack Exchange sites) for the command you want to ru Use the up and down arrows to change be

circuit10 22 Oct 27, 2021
A simple Streamlit App to classify swahili news into different categories.

Swahili News Classifier Streamlit App A simple app to classify swahili news into different categories. Installation Install all streamlit requirements

Davis David 4 May 01, 2022
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022
A collection of models for image - text generation in ACM MM 2021.

Bi-directional Image and Text Generation UMT-BITG (image & text generator) Unifying Multimodal Transformer for Bi-directional Image and Text Generatio

Multimedia Research 63 Oct 30, 2022