A method for cleaning and classifying text using transformers.

Overview

NLP Translation and Classification

The repository contains a method for classifying and cleaning text using NLP transformers.

Overview

The input data are web-scraped product names gathered from various e-shops. The products are either monitors or printers. Each product in the dataset has a scraped name containing information about the product brand, and product model name, but also unwanted noise - irrelevant information about the item. Additionally, only some records are relevant, meaning that they belong to the correct category: monitor or printer, while other records belong to unwanted categories like accessories or TVs.

The goal of the tasks is to preprocess web-scraped data by removing noisy records and cleaning product names. Preliminary experiments showed that classic machine learning methods like tf-idf vectorization and classification struggled to achieve good results. Instead NLP transformers were employed:

  • First, DistilBERT was utilized for removing irrelevant records. The available data are monitors with annotated labels where the records are classified into three classes: "Monitor", "TV", and "Noise".
  • After, T5 was applied for cleaning product names by translating scraped name into clean name containing only product brand and product model name. For instance, for the given input "monitor led aoc 24g2e 24" ips 1080 ..." the desired output is "aoc | 24g2e". The available data are monitors and printers with annotated targets.

The datasets are split into training, validation and test sets without overlapping records.

The results and details about training and evaluation procedure can be found in the Jupyter Notebooks, see Content section below.

Content

The repository contains Jupyter Notebooks for training and evaluating NNs:

  • 01_data_exploration.ipynb - The notebook contains an exploration of the datasets for sequence classification and translation. It includes visualization of distributions of targets, and overview of available metadata.
  • 02a_classification_fine_tuning.ipynb - The notebook fine-tunes a DistilBERT classifier using training and validation sets, and saves the trained checkpoint.
  • 02b_classification_evaluation.ipynb - The notebook evaluates classification scores on the test set. It includes: a classification report with precision, recall and F1 scores; and a confusion matrix.
  • 03a_translation_fine_tuning.ipynb - The notebook fine-tunes a T5 translation network using training and validation sets, and saves the trained checkpoint.
  • 03b_translation_evaluation.ipynb - The notebook evaluates translation metrics on the test set. The metrics are: Text Accuracy (exact match of target and predicted sequences); Levenshtein Score (normalized reversed Levenshtein Distance where 1 is the best and 0 is the worst); and Jaccard Index.
  • 04_benchmarking.ipynb - The notebook evaluates GPU memory and time needed for running inference on DistilBERT and T5 models using various values of batch size and sequence length.

Getting Started

Package Dependencies

The method were developed using Python=3.7 with transformers=4.8 framework that uses PyTorch=1.9 machine learning framework on a backend. Additionally, the repository requires packages: numpy, pandas, matplotlib and datasets.

To install required packages with PyTorch for CPU run:

pip install -r requirements.txt

For PyTorch with GPU run:

pip install -r requirements_gpu.txt

The requirement files do not contain jupyterlab nor any other IDE. To install jupyterlab run

pip install jupyterlab

Contact

Rail Chamidullin - [email protected] - Github account

Owner
Ray Chamidullin
Ray Chamidullin
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Yuki Okuda 3 Feb 27, 2022
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
Unsupervised Language Modeling at scale for robust sentiment classification

** DEPRECATED ** This repo has been deprecated. Please visit Megatron-LM for our up to date Large-scale unsupervised pretraining and finetuning code.

NVIDIA Corporation 1k Nov 17, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

flair 12.3k Dec 31, 2022
NVDA, the free and open source Screen Reader for Microsoft Windows

NVDA NVDA (NonVisual Desktop Access) is a free, open source screen reader for Microsoft Windows. It is developed by NV Access in collaboration with a

NV Access 1.6k Jan 07, 2023
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 07, 2022
Repository for Graph2Pix: A Graph-Based Image to Image Translation Framework

Graph2Pix: A Graph-Based Image to Image Translation Framework Installation Install the dependencies in env.yml $ conda env create -f env.yml $ conda a

18 Nov 17, 2022
This repository structures data in title, summary, tags, sentiment given a fragment of a conversation

Understand-conversation-AI This repository structures data in title, summary, tags, sentiment given a fragment of a conversation How to install: pip i

Juan Camilo López Montes 1 Jan 11, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers an

Parv Bhatt 1 Jan 01, 2022
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
Code for Text Prior Guided Scene Text Image Super-Resolution

Code for Text Prior Guided Scene Text Image Super-Resolution

82 Dec 26, 2022
Few-shot Natural Language Generation for Task-Oriented Dialog

Few-shot Natural Language Generation for Task-Oriented Dialog This repository contains the dataset, source code and trained model for the following pa

172 Dec 13, 2022
Sentiment-Analysis and EDA on the IMDB Movie Review Dataset

Sentiment-Analysis and EDA on the IMDB Movie Review Dataset The main part of the work focuses on the exploration and study of different approaches whi

Nikolas Petrou 1 Jan 12, 2022
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models 🌲🌱

Recognai 65 Sep 13, 2022
Random-Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Random Word Generator Generates meaningful words from dictionary with given no. of letters and words. This might be useful for generating short links

Mohammed Rabil 1 Jan 01, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023