Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package splits a product title string into components such as type, brand, model and article (also known as SKU or product code).
Think of classic named entity recognition, but applied to product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"
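
To tag many titles at once, the single-title call above can be wrapped in a small loop. This is only a sketch: `tag_titles` and `FIELDS` are hypothetical names that are not part of revizor, and the only API assumed is the `predict` method and the four attributes shown in the usage example. `getattr` with a default is used because how a missing component is represented is not documented here.

```python
from typing import Any, Dict, Iterable, List

# Attribute names taken from the usage example; the helper itself is hypothetical.
FIELDS = ("type", "brand", "model", "article")


def tag_titles(tagger: Any, titles: Iterable[str]) -> List[Dict[str, Any]]:
    """Run tagger.predict on each title and collect the components as dicts."""
    rows = []
    for title in titles:
        product = tagger.predict(title)
        row = {"title": title}
        for field in FIELDS:
            # Fall back to None if the attribute is absent on the result.
            row[field] = getattr(product, field, None)
        rows.append(row)
    return rows
```

With revizor installed, `tag_titles(ProductTagger(), titles)` would return one dict per title, which is convenient for writing the results to CSV or a data frame.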

Boring numbers

This is simply the output from the flair training log:

Corpus: 138959 train + 15440 dev + 51467 test sentences
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
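
The aggregate scores can be rechecked from the per-class counts above. The sketch below uses the standard definitions (micro F1 from pooled counts, macro F1 as the unweighted mean of per-class F1); it is an assumption that the training log computes them the same way, and the names `COUNTS` and `f1_from_counts` are made up for illustration.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per class, copied from the log above.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "ARTICLE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn)."""
    return 2 * tp / (2 * tp + fp + fn)


# Macro F1: average the per-class scores, every class weighted equally.
macro_f1 = sum(f1_from_counts(*c) for c in COUNTS.values()) / len(COUNTS)

# Micro F1: pool the counts over all classes first, then compute one score.
total_tp = sum(c[0] for c in COUNTS.values())
total_fp = sum(c[1] for c in COUNTS.values())
total_fn = sum(c[2] for c in COUNTS.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"micro: {micro_f1:.4f}, macro: {macro_f1:.4f}")
```

Rounded to four places, both values agree with the reported 0.8843 and 0.8766. The micro score sits above the macro score because the large BRAND and TYPE classes dominate the pooled counts.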

Dataset

The model was trained on an automatically annotated corpus. Since the corpus may be subject to DMCA claims, we won't publish it.
But we can give a hint on how to obtain it, can't we?
A dataset can be created by scraping any large marketplace, such as goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

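# Split on the first occurrence of the brand only, then trim the surrounding spaces.
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, 1)
)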

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
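
The split above can be taken one step further to produce token-level labels for training. This is a rough sketch, not the authors' annotation code: the function name `annotate_title`, whitespace tokenization and the BIO tagging scheme are all assumptions, and only the label names TYPE, BRAND and MODEL come from the training log.

```python
from typing import List, Optional, Tuple


def annotate_title(title: str, brand: str, model: str) -> List[Tuple[str, str]]:
    """Label whitespace tokens of a title with BIO tags (hypothetical helper)."""
    before, found, after = title.partition(brand)
    if not found:
        # Brand string is not in the title, so nothing can be labelled.
        return [(tok, "O") for tok in title.split()]

    labelled: List[Tuple[str, str]] = []

    def add(text: str, label: Optional[str]) -> None:
        for i, tok in enumerate(text.split()):
            prefix = "B" if i == 0 else "I"
            labelled.append((tok, f"{prefix}-{label}" if label else "O"))

    add(before, "TYPE")   # everything left of the brand is treated as the type
    add(brand, "BRAND")
    after = after.strip()
    if model and after.startswith(model):
        add(model, "MODEL")
        add(after[len(model):], None)  # trailing info such as memory size or color
    else:
        add(after, None)
    return labelled


pairs = annotate_title(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
)
```

Titles where the scraped model string does not directly follow the brand are left without a MODEL span here; a real pipeline would need to decide whether to drop or repair such rows.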

License

This package is licensed under the MIT license.


Owner

Bureaucratic Labs
