Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package splits a product title string into its components: type, brand, model and article (also known as SKU or product code).
Think of classic named entity recognition, but applied to product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

These are taken straight from the flair training log:

Corpus: 138959 train + 15440 dev + 51467 test sentences
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
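
The precision, recall and F1 columns follow directly from the tp/fp/fn counts, and the micro and macro scores are two ways of averaging them. A minimal sketch that recomputes them from the table (the helper name `prf` is ours for illustration, it is not part of revizor or flair):

```python
# Counts copied from the "By class" table: (tp, fp, fn).
counts = {
    "ARTICLE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro F1: pool the counts over all classes, then score once.
totals = [sum(column) for column in zip(*counts.values())]
micro_f1 = prf(*totals)[2]

for label, (p, r, f1) in per_class.items():
    print(f"{label:<8} precision: {p:.4f} - recall: {r:.4f} - f1-score: {f1:.4f}")
print(f"micro: {micro_f1:.4f} - macro: {macro_f1:.4f}")
```

Because micro averaging pools the counts, the large BRAND and MODEL classes dominate it, while macro averaging gives the smaller ARTICLE class the same weight as the others.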

Dataset

The model was trained on an automatically annotated corpus. Since the corpus may be subject to DMCA claims, we won't publish it.
But we can give a hint on how to obtain it, can't we?
The dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model, so we can split the product title on the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

# Split on the first occurrence of the brand only, then trim the spaces
# left around it.
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, maxsplit=1)
)

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
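
To turn such a split into training data for a sequence tagger, each token needs a label. The sketch below is one possible way to do that, not necessarily the pipeline used for revizor. It assumes whitespace tokenization and that the (non-empty) brand and model strings appear verbatim in the title. The function name `annotate` and the `O` label for leftover tokens are hypothetical.

```python
def annotate(title, brand, model):
    """Label whitespace tokens of `title` as TYPE, BRAND, MODEL or O.

    Hypothetical helper: everything before the brand is treated as TYPE,
    the brand and model strings are matched verbatim, the rest gets "O".
    """
    before, found, after = title.partition(brand)
    if not found:
        raise ValueError(f"brand {brand!r} not found in title {title!r}")

    labeled = [(token, "TYPE") for token in before.split()]
    labeled += [(token, "BRAND") for token in brand.split()]

    # Split the remainder around the model string; text on either side is
    # extra info (capacity, color, ...) that we leave untagged.
    pre_model, model_found, post_model = after.partition(model)
    if model_found:
        labeled += [(token, "O") for token in pre_model.split()]
        labeled += [(token, "MODEL") for token in model.split()]
        labeled += [(token, "O") for token in post_model.split()]
    else:
        labeled += [(token, "O") for token in after.split()]
    return labeled


for token, label in annotate(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
):
    print(token, label)
```

Titles where the brand is missing raise an error here so they can be dropped from the corpus. Article codes are not handled at all, since the description above only covers brand and model.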

License

This package is licensed under the MIT license.

Owner

Bureaucratic Labs
