Pipeline for quickly building TF-IDF + LogReg text classification baselines.

Overview

Text Classification Baseline

Usage

Instead of writing custom code for a specific text classification task, you just need to:

  1. install pipeline:
pip install text-classification-baseline
  2. run pipeline:
  • either in terminal:
text-clf-train
  • or in python:
import text_clf

text_clf.train()

No data preparation is needed. The only input is a csv file with two raw columns (the column names are arbitrary):

  • text
  • target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.
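
As an illustration of the expected input, here is a minimal sketch that writes such a file with the Python standard library. The file name train.csv, the helper write_dataset and the sample rows are made up for this example. The column names are arbitrary as long as they match the ones set in the config.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data: raw text plus a target given as a plain string.
rows = [
    ("The team won the championship game", "sports"),
    ("New smartphone released with a faster chip", "tech"),
    ("Striker scores twice in the final", "sports"),
]


def write_dataset(path, rows, text_column="text", target_column="target"):
    """Write (text, target) pairs to a two-column csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fp:
        writer = csv.writer(fp)
        writer.writerow([text_column, target_column])
        writer.writerows(rows)


# Write to a temporary folder so the example leaves nothing behind,
# then read the file back to check its layout.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"
    write_dataset(csv_path, rows)
    with open(csv_path, newline="", encoding="utf-8") as fp:
        loaded = list(csv.DictReader(fp))
```

Note that the targets stay as strings ("sports", "tech"). No manual encoding to integers is done here.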

Config

The user interface consists of a single file, config.yaml.

Edit config.yaml to create the desired configuration, then train a text classification model with the following command:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: the tf-idf and logreg sections hold the parameters of sklearn's TfidfVectorizer and LogisticRegression, respectively, so you can parameterize instances of these classes however you want.
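
To make the mapping concrete, here is a rough sketch of how these two config sections could be turned into an sklearn pipeline. It is not the package's actual code. The dictionaries, the step names, the toy data and the use of seed as random_state are assumptions for illustration, and only a subset of the default logreg options is passed to keep the example short. A YAML loader reads ngram_range: (1, 1) as a plain string, so the sketch converts it to a tuple with ast.literal_eval.

```python
import ast

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Parameters roughly as they would look after reading the default config.yaml.
tfidf_params = {
    "lowercase": True,
    "ngram_range": "(1, 1)",  # YAML yields a string here, not a tuple
    "max_df": 1.0,
    "min_df": 0.0,
}
logreg_params = {"C": 1.0, "class_weight": "balanced", "solver": "saga"}

# Convert the string "(1, 1)" into the tuple (1, 1) that TfidfVectorizer expects.
tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])

# Assumption: the config's seed (42) is used as the classifier's random_state.
pipeline = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("logreg", LogisticRegression(random_state=42, **logreg_params)),
    ]
)

# Toy data, made up for this example.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "new laptop ships with a faster processor",
    "the phone update improves battery life",
]
targets = ["sports", "sports", "tech", "tech"]

pipeline.fit(texts, targets)
predictions = pipeline.predict(["the goalkeeper saved a penalty"])
```

Any other keyword argument accepted by TfidfVectorizer or LogisticRegression could be added to the corresponding dictionary in the same way.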

Output

After training the model, the pipeline produces the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
  • config.yaml - config that was used to train the model
  • logging.txt - logging file
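
The saved files can then be loaded back for inference. The sketch below assumes that target_names.json is a JSON object keyed by the encoded label and that the saved model predicts those encoded labels. The helper load_artifacts is hypothetical, and a DummyClassifier is used as a stand-in for a trained pipeline so that the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.dummy import DummyClassifier


def load_artifacts(model_folder):
    """Load the saved model and the label mapping from an output folder."""
    folder = Path(model_folder)
    model = joblib.load(folder / "model.joblib")
    with open(folder / "target_names.json", encoding="utf-8") as fp:
        raw_mapping = json.load(fp)
    # JSON object keys are always strings, so convert them back to integers.
    target_names = {int(label): name for label, name in raw_mapping.items()}
    return model, target_names


# Stand-in artifacts (not a real trained model) so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    stand_in = DummyClassifier(strategy="most_frequent").fit([[0], [0]], [1, 1])
    joblib.dump(stand_in, folder / "model.joblib")
    (folder / "target_names.json").write_text(
        json.dumps({"0": "sports", "1": "tech"}), encoding="utf-8"
    )

    model, target_names = load_artifacts(folder)
    encoded = model.predict([[0]])  # encoded labels, here always 1
    decoded = [target_names[int(label)] for label in encoded]
```

With a real model folder, model.predict would take a list of raw texts instead of the dummy numeric input used here.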

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate a citation using the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}

Releases (latest: v0.1.6)
  • v0.1.6 (Nov 6, 2021)

    Release v0.1.6

    • fixed token frequency support (#85)
    • fixed threshold selection for binary classification (#86)
  • v0.1.5 (Oct 21, 2021)

    Release v0.1.5 🥳🎉🍾

    • added pymorphy2 lemmatization (#81)
    • added token frequency support (#85)
    • added threshold selection for binary classification (#86)
    • added arbitrary save folder name (#83)

    pymorphy2 lemmatization (config.yaml)

    # preprocessing
    # (included in resulting model pipeline, so preserved for inference)
    preprocessing:
      lemmatization: pymorphy2
    

    token frequency support

    • text_clf.token_frequency.get_token_frequency(path_to_config) -
      get token frequency of train dataset according to the config file parameters
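
For intuition, token frequency over a set of texts can be sketched with the standard library alone. This is not the implementation behind get_token_frequency. The helper count_tokens and the sample texts are made up, and the token pattern is the default one used by sklearn's TfidfVectorizer (words of two or more characters).

```python
import re
from collections import Counter

# Default token pattern of sklearn's TfidfVectorizer: words with 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(texts, lowercase=True):
    """Count how often each token occurs across all texts."""
    counter = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counter.update(TOKEN_PATTERN.findall(text))
    return counter


# Toy corpus, made up for this example.
texts = ["The cat sat on the mat", "The dog chased the cat"]
token_frequency = count_tokens(texts)
most_common = token_frequency.most_common(2)
```

Sorting tokens by frequency like this is a quick way to decide on values for options such as max_df and min_df.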

    threshold selection for binary classification

    • text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
      get precision and recall metrics for precision-recall curve
    • text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
      get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
      plot precision-recall curve
    • text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
      plot roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
      plot precision, recall, f1-score curves for probability thresholds
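
As a rough illustration of the idea behind threshold selection, the sketch below picks the probability threshold that maximizes the F1-score, using sklearn's precision_recall_curve directly. It does not call the text_clf functions listed above, whose return values are not documented here. The labels and scores are made-up toy data, and maximizing F1 is only one possible selection criterion.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy binary labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more element than thresholds
# (a final point with precision 1 and recall 0), so drop it to align the arrays.
p, r = precision[:-1], recall[:-1]

# F1-score per threshold; the maximum guards against division by zero.
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_index = int(np.argmax(f1))
best_threshold = float(thresholds[best_index])
best_f1 = float(f1[best_index])
```

Texts with a predicted probability at or above best_threshold would then be assigned to the positive class.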

    arbitrary save folder name (config.yaml)

    experiment_name: model
    
  • v0.1.4 (Oct 10, 2021)

    • fixed load_20newsgroups.py (#65 #71)
    • added Makefile (#71)
    • added logging confusion matrix (#72)
    • replaced all "valid" occurrences with "test" (#74)
    • updated docstrings (#77)
    • changed python interface - train function returns model and target_names_mapping (#78)
  • v0.1.3 (Sep 2, 2021)

  • v0.1.2 (Aug 19, 2021)

  • v0.1.1 (Aug 11, 2021)

  • v0.1.0 (Aug 7, 2021)

Owner
Dani El-Ayyass
NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU