LazyText

Overview

LazyText is inspired by the idea of lazypredict, a library which helps build a lot of basic models without much code. LazyText is for text what lazypredict is for numeric data.

  • Free Software: MIT licence

Installation

To install LazyText:

pip install lazytext

Usage

To use lazytext, import it in your project as follows:

from lazytext.supervised import LazyTextPredict

Text Classification

Text classification on BBC News article classification.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from lazytext.supervised import LazyTextPredict
import re
import nltk

# Load the dataset
df = pd.read_csv("tests/assets/bbc-text.csv")
df.dropna(inplace=True)

# Download models required for text cleaning
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# split the data into train set and test set
df_train, df_test = train_test_split(df, test_size=0.3, random_state=13)

# Tokenize the words
df_train['clean_text'] = df_train['text'].apply(nltk.word_tokenize)
df_test['clean_text'] = df_test['text'].apply(nltk.word_tokenize)

# Remove stop words
stop_words=set(nltk.corpus.stopwords.words("english"))
df_train['text_clean'] = df_train['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])
df_test['text_clean'] = df_test['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])

# Remove numbers, punctuation and special characters (only keep words)
regex = '[a-z]+'
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

# Lemmatization
lem = nltk.stem.wordnet.WordNetLemmatizer()
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

# Join the words again to form sentences
df_train["clean_text"] = df_train.text_clean.apply(lambda x: " ".join(x))
df_test["clean_text"] = df_test.text_clean.apply(lambda x: " ".join(x))

# Tfidf vectorization
vectorizer = TfidfVectorizer()

x_train = vectorizer.fit_transform(df_train.clean_text)
x_test = vectorizer.transform(df_test.clean_text)
y_train = df_train.category.tolist()
y_test = df_test.category.tolist()

lazy_text = LazyTextPredict(
    classification_type="multiclass",
    )
models = lazy_text.fit(x_train, x_test, y_train, y_test)


Label Analysis

| Classes             | Weights              |
|--------------------:|---------------------:|
| tech                | 0.8725490196078431   |
| politics            | 1.1528497409326426   |
| sport               | 1.0671462829736211   |
| entertainment       | 0.8708414872798435   |
| business            | 1.1097256857855362   |

Result Analysis

| Model                         | Accuracy            | Balanced Accuracy   | F1 Score            | Custom Metric Score | Time Taken (s)      |
| ----------------------------: | -------------------:| -------------------:| -------------------:| -------------------:| -------------------:|
| AdaBoostClassifier            | 0.7260479041916168  | 0.717737172132769   | 0.7248335989941609  | NA                  | 1.829047679901123   |
| BaggingClassifier             | 0.8817365269461078  | 0.8796633962363677  | 0.8814695332332374  | NA                  | 3.5215072631835938  |
| BernoulliNB                   | 0.9535928143712575  | 0.9505929193425733  | 0.9533647387436917  | NA                  | 0.020041465759277344|
| CalibratedClassifierCV        | 0.9760479041916168  | 0.9760018220340847  | 0.9755904096436046  | NA                  | 0.4990670680999756  |
| ComplementNB                  | 0.9760479041916168  | 0.9752329192546583  | 0.9754237510855159  | NA                  | 0.013598203659057617|
| DecisionTreeClassifier        | 0.8532934131736527  | 0.8473956671194278  | 0.8496464898940103  | NA                  | 0.478792667388916   |
| DummyClassifier               | 0.2155688622754491  | 0.2                 | 0.07093596059113301 | NA                  | 0.008046865463256836|
| ExtraTreeClassifier           | 0.7275449101796407  | 0.7253518459908658  | 0.7255575847020816  | NA                  | 0.026398658752441406|
| ExtraTreesClassifier          | 0.9655688622754491  | 0.9635363285903302  | 0.9649837485086689  | NA                  | 1.6907336711883545  |
| GradientBoostingClassifier    | 0.9565868263473054  | 0.9543725191544354  | 0.9554606292723953  | NA                  | 39.16400766372681   |
| KNeighborsClassifier          | 0.938622754491018   | 0.9370053693959814  | 0.9367294513157219  | NA                  | 0.14803171157836914 |
| LinearSVC                     | 0.9745508982035929  | 0.974262691599302   | 0.9740343976103922  | NA                  | 0.10053229331970215 |
| LogisticRegression            | 0.968562874251497   | 0.9668995859213251  | 0.9678778814908909  | NA                  | 2.9565982818603516  |
| LogisticRegressionCV          | 0.9715568862275449  | 0.9708896757262861  | 0.971147482393915   | NA                  | 109.64091444015503  |
| MLPClassifier                 | 0.9760479041916168  | 0.9753381642512078  | 0.9752912960666735  | NA                  | 35.64296746253967   |
| MultinomialNB                 | 0.9700598802395209  | 0.9678795721187026  | 0.9689200656860745  | NA                  | 0.024427413940429688|
| NearestCentroid               | 0.9520958083832335  | 0.9499045135454718  | 0.9515097876015481  | NA                  | 0.024636268615722656|
| NuSVC                         | 0.9670658682634731  | 0.9656159420289855  | 0.9669719954040374  | NA                  | 8.287142515182495   |
| PassiveAggressiveClassifier   | 0.9775449101796407  | 0.9772388820754925  | 0.9770812340935414  | NA                  | 0.10332632064819336 |
| Perceptron                    | 0.9775449101796407  | 0.9769254658385094  | 0.9768161404324825  | NA                  | 0.07216000556945801 |
| RandomForestClassifier        | 0.9625748502994012  | 0.9605135542632081  | 0.9624462948504477  | NA                  | 1.2427525520324707  |
| RidgeClassifier               | 0.9775449101796407  | 0.9769254658385093  | 0.9769176825464448  | NA                  | 0.17272400856018066 |
| SGDClassifier                 | 0.9700598802395209  | 0.9695007868373973  | 0.969787370271274   | NA                  | 0.13134551048278809 |
| SVC                           | 0.9715568862275449  | 0.9703778467908902  | 0.9713021262026043  | NA                  | 8.388679027557373   |

The result for each estimator is stored in models, which is a list of dictionaries; each trained estimator is also returned and can be used for further analysis.

The confusion matrix and classification report for each estimator are also included if they are needed.

print(models[0])
{
    'name': 'AdaBoostClassifier',
    'accuracy': 0.7260479041916168,
    'balanced_accuracy': 0.717737172132769,
    'f1_score': 0.7248335989941609,
    'custom_metric_score': 'NA',
    'time': 1.829047679901123,
    'model': AdaBoostClassifier(),
    'confusion_matrix': array([
        [ 89,   5,  12,  35,   3],
        [  8,  58,   5,  44,   0],
        [  5,   2, 108,  10,   1],
        [  5,   7,   5, 138,   2],
        [ 25,   5,   1,   3,  92]]),
 'classification_report':
 """
            precision    recall  f1-score   support
        0       0.67      0.62      0.64       144
        1       0.75      0.50      0.60       115
        2       0.82      0.86      0.84       126
        3       0.60      0.88      0.71       157
        4       0.94      0.73      0.82       126
 accuracy                           0.73       668
 macro avg       0.76      0.72     0.72       668
 weighted avg    0.75      0.73     0.72       668'}
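
Since each entry in models holds the fitted estimator under the 'model' key, the best-performing estimator can be picked out and reused directly. The snippet below is a minimal sketch of that idea, assuming the models list and x_test matrix from the example above.

# Pick the entry with the highest F1 score and reuse its fitted estimator.
# Assumes the 'model' key holds the already-trained scikit-learn estimator.
best = max(models, key=lambda m: m["f1_score"])
print(best["name"], best["f1_score"])

predictions = best["model"].predict(x_test)
print(predictions[:5])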

Custom metrics

LazyText also supports a custom metric for evaluation. This metric can be set up as follows:

from lazytext.supervised import LazyTextPredict

# Custom metric: any callable taking (y_true, y_pred) and returning a score
def my_custom_metric(y_true, y_pred):
    # compute your score here, for example plain accuracy
    score = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return score


lazy_text = LazyTextPredict(custom_metric=my_custom_metric)
lazy_text.fit(X_train, X_test, y_train, y_test)

If the custom metric function's signature does not match the one shown above, the metric will be ignored even though it is provided.
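
Any callable with that (y_true, y_pred) signature should work, so an existing scikit-learn metric can be wrapped as well. A minimal sketch (the wrapper name is only illustrative):

from sklearn.metrics import matthews_corrcoef

# Wrap an existing scikit-learn metric so it matches the expected signature.
def mcc_metric(y_true, y_pred):
    return matthews_corrcoef(y_true, y_pred)

lazy_text = LazyTextPredict(custom_metric=mcc_metric)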

Custom model parameters

LazyText also supports passing parameters to the estimators. Provide a list of dictionaries of parameters, as shown below, and those arguments will be applied to the matching estimator.

In the following example, the default parameters of the SVC classifier are changed.

LazyText will fit all the models, but will only change the default parameters of SVC in this case.

from lazytext.supervised import LazyTextPredict

custom_parameters = [
    {
        "name": "SVC",
        "parameters": {
            "C": 0.5,
            "kernel": 'poly',
            "degree": 5
        }
    }
]


l = LazyTextPredict(
    classification_type="multiclass",
    custom_parameters=custom_parameters
    )
l.fit(x_train, x_test, y_train, y_test)
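
Multiple estimators can presumably be customized in the same way by adding one entry per estimator to the custom_parameters list, with names matching the estimator class names shown in the results table. A minimal sketch under that assumption (the LogisticRegression entry is only illustrative):

custom_parameters = [
    {"name": "SVC", "parameters": {"C": 0.5, "kernel": "poly", "degree": 5}},
    {"name": "LogisticRegression", "parameters": {"max_iter": 500}},
]

l = LazyTextPredict(
    classification_type="multiclass",
    custom_parameters=custom_parameters,
    )
models = l.fit(x_train, x_test, y_train, y_test)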