An open-source NLP library: fast text cleaning and preprocessing.

Last update: Mar 18, 2022

Overview

🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, extra whitespace, and more.

Installation

To install dobbi, either fork this GitHub repo or install it from PyPI via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('Check here: https://some-url.com')

Supported methods and patterns

The process consists of three stages:

Initialization methods: initialize a dobbi Work object
Intermediate methods: chain patterns in the needed order
Terminal methods: choose if you need a function or a result

Initialization functions:

dobbi.clean()
dobbi.collect()
dobbi.replace()

Intermediate methods (pattern processing choice):

regexp() - custom regular expressions
url() - URLs
html() - HTML and "<...>" type markups
punctuation() - punctuation
hashtag() - hashtags
emoji() - emoji
emoticons() - emoticons
whitespace() - any type of whitespace
nickname() - @-starting nicknames

Terminal methods:

execute(str) - executes chosen methods on the provided string.
function() - returns a function which is a combination of the chosen methods.
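
The three stages can be pictured with a toy builder. The sketch below is not dobbi's actual implementation: the `Pipeline` class, its internals, and the regular expression used for hashtags are all hypothetical and only illustrate how initialization, chained intermediate methods, and terminal methods might fit together.

```python
import re
from typing import Callable, List


class Pipeline:
    """Toy stand-in for a chained text-cleaning object (hypothetical, not dobbi's code)."""

    def __init__(self) -> None:
        # Initialization stage: start with an empty list of steps.
        self._steps: List[Callable[[str], str]] = []

    def regexp(self, pattern: str, repl: str = "") -> "Pipeline":
        # Intermediate stage: register a step and return self so calls can be chained.
        compiled = re.compile(pattern)
        self._steps.append(lambda text: compiled.sub(repl, text))
        return self

    def hashtag(self) -> "Pipeline":
        # Assumed hashtag pattern, chosen for illustration only.
        return self.regexp(r"#\w+")

    def whitespace(self) -> "Pipeline":
        # Collapse runs of whitespace and trim both ends.
        self._steps.append(lambda text: " ".join(text.split()))
        return self

    def function(self) -> Callable[[str], str]:
        # Terminal stage: return a reusable function that applies the steps in order.
        steps = list(self._steps)

        def run(text: str) -> str:
            for step in steps:
                text = step(text)
            return text

        return run

    def execute(self, text: str) -> str:
        # Terminal stage: apply the steps to one string immediately.
        return self.function()(text)


cleaned = Pipeline().hashtag().whitespace().execute("#fun   so  funny")
print(cleaned)
```

Returning `self` from every intermediate method is what makes the chain possible, and keeping the steps in a list preserves the order in which they were chained.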

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and URLs with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'
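
For comparison, a rough equivalent using only the standard `re` module might look like the sketch below. The three patterns and the final whitespace normalization are assumptions made for illustration; dobbi's built-in patterns may differ.

```python
import re

text = "#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com"

# (pattern, replacement) pairs, applied in order. The patterns are assumptions.
replacements = [
    (r"#\w+", ""),                              # drop hashtags
    (r"@\w+", "TOKEN_NICKNAME"),                # replace nicknames with a token
    (r"https?://\S+", "__CUSTOM_URL_TOKEN__"),  # replace URLs with a custom token
]

for pattern, token in replacements:
    text = re.sub(pattern, token, text)

# Collapse leftover runs of whitespace and trim the ends.
text = " ".join(text.split())
print(text)
```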

3) Get the text cleanup function (one-liner)

~~Please, try to avoid the in-line method chaining, as it is less readable.~~ Do as your heart tells you.

func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'

4) Chain regexp methods

dobbi.clean() \
    .regexp(r'#\w+') \
    .regexp(r'@\w+') \
    .regexp(r'https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

Additional

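Note that the functions are applied in the order in which you chain them. It is therefore usually best to chain .punctuation() as one of the last functions: if punctuation is removed first, characters such as '#' and '@' are stripped, and the hashtag and nickname patterns can no longer match.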
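
The effect of ordering can be reproduced with the standard library alone. The helpers below are hypothetical stand-ins for the corresponding cleaning steps and are not dobbi's implementation.

```python
import re
import string

text = "#fun Why @Alex33 is so funny?"


def strip_punctuation(s: str) -> str:
    # Remove every ASCII punctuation character, including '#' and '@'.
    return s.translate(str.maketrans("", "", string.punctuation))


def strip_hashtags(s: str) -> str:
    # Assumed hashtag pattern, for illustration only.
    return re.sub(r"#\w+", "", s)


def squeeze(s: str) -> str:
    # Collapse runs of whitespace and trim the ends.
    return " ".join(s.split())


# Punctuation first: '#' is already gone, so the hashtag word survives.
punct_first = squeeze(strip_hashtags(strip_punctuation(text)))

# Punctuation last: the hashtag is removed while '#' is still present.
punct_last = squeeze(strip_punctuation(strip_hashtags(text)))

print(punct_first)  # fun Why Alex33 is so funny
print(punct_last)   # Why Alex33 is so funny
```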

Call for collaboration 🤗

If you enjoyed the project, I would be grateful if you supported it :)

Below is a list of contributions I would be happy to receive:

Finding bugs
Making code optimizations
Writing tests
Helping with new feature development

Releases(v0_13)

v0_13(Oct 29, 2021)

v0_10(Oct 19, 2021)

v0_06(Oct 18, 2021)

v0_03(Oct 16, 2021)

v0_02(Oct 16, 2021)

v0_01(Oct 16, 2021)

Owner

Iaroslav
