A Python package for deep multilingual punctuation prediction.

Last update: Dec 22, 2022

Overview

Deep Multilingual Punctuation Prediction

This Python library predicts punctuation for English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

It uses our "FullStop" model, which we trained on the Europarl dataset. Please note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.

The code restores the following punctuation markers: "." "," "?" "-" ":"

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Usage

The PunctuationModel class can process texts of any length. Note that processing very long texts can be time-consuming.
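A common way to handle inputs longer than a model's window is to split the word list into overlapping chunks and process each chunk in turn. The sketch below illustrates that general technique only; the helper name `overlap_chunks` and the `size` and `overlap` values are assumptions chosen for the example, and nothing here states how the package handles long texts internally.

```python
from typing import Iterator, List


def overlap_chunks(words: List[str], size: int, overlap: int) -> Iterator[List[str]]:
    """Yield windows of up to `size` words; neighbouring windows share `overlap` words.

    Hypothetical helper for illustration, not part of the package API.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # Short inputs fit into a single window, so no overlap is needed.
    if len(words) <= size:
        yield list(words)
        return
    # Stop early enough that no window consists only of already-seen words.
    for start in range(0, len(words) - overlap, size - overlap):
        yield words[start:start + size]


words = "one two three four five six seven eight nine ten".split()
chunks = list(overlap_chunks(words, size=4, overlap=1))
for chunk in chunks:
    print(chunk)
```

Because each window repeats the last word of the previous one, a caller would need to drop the duplicated predictions when joining the per-chunk results.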

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
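Each entry in this output is a word, its predicted label ("0" meaning no punctuation) and a confidence score. As a minimal sketch of how such a list could be post-processed, the example below rebuilds the punctuated text and lists words whose score falls below a threshold. The helper names `labels_to_text` and `uncertain_words` and the 0.9 threshold are assumptions made for illustration; they are not part of the package. The data is copied from the output above, so the snippet runs without loading the model.

```python
PUNCTUATION = {".", ",", "?", "-", ":"}

# Copied from the example output above.
labeled_words = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665], ["is", "0", 0.9998579],
    ["Clara", "0", 0.6752215], ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515], ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047], ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655], ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]


def labels_to_text(predictions):
    """Append each predicted marker to its word and join the words with spaces."""
    parts = []
    for word, label, _score in predictions:
        parts.append(word + label if label in PUNCTUATION else word)
    return " ".join(parts)


def uncertain_words(predictions, threshold=0.9):
    """Return the words whose confidence score is below `threshold`."""
    return [word for word, _label, score in predictions if score < threshold]


text = labels_to_text(labeled_words)
flagged = uncertain_words(labeled_words)
print(text)     # My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
print(flagged)  # ['Clara']
```

Flagging low-confidence words this way can help decide which positions to review by hand in a transcript.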

Results

Performance differs between the individual punctuation markers, because hyphens and colons are often optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

Label	EN	DE	FR	IT
0	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
macro average	0.775	0.814	0.782	0.762
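The macro average row is consistent with an unweighted mean of the six per-label F1 scores in each column. The short check below recomputes it from the table values; the dictionary layout and the name `macro_average` are choices made for this example.

```python
# Per-label F1 scores copied from the table above.
f1_scores = {
    "EN": {"0": 0.991, ".": 0.948, "?": 0.890, ",": 0.819, ":": 0.575, "-": 0.425},
    "DE": {"0": 0.997, ".": 0.961, "?": 0.893, ",": 0.945, ":": 0.652, "-": 0.435},
    "FR": {"0": 0.992, ".": 0.945, "?": 0.871, ",": 0.831, ":": 0.620, "-": 0.431},
    "IT": {"0": 0.989, ".": 0.942, "?": 0.832, ",": 0.798, ":": 0.588, "-": 0.421},
}


def macro_average(scores):
    """Unweighted mean over labels: every label counts equally, however rare it is."""
    return sum(scores.values()) / len(scores)


macro = {lang: round(macro_average(scores), 3) for lang, scores in f1_scores.items()}
for lang, value in macro.items():
    print(lang, value)
```

Since every label carries the same weight, the low scores for ":" and "-" pull the macro average well below the scores of the frequent labels.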

References

Please cite us if you found this useful:

@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

Owner

Oliver Guhr
