📝 An easy-to-use package to restore punctuation in text.


✏️ rpunct - Restore Punctuation


This repo contains code for punctuation restoration.

This package is intended for direct use as a punctuation restoration model for general English. Alternatively, you can fine-tune it further on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights, fine-tuned for punctuation restoration.

Punctuation restoration works on arbitrarily large texts. It uses a GPU if one is available and otherwise defaults to the CPU.

List of punctuation marks (plus casing) we restore:

  • Upper-casing
  • Period: .
  • Exclamation: !
  • Question Mark: ?
  • Comma: ,
  • Colon: :
  • Semi-colon: ;
  • Apostrophe: '
  • Dash: -

🚀 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  2. Run the sample Python code below.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
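
The constructor also exposes a wrds_per_pred argument (its name is visible in rpunct/punctuate.py in the traceback quoted in the comments below), which controls how many words are fed into each model pass when long inputs are split into chunks. A minimal sketch, assuming the default value is 250:

from rpunct import RestorePuncts

# Larger windows mean fewer model passes on long texts; 250 words per
# prediction is assumed to be the library default here.
rpunct = RestorePuncts(wrds_per_pred=250)
print(rpunct.punctuate("hello world how are you"))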

🎯 Accuracy

Here is the number of text samples (product reviews) used for fine-tuning the model:

Language    Number of text samples
English     560,000

We found the best convergence around 3 epochs; that checkpoint is what is presented here and available for download.


The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy    Overall F1    Eval Support
91%         90%           45,990
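
For reference, here is a minimal sketch of how token-level accuracy and a macro-averaged "overall F1" can be computed from held-out samples, assuming you already have aligned lists of reference and predicted punctuation tags (the tag names below are illustrative, not the model's actual label set):

from sklearn.metrics import accuracy_score, f1_score

# Illustrative aligned token tags: "O" means no punctuation follows the token.
y_true = ["O", ",", "O", ".", "O", "?", "O"]
y_pred = ["O", ",", "O", ".", ",", "?", "O"]

print(f"Accuracy:   {accuracy_score(y_true, y_pred):.0%}")
# Macro-average F1 across all tag classes, in the spirit of an "Overall F1".
print(f"Overall F1: {f1_score(y_true, y_pred, average='macro'):.0%}")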

💻 Further Fine-Tuning

To start fine-tuning or training, please look at the training/train.py file. Running python training/train.py will replicate the results of this model.
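
The exact data format train.py expects is not shown here, but since the package builds on simpletransformers' NERModel (see the traceback in the comments below), fine-tuning is essentially a token-classification job. A hedged sketch with made-up tag names and toy data:

import pandas as pd
from simpletransformers.ner import NERModel

# Toy training data: one row per word, tagged with the casing/punctuation
# action to apply. These tag names are placeholders, not the package's real tag set.
train_data = pd.DataFrame(
    [
        (0, "hello", "U"),    # upper-case this word
        (0, "world", "O."),   # append a period
        (1, "how", "U"),
        (1, "are", "O"),
        (1, "you", "O?"),     # append a question mark
    ],
    columns=["sentence_id", "words", "labels"],
)

model = NERModel(
    "bert",
    "bert-base-uncased",             # the same base weights this package fine-tunes
    labels=sorted(train_data["labels"].unique()),
    args={"num_train_epochs": 3},    # ~3 epochs gave the best convergence above
    use_cuda=False,                  # set True on a GPU machine
)
model.train_model(train_data)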


Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.


Comments
  • Update requirements.txt

    ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0)
    ERROR: No matching distribution found for torch==1.8.1

    opened by Rukaya-lab 0
  • Forked repo with fixes

    I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

    pip install git+https://github.com/samwaterbury/rpunct.git
    

    Hopefully this repository gets updated or another maintainer is assigned. Thanks to the creator @Felflare, this is a useful tool!

    opened by samwaterbury 2
  • Requirements shouldn't ask for such specific versions

    First, thanks a lot for providing this package :)

    Currently, the requirements.txt, and thus the dependencies in setup.py, pin very specific versions of PyTorch etc. This shouldn't be the case if you want this package to be usable as a general library (imagine a second package that does the same but asks for an incompatible version of PyTorch, which would prevent the two from ever being installed together). The end user might also need a more recent version of PyTorch. Given that PyTorch is almost always backward compatible and quite stable, I think its requirement could be relaxed from ==1.8.1 to >=1.8.1. I believe the same holds for the other packages.
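
    For illustration, a requirements.txt relaxed along those lines might read as follows (only the torch pin is confirmed in this thread; the floor version is an assumption):

    # before: torch==1.8.1  (blocks any environment with a newer torch)
    # after:  a floor instead of a pin
    torch>=1.8.1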

    opened by adefossez 2
  • Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Thanks for the great library! When running this without a GPU, I had problems. I think there is a simple fix: the simpletransformers NER model defaults to enabling CUDA. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model, so you can now run the code on a CPU by initializing rpunct like so:

    rpunct = RestorePuncts(ner_args={"use_cuda": False})
    

    Before this change, when running rpunct examples on the CPU the following error occurs:

    from rpunct import RestorePuncts
    # The default language is 'english'
    rpunct = RestorePuncts()
    rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
    by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
    a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
    professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
    3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
    
    

    ValueError                                Traceback (most recent call last)
    /var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in <module>
          1 from rpunct import RestorePuncts
          2 # The default language is 'english'
    ----> 3 rpunct = RestorePuncts()
          4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
          5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

    ~/repos/rpunct/rpunct/punctuate.py in __init__(self, wrds_per_pred, ner_args)
         19         if ner_args is None:
         20             ner_args = {}
    ---> 21         self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels,
         22                               args={"silent": True, "max_seq_length": 512}, **ner_args)
         23

    ~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in __init__(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs)
        209                 self.device = torch.device(f"cuda:{cuda_device}")
        210             else:
    --> 211                 raise ValueError(
        212                     "'use_cuda' set to True when cuda is unavailable."
        213                     "Make sure CUDA is available or set use_cuda=False."

    ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
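
    With a change like this, a caller can also pick the device automatically instead of hard-coding it. A small sketch, assuming the ner_args parameter this PR introduces:

    import torch
    from rpunct import RestorePuncts

    # Enable CUDA only when a GPU is actually present; fall back to CPU otherwise.
    rpunct = RestorePuncts(ner_args={"use_cuda": torch.cuda.is_available()})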

    opened by nbertagnolli 1
  • add use_cuda parameter

    Using the package in an environment without CUDA support causes it to fail. Adding a parameter to turn CUDA off when necessary allows it to function normally.

    opened by mjfox3 1
Releases
  • 1.0.1

Owner
Daulet Nurmanbetov (Deep Learning, AI and Finance)