a pytorch implementation of auto-punctuation learned character by character

Overview

Learning Auto-Punctuation by Reading Engadget Articles

DOI

Link to Other of my work

🌟 Deep Learning Notes: A collection of my notes going from basic multi-layer perceptron to convNet and LSTMs, Tensorflow to pyTorch.

💥 Deep Learning Papers TLDR; A growing collection of my notes on deep learning papers! So far covers the top papers from this years ICLR.

Overview

This project trains a bi-directional GRU to learn how to automatically punctuate a sentence by reading blog posts from Engadget.com character by character. The set of operation it learns include:

capitalization: <cap>
         comma:  ,
        period:  .
   dollar sign:  $
     semicolon:  ;
         colon:  :
  single quote:  '
  double quote:  "
  no operation: <nop>

Performance

After 24 epochs of training, the network achieves the following performance on the test-set:

    Test P/R After 24 Epochs 
    =================================
    Key: <nop>	Prec:  97.1%	Recall:  97.8%	F-Score:  97.4%
    Key: <cap>	Prec:  68.6%	Recall:  57.8%	F-Score:  62.7%
    Key:   ,	Prec:  30.8%	Recall:  30.9%	F-Score:  30.9%
    Key:   .	Prec:  43.7%	Recall:  38.3%	F-Score:  40.8%
    Key:   '	Prec:  76.9%	Recall:  80.2%	F-Score:  78.5%
    Key:   :	Prec:  10.3%	Recall:   6.1%	F-Score:   7.7%
    Key:   "	Prec:  26.9%	Recall:  45.1%	F-Score:  33.7%
    Key:   $	Prec:  64.3%	Recall:  61.6%	F-Score:  62.9%
    Key:   ;	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   ?	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   !	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A

As a frist attempt, the performance is pretty good! Especially since I did not fine tune with a smaller step size afterward, and the Engadget dataset used here is small in size (4MB total).

Double the training gives a small improvement.

Table 2. After 48 epochs of training

    Test P/R  Epoch 48 Batch 380
    =================================
    Key: <nop>	Prec:  97.1%	Recall:  98.0%	F-Score:  97.6%
    Key: <cap>	Prec:  73.2%	Recall:  58.9%	F-Score:  65.3%
    Key:   ,	Prec:  35.7%	Recall:  32.2%	F-Score:  33.9%
    Key:   .	Prec:  45.0%	Recall:  39.7%	F-Score:  42.2%
    Key:   '	Prec:  81.7%	Recall:  83.4%	F-Score:  82.5%
    Key:   :	Prec:  12.1%	Recall:  10.8%	F-Score:  11.4%
    Key:   "	Prec:  25.2%	Recall:  44.8%	F-Score:  32.3%
    Key:   $	Prec:  51.4%	Recall:  87.8%	F-Score:  64.9%
    Key:   ;	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   ?	Prec:   5.6%	Recall:   4.8%	F-Score:   5.1%
    Key:   !	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A

Usage

If you feel like using some of the code, you can cite this project via

@article{deeppunc,
  title={Deep-Auto-Punctuation},
  author={Yang, Ge},
  journal={arxiv},
  year={2017},
  doi={10.5281/zenodo.438358}
  url={https://zenodo.org/record/438358;
       https://github.com/episodeyang/deep-auto-punctuation}
}

To run

First unzip the engagdget data into folder ./engadget_data by running

tar -xvzf engadget_data.tar.gz

and then open up the notebook Learning Punctuations by reading Engadget.pynb, and you can just execute.

To view the reporting, open a visdom server by running

python visdom.server

and then go to http://localhost:8097

Requirements

pytorch numpy matplotlib tqdm bs4

Model Setup and Considerations

The initial setup I began with was a single uni-direction GRU, with input domain [A-z0-9] and output domain of the ops listed above. My hope at that time was to simply train the RNN to learn correcponding operations. A few things jumped out during the experiment:

  1. Use bi-directional GRU. with the uni-direction GRU, the network quickly learned capitalization of terms, but it had difficulties with single quote. In words like "I'm", "won't", there are simply too much ambiguity from reading only the forward part of the word. The network didn't have enough information to properly infer such punctuations.

    So I decided to change the uni-direction GRU to bi-direction GRU. The result is much better prediction for single quotes in concatenations.

    the network is still training, but the precision and recall of single quote is nowt close to 80%.

    This use of bi-directional GRU is standard in NLP processes. But it is nice to experience first-hand the difference in performance and training.

    A side effect of this switch is that the network now runs almost 2x slower. This leads to the next item in this list:

  2. Use the smallest model possible. At the very begining, my input embeding was borrowed from the Shakespeare model, so the input space include both capital alphabet as well as lower-case ones. What I didn't realize was that I didn't need the capital cases because all inputs were lower-case.

    So when the training became painfully slow after I switch to bi-directional GRU, I looked for ways to make the training faster. A look at the input embeding made it obvious that half of the embedding space wasn't needed.

    Removing the lower case bases made the traing around 3x faster. This is a rough estimate since I also decided to redownload the data set at the same time on the same machine.

  3. Text formatting. Proper formating of input text crawed from Engadget.com was crucial, especially because the occurrence of a lot of the puncuation was low and this is a character-level model. You can take a look at the crawed text inside ./engadget_data_tar.gz.

  4. Async and Multi-process crawing is much much faster. I initially wrote the engadget crawer as a single threaded class. Because the python requests library is synchronous, the crawler spent virtually all time waiting for the GET requests.

    This could be made a lot faster by parallelizing the crawling, or use proper async pattern.

    This thought came to me pretty late during the second crawl so I did not implement it. But for future work, parallel and async crawler is going to be on the todo list.

  5. Using Precision/Recall in a multi-class scenario. The setup makes the reasonable assumption that each operation can only be applied mutually exclusively. The accuracy metric used here are precision/recall and the F-score, both commonly used in the literature1, 2. The P/R and F-score are implemented according to wikipedia 3, 4.

    example accuracy report:

    Epoch 0 Batch 400 Test P/R
    =================================
    Key: <nop>	Prec:  99.1%	Recall:  96.6%	F-Score:  97.9%
    Key:   ,	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key: <cap>	Prec: 100.0%	Recall:  75.0%	F-Score:  85.7%
    Key:   .	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   '	Prec:  66.7%	Recall: 100.0%	F-Score:  80.0%
    
    
    true_p:	{'<nop>': 114, '<cap>': 3, "'": 2}
    p:	{'<nop>': 118, '<cap>': 4, "'": 2}
    all_p:	{'<nop>': 115, ',': 2, '<cap>': 3, '.': 1, "'": 3}
    
    400it [06:07,  1.33s/it]
    
  6. Hidden Layer initialization: In the past I've found it was easier for the neural network to generate good results when both the training and the generation starts with a zero initial state. In this case because we are computing time limited, I zero the hidden layer at the begining of each file.

  7. Mini-batches and Padding: During training, I first sort the entire training set by the length of each file (there are 45k of them) and arrange them in batches, so that files inside each batch are roughly similar size, and only minimal padding is needed. Sometimes the file becomes too long. In that case I use data.fuzzy_chunk_length() to calculate a good chunk length with heuristics. The result is mostly no padding during most of the trainings.

    Going from having no mini-batch to having a minibatch of 128, the time per batch hasn't changed much. The accuracy report above shows the training result after 24 epochs.

Data and Cross-Validation

The entire dataset is composed of around 50k blog posts from engadget. I randomly selected 49k of these as my training set, 50 as my validation set, and around 0.5k as my test set. The training is a bit slow on an Intel i7 desktop, averaging 1.5s/file depending on the length of the file. As a result, it takes about a day to go through the entire training set.

Todo:

All done.

Done:

  • execute demo test after training
  • add final performance metric
  • implement minibatch
  • a generative demo
  • add validation (once an hour or so)
  • add accuracy metric, use precision/recall.
  • change to bi-directional GRU
  • get data
  • Add temperature to generator
  • add self-feeding generator
  • get training to work
  • use optim and Adam

References

1: https://www.aclweb.org/anthology/D/D16/D16-1111.pdf
2: https://phon.ioc.ee/dokuwiki/lib/exe/fetch.php?media=people:tanel:interspeech2015-paper-punct.pdf
3: https://en.wikipedia.org/wiki/precision_and_recall
4: https://en.wikipedia.org/wiki/F1_score

Comments
  • Can't run example training

    Can't run example training

    Hi there,

    As a first contact with your project I am trying to run it 'as is', but it seems I can't run a proper training with the example notebook.

    All preliminary steps work fine, but when launching the training step, the only output I get is 0it [00:00, ?it/s], repeatedly, and nothing further is happening either on logs or in the directories (nothing is saved).

    By the way, I believe some indentation is missed for the last 2 lines of the block (need to be inside the for loop for batch_ind to be defined).

    I managed to test the pretrained models though.

    Thanks for your implementation anyways, and thanks in advance for your reply! Hope to get the code running properly soon :)

    F

    opened by francoishernandez 5
  • view size not compatible with input tensor size and stride

    view size not compatible with input tensor size and stride

    Traceback (most recent call last): File "train.py", line 71, in <module> egdt.forward(input_, target_) File "/home/flow/deep-auto-punctuation/model.py", line 90, in forward self.next_(input_batch) File "/home/flow/deep-auto-punctuation/model.py", line 111, in next_ self.output, self.hidden = self.model(self.embed(input_text), self.hidden) File "/home/flow/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/home/flow/deep-auto-punctuation/model.py", line 32, in forward output = self.decoder(gru_output.view(-1, self.hidden_size * self.bi_mul)) RuntimeError: invalid argument 2: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Call .contiguous() before .view(). at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/TH/generic/THTensor.cpp:237

    Any idea what could be causing this?

    opened by sebastianvermaas 3
  • input string has differnt length from punctuation list

    input string has differnt length from punctuation list

    Hi, i've tried your code and got assertion error in data module due to length of input string and punctuation was not the same. I loaded your latest model without trained it again. do you have any advice ?

    gits

    opened by laluarif93 0
  •  Adding a license

    Adding a license

    Hello,

    Would you be willing to add a license to this repository? See: https://opensource.stackexchange.com/questions/1720/what-can-i-assume-if-a-publicly-published-project-has-no-license

    Thank you.

    opened by Tejash241 0
  • Issue with visdom.server

    Issue with visdom.server

    Hi there

    Brilliant work I looked up everywhere for a soluton for that but didn't solve the issue

    I keep getting the error message "C:\Users\David\Anaconda3\lib\site-packages\visdom\server.py:39: DeprecationWarning: zmq.eventloop.ioloop is deprecated in pyzmq 17. pyzmq now works with default tornado and asyncio eventloops."

    I try all the combinations of pip install visdom and conda install visdom etc I didn't succeed in compiling your work using the command make run

    Should I use a specific version of visdom or python to use your work?

    Thanks so much for your help Dave

    opened by Best-Trading-Indicator 3
  • Training size too big..Any suggestions?

    Training size too big..Any suggestions?

    Hi! I really am interested in testing and running through your code. Everything is working except for one thing.. When just before training, it says Too many values to unpack. Any suggestions on how to fix this issue? or any smaller size training sets you might know of?

    screen shot 2018-03-02 at 6 59 50 pm

    PS: it says python 2 but my kernel is python 3.

    opened by jlvasquezcollado 2
Releases(v1.0.0)
Owner
Ge Yang
Ge Yang
Square Root Bundle Adjustment for Large-Scale Reconstruction

RootBA: Square Root Bundle Adjustment Project Page | Paper | Poster | Video | Code Table of Contents Citation Dependencies Installing dependencies on

Nikolaus Demmel 205 Dec 20, 2022
Improving Compound Activity Classification via Deep Transfer and Representation Learning

Improving Compound Activity Classification via Deep Transfer and Representation Learning This repository is the official implementation of Improving C

NingLab 2 Nov 24, 2021
Official Implementation of "Designing an Encoder for StyleGAN Image Manipulation"

Designing an Encoder for StyleGAN Image Manipulation (SIGGRAPH 2021) Recently, there has been a surge of diverse methods for performing image editing

749 Jan 09, 2023
[ICLR 2021] HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark

HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark Accepted as a spotlight paper at ICLR 2021. Table of content File structure Prerequi

72 Jan 03, 2023
TensorFlow implementation of "Attention is all you need (Transformer)"

[TensorFlow 2] Attention is all you need (Transformer) TensorFlow implementation of "Attention is all you need (Transformer)" Dataset The MNIST datase

YeongHyeon Park 4 Jan 05, 2022
Learning a mapping from images to psychological similarity spaces with neural networks.

LearningPsychologicalSpaces v0.1: v1.1: v1.2: v1.3: v1.4: v1.5: The code in this repository explores learning a mapping from images to psychological s

Lucas Bechberger 8 Dec 12, 2022
WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

WarpDrive is a flexible, lightweight, and easy-to-use open-source reinforcement learning (RL) framework that implements end-to-end multi-agent RL on a single GPU (Graphics Processing Unit).

Salesforce 334 Jan 06, 2023
Hyperparameters tuning and features selection are two common steps in every machine learning pipeline.

shap-hypetune A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models. Overview Hyperparameters t

Marco Cerliani 422 Jan 08, 2023
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

GCNet for Object Detection By Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu. This repo is a official implementation of "GCNet: Non-local Networ

Jerry Jiarui XU 1.1k Dec 29, 2022
A pytorch &keras implementation and demo of Fastformer.

Fastformer Notes from the authors Pytorch/Keras implementation of Fastformer. The keras version only includes the core fastformer attention part. The

153 Dec 28, 2022
Official PyTorch code of DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization (ICCV 2021 Oral).

DeepPanoContext (DPC) [Project Page (with interactive results)][Paper] DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context G

Cheng Zhang 66 Nov 16, 2022
YuNetのPythonでのONNX、TensorFlow-Lite推論サンプル

YuNet-ONNX-TFLite-Sample YuNetのPythonでのONNX、TensorFlow-Lite推論サンプルです。 TensorFlow-LiteモデルはPINTO0309/PINTO_model_zoo/144_YuNetのものを使用しています。 Requirement Op

KazuhitoTakahashi 8 Nov 17, 2021
"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

SOLQ: Segmenting Objects by Learning Queries This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

MEGVII Research 179 Jan 02, 2023
基于Paddle框架的fcanet复现

fcanet-Paddle 基于Paddle框架的fcanet复现 fcanet 本项目基于paddlepaddle框架复现fcanet,并参加百度第三届论文复现赛,将在2021年5月15日比赛完后提供AIStudio链接~敬请期待 参考项目: frazerlin-fcanet 数据准备 本项目已挂

QuanHao Guo 7 Mar 07, 2022
[NeurIPS 2021] Galerkin Transformer: a linear attention without softmax

[NeurIPS 2021] Galerkin Transformer: linear attention without softmax Summary A non-numerical analyst oriented explanation on Toward Data Science abou

Shuhao Cao 159 Dec 20, 2022
A generalized framework for prototyping full-stack cooperative driving automation applications under CARLA+SUMO.

OpenCDA OpenCDA is a SIMULATION tool integrated with a prototype cooperative driving automation (CDA; see SAE J3216) pipeline as well as regular autom

UCLA Mobility Lab 726 Dec 29, 2022
Code for the USENIX 2017 paper: kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels

kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels Blazing fast x86-64 VM kernel fuzzing framework with performant VM reloads for Linux, MacOS an

Chair for Sys­tems Se­cu­ri­ty 541 Nov 27, 2022
salabim - discrete event simulation in Python

Object oriented discrete event simulation and animation in Python. Includes process control features, resources, queues, monitors. statistical distrib

181 Dec 21, 2022
Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

CLIP-GLaSS Repository for the paper Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search An in-browser demo is

Federico Galatolo 172 Dec 22, 2022
Depth-Aware Video Frame Interpolation (CVPR 2019)

DAIN (Depth-Aware Video Frame Interpolation) Project | Paper Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang IEEE C

Wenbo Bao 7.7k Dec 31, 2022