Data preprocessing rosetta parser for python

Overview

datapreprocessing_rosetta_parser

I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity, specifically targeting popular packages like pandas, beautifulsoup and spacy.

The main idea of my project is to recreate Jelle Teijema's preprocessing pipeline and then try to run Dutch language model on each document to extract things of interest, such as emails, urls, organizations, people and dates. Maybe at this point, it shouldn't be considered just pre-processing, hmmm. Anyway, I've used nl_core_news_lg model. It is not very reliable, especially for organization and person names, however, it still allows for interesting queries.

Moreover, I've decided to try to do a summarization and collection of the most frequent words in the documents. My script tries to find N_SUMMARY_SENTENCES most important sentences and store it in the summary column. Please note, my Dutch is not very strong, so I can't really judge how well it works :)

Finally, the script also saves cleaned title and file contents, as per track anticipated output.

Output file

generate.py reads .csv files from input_data folder and produces output .csv file with | separator. It is pretty heavy (about x1.8 of input csv, ~75MB) and has a total of 15 columns:

Column name Description
filename Original filename provided in the input file
file_content Original file contents provided in the input file
id The dot separated numbers from the filename
category Type of a file
filename_date Date extracted from a filename
parsed_date Date extracted from file contents
found_emails Emails found in the file contents
found_urls URLs found in the file contents
found_organizations Organizations found in the file contents
found_people People found in the file contents
found_dates Dates found in the file contents
summary Summary of the document
top5words Top 5 most frequently used words in the file contents
title Somewhat cleaned title
abstract Somewhat cleaned file contents

Some interesting queries that I could think of at 12pm

  1. Load the output processed .csv file:
import pandas as pd
df = pd.read_csv('./output_data/processed_data.csv', sep='|',
                 index_col=0, dtype=str)
  1. All unique emails found in the documents:
import ast
emails = sum([ast.literal_eval(x) for x in df['found_emails']], [])
unique_emails = set(emails)
  1. Top 10 communicated domains in the documents:
from collections import Counter
domains = [x.split('@')[1] for x in emails]
d_counter = Counter(domains)
print(d_counter.most_common(10))
  1. Top 10 organizations mentioned in the documents:
orgs = sum([ast.literal_eval(x) for x in df['found_organizations']], [])
o_counter = Counter(orgs)
print(o_counter.most_common(10))
  1. Find IDs of documents that contain word "confidential" in them:
df['id'][df['abstract'].str.contains('confidential')]
  1. How many documents and categories there are in the dataset:
print(f'Total number of documents: {len(df)}')
print('Documents by category:')
df['category'].value_counts()

and I am sure you can be significantly more creative with this :)

How to generate output data

  1. Install dependencies with conda and switch to the environment:
conda env create -f environment.yml
conda activate ftm_hackathon

Alternatively (not tested), you can install packages to your current environment manually:

pip install spacy tqdm pandas bs4
  1. Download Dutch spacy model, ~500MB:
python -m spacy download nl_core_news_lg
  1. Put your raw .csv files into input_data folder.

  2. Run generate.py. On my 6yo laptop it takes ~17 minutes.

  3. The result will be written in output_data/processed_data.csv

Owner
ASReview hackathon for Follow the Money
ASReview hackathon for Follow the Money
This repo contains simple to use, pretrained/training-less models for speaker diarization.

PyDiar This repo contains simple to use, pretrained/training-less models for speaker diarization. Supported Models Binary Key Speaker Modeling Based o

12 Jan 20, 2022
L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources.

L3Cube-MahaCorpus L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual

21 Dec 17, 2022
Open-source offline translation library written in Python. Uses OpenNMT for translations

Open source neural machine translation in Python. Designed to be used either as a Python library or desktop application. Uses OpenNMT for translations and PyQt for GUI.

Argos Open Tech 1.6k Jan 01, 2023
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

Michael Gale 1 Jan 06, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin 史上训练最简单,音质最好的语音合成系统

AmorTX 12 Dec 14, 2022
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
تولید اسم های رندوم فینگیلیش

karafs کرفس تولید اسم های رندوم فینگیلیش installation ➜ pip install karafs usage دو زبانه ➜ karafs -n 10 توت فرنگی بی ناموس toot farangi-ye bi_namoos

Vaheed NÆINI (9E) 36 Nov 24, 2022
Tensorflow implementation of paper: Learning to Diagnose with LSTM Recurrent Neural Networks.

Multilabel time series classification with LSTM Tensorflow implementation of model discussed in the following paper: Learning to Diagnose with LSTM Re

Aaqib 552 Nov 28, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
String Gen + Word Checker

Creates random strings and checks if any of them are a real words. Mostly a waste of time ngl but it is cool to see it work and the fact that it can generate a real random word within10sec

1 Jan 06, 2022
Basic yet complete Machine Learning pipeline for NLP tasks

Basic yet complete Machine Learning pipeline for NLP tasks This repository accompanies the article on building basic yet complete ML pipelines for sol

Ivan 20 Aug 22, 2022
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
JaQuAD: Japanese Question Answering Dataset

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)

SkelterLabs 84 Dec 27, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Creating an LSTM model to generate music

Music-Generation Creating an LSTM model to generate music music-generator Used to create basic sin wave sounds music-ai Contains the functions to conv

Jerin Joseph 2 Dec 02, 2021
This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022
Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

Sentiment Analyzer The goal of this project is to perform sentiment analysis on textual data that people generally post on websites like social networ

Madhusudan.C.S 53 Mar 01, 2022
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022