Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution to the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool
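
As a rough illustration of what pattern-based quote extraction can look like, here is a minimal sketch. It is not the code in regex_pipeline/: the pattern, the list of reporting verbs and the function name `extract_quotes` are all assumptions made for this example, and real articles need far more patterns than this one.

```python
import re

# Hypothetical pattern (not taken from regex_pipeline/): a quotation in
# straight or curly double quotes, followed by a reporting verb and a
# capitalised speaker name, e.g. "We are ready," said Jane Doe.
QUOTE_PATTERN = re.compile(
    r'["“](?P<quote>[^"”]+)["”]\s*,?\s*'
    r'(?P<verb>said|says|told|added)\s+'
    r'(?P<speaker>[A-Z][\w\-]*(?:\s+[A-Z][\w\-]*)*)'
)


def extract_quotes(text):
    """Return one dict per matched quote with its verb and speaker."""
    results = []
    for match in QUOTE_PATTERN.finditer(text):
        results.append(
            {
                # Drop the trailing comma that often sits inside the quote marks.
                "quote": match.group("quote").strip().rstrip(","),
                "verb": match.group("verb"),
                "speaker": match.group("speaker"),
            }
        )
    return results


if __name__ == "__main__":
    sample = '"We are ready," said Jane Doe. The minister left.'
    for item in extract_quotes(sample):
        print(item)
```

A pattern like this only catches the "quote, verb, speaker" order and misses reversed attributions, pronoun speakers and quotes split across sentences.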

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rule-based coreferencing tool
schema/ – data output schema and example data

Each folder contains its own README file with instructions to set up and run that piece of work.

Owner

Journalism AI collab 2021
