Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

Overview

Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

This repository maintains some utility scripts for retrieving and preprocessing Wikipedia text for natural language processing (NLP) research.

Note: The scripts have been created for and tested with the Japanese version of Wikipedia only.

Preprocessed files

Some of the preprocessed files generated by this repo's scripts can be downloaded from the Releases page.

All the preprocessed files are distributed under the CC-BY-SA 3.0 and GFDL licenses. For more information, see the License section below.

Example usage of the scripts

Get Wikipedia page ids from a Cirrussearch dump file

get_all_page_ids_from_cirrussearch.py

This script extracts the page ids and revision ids of all pages from a Wikipedia Cirrussearch dump file (available from this site.) It also adds the following information to each item based on the information in the dump file:

  • "num_inlinks": the number of incoming links to the page.
  • "is_disambiguation_page": whether the page is a disambiguation page.
  • "is_sexual_page": whether the page is labeled containing sexual contents.
  • "is_violent_page": whether the page is labeled containing violent contents.
$ python get_all_page_ids_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

# If you want the output file sorted by the page id:
$ cat ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json|jq -s -c 'sort_by(.pageid)[]' > ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json
$ mv ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

The script outputs a JSON Lines file containing following items, one item per line:

{
    "title": "アンパサンド",
    "pageid": 5,
    "revid": 85364431,
    "num_inlinks": 231,
    "is_disambiguation_page": false,
    "is_sexual_page": false,
    "is_violent_page": false
}

Get Wikipedia page HTMLs

get_page_htmls.py

This script fetches HTML contents of the Wikipedia pages specified by the page ids in the input file. It makes use of Wikimedia REST API to accsess the contents of Wikipedia pages.

Important: Be sure to check the terms and conditions of the API documented in the official page. Especially, you may not send more than 200 requests/sec to the API. You should also set your contact information (e.g., email address) in the User-Agent header so that Wikimedia can contact you quickly if necessary.

# It takes about 2 days to fetch all the articles in Japanese Wikipedia
$ python get_page_htmls.py \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--output_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--language ja \
--user_agent <your_contact_information> \
--batch_size 20 \
--mobile

# If you want the output file sorted by the page id:
$ zcat ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz|jq -s -c 'sort_by(.pageid)[]'|gzip > ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz
$ mv ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz

# Splitting the file for distribution
$ gunzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz
$ split -n l/5 --numeric-suffixes=1 --additional-suffix=.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.*.json
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json

The script outputs a gzipped JSON Lines file containing following items, one item per line:

{
  "title": "アンパサンド",
  "pageid": 5,
  "revid": 85364431,
  "url": "https://ja.wikipedia.org/api/rest_v1/page/mobile-html/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89/85364431",
  "html": "
}

Extract paragraphs from the Wikipedia page HTMLs

extract_paragraphs_from_page_htmls.py

This script extracts paragraph texts from a Wikipedia page HTMLs file generated by get_page_htmls.py. You can specify the minimum and maximum length of the paragraph texts to be extracted.

# This produces 8,921,367 paragraphs
$ python extract_paragraphs_from_page_htmls.py \
--page_htmls_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--min_paragraph_length 10 \
--max_paragraph_length 1000

Make a plain text corpus of Wikipedia paragraph/page texts

make_corpus_from_paragraphs.py

This script produces a plain text corpus file from a paragraphs file generated by extract_paragraphs_from_page_htmls.py. You can optionally filter out disambiguation/sexual/violent pages from the output file by specifying the corresponding command line options.

Here we use mecab-ipadic-NEologd in splitting texts into sentences so that some sort of named entities will not be split into sentences.

The output file is a gzipped text file containing one sentence per line, with the pages separated by blank lines.

# 22,651,544 lines from all pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129.txt.gz \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000

# 18,721,087 lines from filtered pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-filtered-large.txt.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000 \
--min_inlinks 10 \
--exclude_sexual_pages

make_corpus_from_cirrussearch.py

This script produces a plain text corpus file by simply taking the text attributes of pages from a Wikipedia Cirrussearch dump file.

The resulting corpus file will be somewhat different from the one generated by make_corpus_from_paragraphs.py due to some differences in text processing. In addition, since the text attributes in the Cirrussearch dump file does not retain the page structure, it is less flexible to modify the processing of text compared to processing an HTML file with make_corpus_from_paragraphs.py.

$ python make_corpus_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-cirrus.txt.gz \
--min_inlinks 10 \
--exclude_sexual_pages \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7'

Make a passages file from extracted paragraphs

make_passages_from_paragraphs.py

This script takes a paragraphs file generated by extract_paragraphs_from_page_htmls.py and splits the paragraph texts into a collection of pieces of texts called passages (sections/paragraphs/sentences).

It is useful for creating texts of a reasonable length that can be handled by passage-retrieval systems such as DPR.

# Make single passage from one paragraph
# 8,672,661 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--passage_unit paragraph \
--passage_boundary section \
--max_passage_length 400

# Make single passage from consecutive sentences within a section
# 5,170,346 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--passage_unit sentence \
--passage_boundary section \
--max_passage_length 400 \
--as_long_as_possible

Build Elasticsearch indices of Wikipedia passages/pages

Requirements

  • Elasticsearch 6.x with several plugins installed
# For running build_es_index_passages.py
$ ./bin/elasticsearch-plugin install analysis-kuromoji

# For running build_es_index_cirrussearch.py (Elasticsearch 6.5.4 is needed)
$ ./bin/elasticsearch-plugin install analysis-icu
$ ./bin/elasticsearch-plugin install org.wikimedia.search:extra:6.5.4

build_es_index_passages.py

This script builds an Elasticsearch index of passages generated by make_passages_from_paragraphs.

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-para

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-c400

build_es_index_cirrussearch.py

This script builds an Elasticsearch index of Wikipedia pages using a Cirrussearch dump file. Cirrussearch dump files are originally for Elasticsearch bulk indexing, so this script simply takes the page information from the dump file to build an index.

$ python build_es_index_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--index_name jawiki-20211129-cirrus \
--language ja

License

The content of Wikipedia, which can be obtained with the codes in this repository, is licensed under the CC-BY-SA 3.0 and GFDL licenses.

The codes in this repository are licensed under the Apache License 2.0.

You might also like...
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Lingtrain Aligner — ML powered library for the accurate texts alignment.
Lingtrain Aligner — ML powered library for the accurate texts alignment.

Lingtrain Aligner ML powered library for the accurate texts alignment in different languages. Purpose Main purpose of this alignment tool is to build

Augmenty is an augmentation library based on spaCy for augmenting texts.
Augmenty is an augmentation library based on spaCy for augmenting texts.

Augmenty: The cherry on top of your NLP pipeline Augmenty is an augmentation library based on spaCy for augmenting texts. Besides a wide array of high

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

This library is testing the ethics of language models by using natural adversarial texts.
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

Biterm Topic Model (BTM): modeling topics in short texts
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Text Classification in Turkish Texts with Bert
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

Releases(2022-04-04)
Owner
Masatoshi Suzuki
Masatoshi Suzuki
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021.

capbot-siic Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021. Problem Inspiration A plethora

Aryan Kargwal 19 Feb 17, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

ProphetNet-X This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained language model call

Microsoft 394 Dec 17, 2022
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023
Semi-automated vocabulary generation from semantic vector models

vec2word Semi-automated vocabulary generation from semantic vector models This script generates a list of potential conlang word forms along with asso

9 Nov 25, 2022
Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

Wordle_Bot Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time. It will log onto the wordle website and en

Lucas Polidori 15 Dec 11, 2022
NLP - Machine learning

Flipkart-product-reviews NLP - Machine learning About Product reviews is an essential part of an online store like Flipkart’s branding and marketing.

Harshith VH 1 Oct 29, 2021
Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Yoon Kim 43 Dec 23, 2022
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022
String Gen + Word Checker

Creates random strings and checks if any of them are a real words. Mostly a waste of time ngl but it is cool to see it work and the fact that it can generate a real random word within10sec

1 Jan 06, 2022
Official Stanford NLP Python Library for Many Human Languages

Official Stanford NLP Python Library for Many Human Languages

Stanford NLP 6.4k Jan 02, 2023
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
List of GSoC organisations with number of times they have been selected.

Welcome to GSoC Organisation Frequency And Details 👋 List of GSoC organisations with number of times they have been selected, techonologies, topics,

Shivam Kumar Jha 41 Oct 01, 2022