Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

Overview


This repository provides utility scripts for retrieving and preprocessing Wikipedia text for natural language processing (NLP) research.

Note: The scripts have been created for and tested with the Japanese version of Wikipedia only.

Preprocessed files

Some of the preprocessed files generated by this repo's scripts can be downloaded from the Releases page.

All the preprocessed files are distributed under the CC-BY-SA 3.0 and GFDL licenses. For more information, see the License section below.

Example usage of the scripts

Get Wikipedia page ids from a Cirrussearch dump file

get_all_page_ids_from_cirrussearch.py

This script extracts the page ids and revision ids of all pages from a Wikipedia Cirrussearch dump file (available from the Wikimedia dumps site). It also adds the following information to each item based on the information in the dump file:

  • "num_inlinks": the number of incoming links to the page.
  • "is_disambiguation_page": whether the page is a disambiguation page.
  • "is_sexual_page": whether the page is labeled containing sexual contents.
  • "is_violent_page": whether the page is labeled containing violent contents.
$ python get_all_page_ids_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

# If you want the output file sorted by the page id:
$ cat ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json|jq -s -c 'sort_by(.pageid)[]' > ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json
$ mv ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

The script outputs a JSON Lines file containing the following items, one item per line:

{
    "title": "アンパサンド",
    "pageid": 5,
    "revid": 85364431,
    "num_inlinks": 231,
    "is_disambiguation_page": false,
    "is_sexual_page": false,
    "is_violent_page": false
}
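
The output file can be processed with the standard library alone. As a minimal sketch (the path and thresholds below are purely illustrative), the following keeps only the page ids of reasonably well-linked, non-disambiguation pages:

import json

# Illustrative path; adjust to where the script wrote its output.
page_ids_file = "page-ids-jawiki-20211129.json"

kept_page_ids = []
with open(page_ids_file, encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)  # one JSON object per line
        if page["num_inlinks"] >= 10 and not page["is_disambiguation_page"]:
            kept_page_ids.append(page["pageid"])

print(f"{len(kept_page_ids)} pages kept")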

Get Wikipedia page HTMLs

get_page_htmls.py

This script fetches the HTML contents of the Wikipedia pages specified by the page ids in the input file. It uses the Wikimedia REST API to access the contents of Wikipedia pages.

Important: Be sure to check the terms and conditions of the API documented on its official page. In particular, you may not send more than 200 requests/sec to the API. You should also set your contact information (e.g., an email address) in the User-Agent header so that Wikimedia can contact you quickly if necessary.

# It takes about 2 days to fetch all the articles in Japanese Wikipedia
$ python get_page_htmls.py \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--output_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--language ja \
--user_agent <your_contact_information> \
--batch_size 20 \
--mobile

# If you want the output file sorted by the page id:
$ zcat ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz|jq -s -c 'sort_by(.pageid)[]'|gzip > ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz
$ mv ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz

# Splitting the file for distribution
$ gunzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz
$ split -n l/5 --numeric-suffixes=1 --additional-suffix=.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.*.json
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json

The script outputs a gzipped JSON Lines file containing the following items, one item per line:

{
  "title": "アンパサンド",
  "pageid": 5,
  "revid": 85364431,
  "url": "https://ja.wikipedia.org/api/rest_v1/page/mobile-html/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89/85364431",
  "html": "
}
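
The url field is the Wikimedia REST API endpoint from which the page HTML was fetched. For reference, here is a minimal sketch of fetching a single page this way, assuming the requests library is installed and with contact information in the User-Agent header as the API terms require (the values below are taken from the example output and are purely illustrative):

import urllib.parse
import requests  # assumption: the requests library is installed

# Illustrative values from the example output above.
title = urllib.parse.quote("アンパサンド", safe="")
revid = 85364431
url = f"https://ja.wikipedia.org/api/rest_v1/page/mobile-html/{title}/{revid}"

# Contact information in the User-Agent header, as required by the API terms.
headers = {"User-Agent": "wikipedia-utils example (contact: you@example.com)"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
html = response.text  # the mobile-html rendering of the page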

Extract paragraphs from the Wikipedia page HTMLs

extract_paragraphs_from_page_htmls.py

This script extracts paragraph texts from the page HTMLs file generated by get_page_htmls.py. You can specify the minimum and maximum length of the paragraph texts to be extracted.

# This produces 8,921,367 paragraphs
$ python extract_paragraphs_from_page_htmls.py \
--page_htmls_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--min_paragraph_length 10 \
--max_paragraph_length 1000
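
At its core, the extraction is HTML parsing plus a length filter. A simplified sketch of the idea (not the script's exact logic), assuming BeautifulSoup is installed:

from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

def extract_paragraphs(html, min_length=10, max_length=1000):
    """Collect <p> texts from a page HTML, keeping only those within the length bounds."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if min_length <= len(text) <= max_length:
            paragraphs.append(text)
    return paragraphs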

Make a plain text corpus of Wikipedia paragraph/page texts

make_corpus_from_paragraphs.py

This script produces a plain text corpus file from a paragraphs file generated by extract_paragraphs_from_page_htmls.py. You can optionally filter out disambiguation/sexual/violent pages from the output file by specifying the corresponding command line options.

Here we use mecab-ipadic-NEologd when splitting texts into sentences so that certain named entities are not split across sentence boundaries.

The output file is a gzipped text file containing one sentence per line, with the pages separated by blank lines.

# 22,651,544 lines from all pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129.txt.gz \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000

# 18,721,087 lines from filtered pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-filtered-large.txt.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000 \
--min_inlinks 10 \
--exclude_sexual_pages
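
Because pages in the output corpus are separated by blank lines, the file can be read back page by page with a small helper; the sketch below (the file name is taken from the filtered example above) yields one list of sentences per page:

import gzip

def iter_pages(corpus_file):
    """Yield one list of sentences per page; pages are separated by blank lines."""
    sentences = []
    with gzip.open(corpus_file, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                sentences.append(line)
            elif sentences:
                yield sentences
                sentences = []
    if sentences:
        yield sentences

n_pages = sum(1 for _ in iter_pages("corpus-jawiki-20211129-filtered-large.txt.gz"))
print(f"{n_pages} pages")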

make_corpus_from_cirrussearch.py

This script produces a plain text corpus file by simply taking the text attributes of pages from a Wikipedia Cirrussearch dump file.

The resulting corpus file will be somewhat different from the one generated by make_corpus_from_paragraphs.py due to some differences in text processing. In addition, since the text attribute in the Cirrussearch dump file does not retain the page structure, there is less flexibility in how the text can be processed compared to processing HTML files with make_corpus_from_paragraphs.py.

$ python make_corpus_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-cirrus.txt.gz \
--min_inlinks 10 \
--exclude_sexual_pages \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7'
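
Since a Cirrussearch content dump is in Elasticsearch bulk format (a metadata line followed by the page document, repeated), taking the text attribute amounts to reading line pairs from the gzipped file. A rough sketch follows; the field names used here (text, title, incoming_links) reflect the Cirrussearch document format and should be treated as assumptions:

import gzip
import json

cirrus_file = "jawiki-20211129-cirrussearch-content.json.gz"  # illustrative path

with gzip.open(cirrus_file, "rt", encoding="utf-8") as f:
    for _metadata_line, content_line in zip(f, f):
        # The metadata line ({"index": ...}) is skipped; the next line is the page document.
        page = json.loads(content_line)
        title = page.get("title")
        text = page.get("text", "")           # plain text of the page body
        inlinks = page.get("incoming_links", 0)
        # ... filter by inlinks, split the text into sentences, and write out as needed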

Make a passages file from extracted paragraphs

make_passages_from_paragraphs.py

This script takes a paragraphs file generated by extract_paragraphs_from_page_htmls.py and splits the paragraph texts into a collection of shorter pieces of text called passages, whose unit can be a section, a paragraph, or a sentence.

It is useful for creating texts of a reasonable length that can be handled by passage-retrieval systems such as DPR.

# Make single passage from one paragraph
# 8,672,661 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--passage_unit paragraph \
--passage_boundary section \
--max_passage_length 400

# Make single passage from consecutive sentences within a section
# 5,170,346 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--passage_unit sentence \
--passage_boundary section \
--max_passage_length 400 \
--as_long_as_possible
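
With --as_long_as_possible, consecutive sentences are packed into a single passage as long as the result stays within --max_passage_length. The idea, in a simplified sketch (not the script's exact implementation):

def pack_sentences(sentences, max_passage_length=400):
    """Greedily merge consecutive sentences into passages no longer than max_passage_length."""
    passages, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_passage_length:
            passages.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        passages.append(current)
    return passages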

Build Elasticsearch indices of Wikipedia passages/pages

Requirements

  • Elasticsearch 6.x with several plugins installed
# For running build_es_index_passages.py
$ ./bin/elasticsearch-plugin install analysis-kuromoji

# For running build_es_index_cirrussearch.py (Elasticsearch 6.5.4 is needed)
$ ./bin/elasticsearch-plugin install analysis-icu
$ ./bin/elasticsearch-plugin install org.wikimedia.search:extra:6.5.4

build_es_index_passages.py

This script builds an Elasticsearch index of passages generated by make_passages_from_paragraphs.py.

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-para

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-c400
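
Building the index essentially means bulk-indexing the gzipped JSON Lines passages. A minimal sketch with the official Python client is shown below; the client version, connection settings, and the use of each whole passage object as the document source are assumptions, not the script's exact mapping:

import gzip
import json
from elasticsearch import Elasticsearch, helpers  # assumption: the elasticsearch-py 6.x client

es = Elasticsearch(["http://localhost:9200"])

def iter_actions(passages_file, index_name):
    with gzip.open(passages_file, "rt", encoding="utf-8") as f:
        for line in f:
            passage = json.loads(line)
            yield {"_index": index_name, "_type": "_doc", "_source": passage}

helpers.bulk(es, iter_actions("passages-para-jawiki-20211129.json.gz", "jawiki-20211129-para"))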

build_es_index_cirrussearch.py

This script builds an Elasticsearch index of Wikipedia pages using a Cirrussearch dump file. Cirrussearch dump files are originally intended for Elasticsearch bulk indexing, so this script simply takes the page information from the dump file to build an index.

$ python build_es_index_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--index_name jawiki-20211129-cirrus \
--language ja

License

The content of Wikipedia, which can be obtained with the scripts in this repository, is licensed under the CC-BY-SA 3.0 and GFDL licenses.

The code in this repository is licensed under the Apache License 2.0.
