A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

IITB-English-Hindi Parallel Corpus

License: CC BY-NC 4.0

About

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenization; the segmented corpus can then be used to train an English-Hindi MT system.

The IIT Bombay English-Hindi corpus contains a parallel corpus for English-Hindi as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

The complete details of this corpus are available at this URL. The parallel corpus and a monolingual Hindi corpus are also available for browser download from the same URL.

Recent Updates

  • Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
  • Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.

Usage

You need the 'datasets' package installed to use the 🚀 HuggingFace datasets repository. Install it via pip with the following command:

   pip install datasets
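
Once the package is installed, loading the corpus might look like the sketch below. This is not taken from the notebook: the dataset identifier `cfilt/iitb-english-hindi` and the record layout `{"translation": {"en": ..., "hi": ...}}` are assumptions about how the corpus is published on the HuggingFace hub, and `to_pairs` and `load_iitb` are hypothetical helper names introduced for illustration.

```python
def to_pairs(records, src="en", tgt="hi"):
    """Yield (source, target) sentence pairs.

    Assumes each record looks like {"translation": {"en": "...", "hi": "..."}}.
    """
    for record in records:
        translation = record["translation"]
        yield translation[src], translation[tgt]


def load_iitb(split="train"):
    """Load one split of the corpus from the HuggingFace hub.

    The dataset identifier below is an assumption, not stated in this README.
    The import is done lazily so that to_pairs() works without the package.
    """
    from datasets import load_dataset

    return load_dataset("cfilt/iitb-english-hindi", split=split)


# Example usage (downloads the corpus on first call):
#   corpus = load_iitb("train")
#   for english, hindi in to_pairs(corpus):
#       ...
```

Keeping the pair extraction separate from the download makes it easy to try the helper on a few hand-written records before fetching the full corpus.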

In the notebook, we also provide the code to create a byte-pair encoding (BPE) segmented version of this corpus. You can tokenize the corpus as shown in the notebook, or use any other tokenization that supports the Hindi language.
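
To illustrate what BPE segmentation does, here is a minimal pure-Python sketch of learning and applying merge rules. It is a toy learner written for this explanation, not the notebook's implementation; the function names `learn_bpe`, `merge_pair`, and `segment` and the `</w>` end-of-word marker are choices made for this example. Words are split into Unicode code points, so Hindi vowel signs start out as separate symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-tokenized sentences."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment one word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For real training runs, an established subword tool is a better choice than this sketch; the point here is only that frequent character sequences are merged into larger units while rare words stay split into smaller pieces.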

Other

You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog

Citation

If you use this corpus or its derivative resources for your research, kindly cite it as follows: Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

BibTeX Citation

@inproceedings{kunchukuttan-etal-2018-iit,
    title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
    author = "Kunchukuttan, Anoop  and
      Mehta, Pratik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1548",
}

Owner

Computation for Indian Language Technology (CFILT)
NLP Resources and Codebases released by the Computation for Indian Language Technology Lab @ IIT Bombay