A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g., Europarl) for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, we need to download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now we can use the yalign-align script along with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
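Note that the Spanish URL above is percent-encoded ("í" becomes %C3%AD). As a minimal sketch, the same command line can be assembled from Python when article titles contain non-ASCII characters. The helper names wikipedia_url and build_align_command are hypothetical and not part of Yalign; the only thing taken from Yalign is the yalign-align invocation shown above.

```python
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Build a Wikipedia article URL, percent-encoding non-ASCII characters.

    Hypothetical helper for illustration; not part of Yalign.
    """
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def build_align_command(model_dir, url_a, url_b):
    """Return the argument list for the yalign-align call shown above.

    Hypothetical helper; it only mirrors the documented command line:
    yalign-align <model directory> <first page> <second page>
    """
    return ["yalign-align", model_dir, url_a, url_b]


command = build_align_command(
    "en-es",
    wikipedia_url("en", "Antiparticle"),
    wikipedia_url("es", "Antipartícula"),
)

# Print the command instead of running it, so the sketch works
# even when Yalign is not installed.
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once yalign-align is installed and the en-es model directory has been unpacked.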

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany
