A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences in comparable corpora that are close translations of each other. This makes it possible to harvest parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
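
If wget or tar is not available, the same two steps can be sketched in Python with only the standard library. This is an illustrative equivalent of the shell commands above, not part of Yalign itself; the helper names download_model and unpack_model are made up for this example, and the URL is the one given above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL of the English-to-Spanish model, as given in the shell command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download_model(url: str, archive_path: Path) -> Path:
    """Hypothetical helper: fetch the archive to archive_path (needs network)."""
    urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def unpack_model(archive_path: Path, dest_dir: Path) -> list:
    """Hypothetical helper: extract a .tar.gz into dest_dir, return member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(str(dest_dir), **kwargs)
    return names
```

Calling unpack_model(download_model(MODEL_URL, Path("en-es.tar.gz")), Path(".")) should then mirror the wget and tar commands, leaving the unpacked model in the current directory.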

Now we can use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
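
To align several page pairs from a script, one option is to call yalign-align as a subprocess. The sketch below assumes only the command-line form shown above (model directory followed by two documents); the function names are hypothetical, and it simply returns whatever the script writes to standard output.

```python
import subprocess


def build_align_command(model_dir, source_a, source_b):
    """Hypothetical helper: argument list matching the command shown above."""
    return ["yalign-align", model_dir, source_a, source_b]


def run_align(model_dir, source_a, source_b):
    """Hypothetical helper: run yalign-align and return its standard output.

    Assumes Yalign is installed and yalign-align is on the PATH; raises
    subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_align_command(model_dir, source_a, source_b),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_align("en-es", "http://en.wikipedia.org/wiki/Antiparticle", "http://es.wikipedia.org/wiki/Antipart%C3%ADcula") would run the same command as above.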

Yalign is not limited to any one language pair: by creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany

