A sentence aligner for comparable corpora

Related tags

Text Data & NLPyalign
Overview

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (eg.. europarl) for training translation models. However these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that you can install Yalign from PyPi via pip:

sudo pip install yalign

Usage

Firstly we need to download and unpack the english to spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now we can use the yalign-align script along with the english to spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use yalign and on yalign's implementation please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine
Gonzalo García Berrotarán
Rafael Carrascosa
Elías Andrawos
Laura Alonso Alemany
Owner
Machinalis
Machinalis
AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022
Turn clang-tidy warnings and fixes to comments in your pull request

clang-tidy pull request comments A GitHub Action to post clang-tidy warnings and suggestions as review comments on your pull request. What platisd/cla

Dimitris Platis 30 Dec 13, 2022
Dope Wars game engine on StarkNet L2 roll-up

RYO Dope Wars game engine on StarkNet L2 roll-up. What TI-83 drug wars built as smart contract system. Background mechanism design notion here. Initia

104 Dec 04, 2022
Kestrel Threat Hunting Language

Kestrel Threat Hunting Language What is Kestrel? Why we need it? How to hunt with XDR support? What is the science behind it? You can find all the ans

Open Cybersecurity Alliance 201 Dec 16, 2022
Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

Dirk Neuhäuser 4 Apr 06, 2022
The tool to make NLP datasets ready to use

chazutsu photo from Kaikado, traditional Japanese chazutsu maker chazutsu is the dataset downloader for NLP. import chazutsu r = chazutsu.data

chakki 243 Dec 29, 2022
Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. @inproceedings{tedes

Babelscape 40 Dec 11, 2022
Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics. Jury offers a smooth and easy-to-use interface. It uses datasets for underlying metric computa

Open Business Software Solutions 129 Jan 06, 2023
HAN2HAN : Hangul Font Generation

HAN2HAN : Hangul Font Generation

Changwoo Lee 36 Dec 28, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
:P Some basic stuff I'm gonna use for my upcoming Agile Software Development and Devops

reverse-image-search-py bash script.sh img_name.jpg Requirements pip install requests pip install pyshorteners Dry run [ Sudhanva M 3 Dec 18, 2021

A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Visualize, analyze, and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BER

Jay Alammar 1.6k Dec 25, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022
Open-source offline translation library written in Python. Uses OpenNMT for translations

Open source neural machine translation in Python. Designed to be used either as a Python library or desktop application. Uses OpenNMT for translations and PyQt for GUI.

Argos Open Tech 1.6k Jan 01, 2023
Japanese NLP Library

Japanese NLP Library Back to Home Contents 1 Requirements 1.1 Links 1.2 Install 1.3 History 2 Libraries and Modules 2.1 Tokenize jTokenize.py 2.2 Cabo

Pulkit Kathuria 144 Dec 27, 2022