Lingtrain Aligner: an ML-powered library for accurate text alignment.

Last update: Dec 14, 2022

Overview

Lingtrain Aligner

An ML-powered library for accurate alignment of texts in different languages.

Purpose

The main purpose of this alignment tool is to build parallel corpora from two or more raw texts in different languages. The texts should contain the same information (i.e., one text should be a translation of the other). For example, it can be Drei Kameraden by Remarque in German and Three Comrades, its translation into English.

Process

There are plenty of obstacles during the alignment process:

- The translator could translate several sentences as one.
- The translator could translate one sentence as many.
- There are some service marks in the text:
  - Page numbers
  - Chapters and other section headings
  - Author and title information
  - Notes

While service marks can be handled manually (the tool helps to detect them), the translation conflicts should be handled more carefully.

Lingtrain Aligner does almost all of the alignment work for you. It matches sentence pairs automatically using multilingual machine learning models, then searches for alignment conflicts and resolves them. The output is a parallel corpus, either as two separate plain text files or as a single merged file in the widely used TMX format.
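
To make the notion of an alignment conflict more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the library's actual algorithm: it assumes the automatic matching step has already produced a list of (source index, target index) pairs, and it treats any place where the two indexes do not both advance by exactly one as a conflict. The function name `find_conflicts` and the sample data are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (source sentence index, target sentence index)


def find_conflicts(pairs: List[Pair]) -> List[Tuple[Pair, Pair]]:
    """Return neighbouring pairs between which the alignment chain breaks.

    `pairs` is assumed to be sorted by source index. The chain is unbroken
    when both the source and the target index advance by exactly one.
    """
    conflicts = []
    for prev, curr in zip(pairs, pairs[1:]):
        src_step = curr[0] - prev[0]
        tgt_step = curr[1] - prev[1]
        if src_step != 1 or tgt_step != 1:
            conflicts.append((prev, curr))
    return conflicts


# Hypothetical matches: the translator merged source sentences 2 and 3
# into one target sentence, so source sentence 3 has no pair of its own.
matches = [(0, 0), (1, 1), (2, 2), (4, 3), (5, 4)]
conflicts = find_conflicts(matches)
print(conflicts)
```

Each reported conflict marks a small region that then has to be resolved, for example by merging neighbouring sentences on one side.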

Supported languages and models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that are used to calculate a distance between sentences. The list of supported languages depends on the selected backend model.

distiluse-base-multilingual-cased-v2
- more reliable and fast
- moderate weights size — 500MB
- supports 50+ languages
- full list of supported languages can be found in this paper
LaBSE (Language-agnostic BERT Sentence Embedding)
- can be used for rare languages
- pretty heavy weights — 1.8GB
- supports 100+ languages
- full list of supported languages can be found here
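
As a rough illustration of how embeddings can be compared, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors are made up for the example; real sentence embeddings produced by the models above have hundreds of dimensions. Cosine similarity is one common choice of measure and is an assumption here, not a statement about what the library uses internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for sentence embeddings (hypothetical values).
emb_source = [0.9, 0.1, 0.3]        # a sentence in the source language
emb_translation = [0.8, 0.2, 0.3]   # its translation
emb_unrelated = [0.1, 0.9, 0.2]     # an unrelated sentence

sim_translation = cosine_similarity(emb_source, emb_translation)
sim_unrelated = cosine_similarity(emb_source, emb_unrelated)

# The translation should score higher than the unrelated sentence.
print(sim_translation > sim_unrelated)
```

With a multilingual model, a sentence and its translation are expected to land close together in the vector space, which is what makes matching across languages possible.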

Profit

Parallel corpora can themselves be used as a resource for machine translation models or for linguistic research.
My personal goal for this project is to help people build parallel translated books for foreign language learning.

Comments

File Already Exists

I run docker pull lingtrain/aligner:v4, upload a text file, and...

After a warning like this one, nothing happens. It also pops up for any text file.

opened by puffofsmoke 1
Fix XML creation:
- prevent parent tag duplication for (langs, author, title)
- add tags for tmx export
- use 'direction' for splitting paragraphs
- do not use bs4 (generates incorrect xml), change to lxml
opened by BorisNA 0
An error when I use "splitter.split_by_sentences_wrapper", please help check the error

When I use "splitted_from = splitter.split_by_sentences_wrapper(text1_prepared, lang_from)", it returns a list.

But I see a conflict when inserting into SQLite. The specific error:

File "ling_test.py", line 36, in <module>
  aligner.fill_db(db_path, splitted_from, splitted_to)
File "lingtrain_aligner/aligner.py", line 498, in fill_db
  db.executemany("insert into languages(key, val) values(?,?)", [("from", lang_from), ("to", lang_to)])
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
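
One way this kind of error can arise, offered as an assumption rather than a confirmed diagnosis, is that a value of an unsupported type (for example, a list of sentences) ends up where a language code string is expected. The sketch below reproduces that situation with the standard `sqlite3` module only; the table layout mirrors the statement in the traceback, and the variable names are hypothetical. Depending on the Python version, the exception class is `sqlite3.InterfaceError` or `sqlite3.ProgrammingError`; both derive from `sqlite3.Error`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table languages(key text, val text)")

# A list passed by mistake where a language code such as "de" is expected.
lang_from = ["First sentence.", "Second sentence."]

try:
    db.executemany(
        "insert into languages(key, val) values(?,?)",
        [("from", lang_from)],
    )
    binding_failed = False
except sqlite3.Error as exc:
    # sqlite3 cannot bind a Python list to a "?" placeholder.
    binding_failed = True
    print(type(exc).__name__, exc)

# With plain strings the same statement succeeds.
db.executemany(
    "insert into languages(key, val) values(?,?)",
    [("from", "de"), ("to", "en")],
)
rows = db.execute("select key, val from languages order by key").fetchall()
print(rows)
```

If this is the cause, checking the order and types of the arguments passed to fill_db against the installed version of the library would be the first thing to try.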

opened by Amen-bang 5
Add text splitting into small parts
The current version ignores the H1-H5 headers added by the user. But when a book is translated, the text of chapter 1 remains the text of chapter 1 in the other language. You can use this fact to split a big text into small parts.

Next idea: try to split a big text into small blocks automatically. Select a few sentences from the original text (for example, 10 sentences) and, in a loop, try to find the matching block in the translated text.

You can use the following pseudocode:

    # take a block of 10 sentences from the original text
    left_array = original_sentences[100:110]
    scores = {}
    # slide a window of the same size over the translated text
    for i in range(50, 150):
        right_array_candidate = translated_sentences[i:i + 10]
        scores[i] = sum(cosine_similarity(left_array, right_array_candidate))
    # the window with the highest total similarity is the matching block
    right_start = get_index_with_max_value(scores)
    left_text_split_index = 100
    right_text_split_index = right_start
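
The block search idea described in this issue can be sketched as self-contained Python. This is only an illustration under stated assumptions: sentences are represented by toy two-dimensional vectors instead of real embeddings, blocks are compared position by position with cosine similarity, and the helper names (`block_score`, `find_matching_block`) are hypothetical and not part of the library.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def block_score(left: List[Vector], right: List[Vector]) -> float:
    """Sum of similarities between sentences at the same offset in each block."""
    return sum(cosine_similarity(l, r) for l, r in zip(left, right))


def find_matching_block(left_block: List[Vector], translated: List[Vector],
                        search_from: int, search_to: int) -> int:
    """Return the start index in `translated` of the best-scoring window."""
    size = len(left_block)
    last_start = min(search_to, len(translated) - size)
    best_start, best_score = search_from, float("-inf")
    for start in range(search_from, last_start + 1):
        score = block_score(left_block, translated[start:start + size])
        if score > best_score:
            best_start, best_score = start, score
    return best_start


def toy_embedding(i: int) -> Vector:
    # Each "sentence" gets a distinct direction on the unit circle.
    return [math.cos(0.3 * i), math.sin(0.3 * i)]


# Toy data: the translation has 3 extra sentences at the beginning,
# so original sentence j corresponds to translated sentence j + 3.
translated_sentences = [toy_embedding(i) for i in range(12)]
original_sentences = [toy_embedding(j + 3) for j in range(9)]

left_start = 2
left_block = original_sentences[left_start:left_start + 3]
right_start = find_matching_block(left_block, translated_sentences, 0, 9)
print(left_start, right_start)
```

The pair of start indexes gives one split point in each text; repeating the search at regular intervals would cut both texts into smaller blocks that can be aligned independently.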
opened by AigizK 0

Releases(0.1.0)

0.1.0(Apr 21, 2021)

The initial release. Already works. Does not have requirements yet.

Owner

Sergei Averkiev

Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.
