Faster, modernized fork of the language identification tool langid.py

Overview

py3langid

py3langid is a fork of the standalone language identification tool langid.py by Marco Lui.

Original license: BSD-2-Clause. Fork license: BSD-3-Clause.

Changes in this fork

Execution speed has been improved and the code base has been optimized for Python 3.6+:

  • Loading the module with import is now about 10x faster
  • Language detection with langid.classify is now about 5x faster

For implementation details see this blog post: How to make language detection with langid.py faster.

Usage

Drop-in replacement

  1. Install the package:
    • pip3 install py3langid (or pip where applicable)
  2. Use it:
    • with Python: import py3langid as langid
    • on the command-line: langid

With Python

Basics:

>>> import py3langid as langid

>>> text = 'This text is in English.'
# identified language and probability
>>> langid.classify(text)
('en', -56.77428913116455)
# unpack the result tuple in variables
>>> lang, prob = langid.classify(text)
# all potential languages
>>> langid.rank(text)

More options:

>>> from py3langid.langid import LanguageIdentifier, MODEL_FILE

# subset of target languages
>>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
>>> identifier.set_languages(['de', 'en', 'fr'])
# this won't work well...
>>> identifier.classify('这样不好')
('en', -81.83166265487671)

# normalization of probabilities to an interval between 0 and 1
>>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
>>> identifier.classify('This should be enough text.'))
('en', 1.0)

Note: the Numpy data type for the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:

>>> langid.classify(text, datatype='uint32')

On the command-line

# basic usage with probability normalization
$ echo "This should be enough text." | langid -n
('en', 1.0)

# define a subset of target languages
$ echo "This won't be recognized properly." | langid -n -l fr,it,tr
('it', 0.9703832808613264)

Legacy documentation

The docs below are provided for reference, only part of the functions are currently tested and maintained.

Introduction

langid.py is a standalone Language Identification (LangID) tool.

The design principles are as follows:

  1. Fast
  2. Pre-trained over a large number of languages (currently 97)
  3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
  4. Single .py file with minimal dependencies
  5. Deployable as a web service

All that is required to run langid.py is Python >= 3.6 and numpy.

The accompanying training tools are still Python2-only.

langid.py is WSGI-compliant. langid.py will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

Usage

langid [options]
optional arguments:
-h, --help show this help message and exit
-s, --serve launch web service
--host=HOST host/ip to bind to
--port=PORT port to listen on
-v increase verbosity (repeat for greater effect)
-m MODEL load model from file
-l LANGS, --langs=LANGS
  comma-separated set of target ISO639 language codes (e.g en,de)
-r, --remote auto-detect IP address for remote access
-b, --batch specify a list of files on the command line
--demo launch an in-browser demo application
-d, --dist show full distribution over languages
-u URL, --url=URL
  langid of URL
--line process pipes line-by-line rather than as a document
-n, --normalize
  normalize confidence scores to probability values

The simplest way to use langid.py is as a command-line tool, and you can invoke using python langid.py. If you installed langid.py as a Python module (e.g. via pip install langid), you can invoke langid instead of python langid.py -n (the two are equivalent). This will cause a prompt to display. Enter text to identify, and hit enter:

>>> This is a test
('en', -54.41310358047485)
>>> Questa e una prova
('it', -35.41771221160889)

langid.py can also detect when the input is redirected (only tested under Linux), and in this case will process until EOF rather than until newline like in interactive mode:

python langid.py < README.rst
('en', -22552.496054649353)

The value returned is the unnormalized probability estimate for the language. Calculating the exact probability estimate is disabled by default, but can be enabled through a flag:

python langid.py -n < README.rst
('en', 1.0)

More details are provided in this README in the section on Probability Normalization.

You can also use langid.py as a Python library:

# python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)

Finally, langid.py can use Python's built-in wsgiref.simple_server (or fapws3 if available) to provide language identification as a web service. To do this, launch python langid.py -s, and access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed with no data, a simple HTML forms interface is displayed.

The response is generated in JSON, here is an example:

{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service:

# curl -d "q=This is a test" localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT:

# curl -T readme.rst localhost:9008/detect
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
{"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, langid.py will interpret the entire file as a single query. This allows for redirection via curl:

# echo "This is a test" | curl -d @- localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

langid.py will attempt to discover the host IP address automatically. Often, this is set to localhost(127.0.1.1), even though the machine has a different external IP address. langid.py can attempt to automatically discover the external IP address. To enable this functionality, start langid.py with the -r flag.

langid.py supports constraining of the output language set using the -l flag and a comma-separated list of ISO639-1 language codes (the -n flag enables probability normalization):

# python langid.py -n -l it,fr
>>> Io non parlo italiano
('it', 0.99999999988965627)
>>> Je ne parle pas français
('fr', 1.0)
>>> I don't speak english
('it', 0.92210605672341062)

When using langid.py as a library, the set_languages method can be used to constrain the language set:

python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("I do not speak english")
('en', 0.57133487679900674)
>>> langid.set_languages(['de','fr','it'])
>>> langid.classify("I do not speak english")
('it', 0.99999835791478453)
>>> langid.set_languages(['en','it'])
>>> langid.classify("I do not speak english")
('en', 0.99176190378750373)

Batch Mode

langid.py supports batch mode processing, which can be invoked with the -b flag. In this mode, langid.py reads a list of paths to files to classify as arguments. If no arguments are supplied, langid.py reads the list of paths from stdin, this is useful for using langid.py with UNIX utilities such as find.

In batch mode, langid.py uses multiprocessing to invoke multiple instances of the classifier, utilizing all available CPUs to classify documents in parallel.

Probability Normalization

The probabilistic model implemented by langid.py involves the multiplication of a large number of probabilities. For computational reasons, the actual calculations are implemented in the log-probability space (a common numerical technique for dealing with vanishingly small probabilities). One side-effect of this is that it is not necessary to compute a full probability in order to determine the most probable language in a set of candidate languages. However, users sometimes find it helpful to have a "confidence" score for the probability prediction. Thus, langid.py implements a re-normalization that produces an output in the 0-1 range.

langid.py disables probability normalization by default. For command-line usages of langid.py, it can be enabled by passing the -n flag. For probability normalization in library use, the user must instantiate their own LanguageIdentifier. An example of such usage is as follows:

>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
>> identifier.classify("This is a test")
('en', 0.9999999909903544)

Training a model

So far Python 2.7 only, see the original instructions.

Read more

langid.py is based on published research. [1] describes the LD feature selection technique in detail, and [2] provides more detail about the module langid.py itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553—561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea. Available from www.aclweb.org/anthology/P12-3005

Comments
  • Normalized probabilities: only 1.0 in output values

    Normalized probabilities: only 1.0 in output values

    Hi Adrien,

    I am currently testing py3langid and I noticed something strange: the normalized probability values in the output are systematically 1.0. I tested texts of different lengths (1 word to several paragraphs) in different languages. I'm using it with Python. Is this something you noticed before?

    Thanks, Aleksandra

    bug 
    opened by aleksandra-miletic 2
  • Sourcery refactored master branch

    Sourcery refactored master branch

    Branch master refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the master branch, then run:

    git fetch origin sourcery/master
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 1
  • Python3 branch (Sourcery refactored)

    Python3 branch (Sourcery refactored)

    Pull Request #2 refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    NOTE: As code is pushed to the original Pull Request, Sourcery will re-run and update (force-push) this Pull Request with new refactorings as necessary. If Sourcery finds no refactorings at any point, this Pull Request will be closed automatically.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the python3 branch, then run:

    git fetch origin sourcery/python3
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 1
  • Sourcery refactored master branch

    Sourcery refactored master branch

    Branch master refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the master branch, then run:

    git fetch origin sourcery/master
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 1
  • Sourcery refactored master branch

    Sourcery refactored master branch

    Branch master refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the master branch, then run:

    git fetch origin sourcery/master
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 1
  • Sourcery refactored master branch

    Sourcery refactored master branch

    Branch master refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the master branch, then run:

    git fetch origin sourcery/master
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 1
  • Sourcery refactored master branch

    Sourcery refactored master branch

    Branch master refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the master branch, then run:

    git fetch origin sourcery/master
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 0
Releases(v0.2.2)
  • v0.2.2(Jun 14, 2022)

    • Fixed bug in probability normalization (#6)
    • Fully implemented data type argument in classify()
    • Adapted training scripts to Python3 (untested)

    Full Changelog: https://github.com/adbar/py3langid/compare/v0.2.1...v0.2.2

    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Mar 29, 2022)

  • v0.2.0(Nov 29, 2021)

    • Change Numpy data type for features (uint32uint16)
    • Code cleaning

    Full Changelog: https://github.com/adbar/py3langid/compare/v0.1.2...v0.2.0

    Source code(tar.gz)
    Source code(zip)
  • v0.1.2(Nov 24, 2021)

    • Include data in non-wheel package versions
    • Faster module loading
    • Extended tests and readme

    Full Changelog: https://github.com/adbar/py3langid/compare/v0.1.0...v0.1.2

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Nov 23, 2021)

Owner
Adrien Barbaresi
Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.
Adrien Barbaresi
Official PyTorch implementation of "Dual Path Learning for Domain Adaptation of Semantic Segmentation".

Dual Path Learning for Domain Adaptation of Semantic Segmentation Official PyTorch implementation of "Dual Path Learning for Domain Adaptation of Sema

27 Dec 22, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
CMeEE 数据集医学实体抽取

医学实体抽取_GlobalPointer_torch 介绍 思想来自于苏神 GlobalPointer,原始版本是基于keras实现的,模型结构实现参考现有 pytorch 复现代码【感谢!】,基于torch百分百复现苏神原始效果。 数据集 中文医学命名实体数据集 点这里申请,很简单,共包含九类医学

85 Dec 28, 2022
This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

Sachit Yadav 1 Feb 11, 2022
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Moment-DETR QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries Jie Lei, Tamara L. Berg, Mohit Bansal For dataset de

Jie Lei 雷杰 133 Dec 22, 2022
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

606 Dec 28, 2022
CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT

Microsoft 1k Jan 03, 2023
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
MASS: Masked Sequence to Sequence Pre-training for Language Generation

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Microsoft 1.1k Dec 17, 2022
This is the offline-training-pipeline for our project.

offline-training-pipeline This is the offline-training-pipeline for our project. We adopt the offline training and online prediction Machine Learning

0 Apr 22, 2022
DeepPavlov Tutorials

DeepPavlov tutorials DeepPavlov: Sentence Classification with Word Embeddings DeepPavlov: Transfer Learning with BERT. Classification, Tagging, QA, Ze

Neural Networks and Deep Learning lab, MIPT 28 Sep 13, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 2.2k Jan 03, 2023
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022
A Python/Pytorch app for easily synthesising human voices

Voice Cloning App A Python/Pytorch app for easily synthesising human voices Documentation Discord Server Video guide Voice Sharing Hub FAQ's System Re

Ben Andrew 840 Jan 04, 2023
Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022