Chinese segmentation library

Last update: Jun 28, 2022

Overview

What is loso?

loso is a Chinese segmentation system written in Python. It was developed by Victor Lin for Plurk Inc.

Copyright & License

Setup loso

To install loso, clone the repository and run the following commands:

cd loso
python setup.py develop

You also need a running Redis server, which stores the lexicon database, and your own copy of the configuration template, which you can then modify:

cp default.yaml myconf.yaml
vim myconf.yaml

To use your configuration, set the LOSO_CONFIG_FILE environment variable to the path of your configuration file. For example:

LOSO_CONFIG_FILE=myconf.yaml python setup.py serve
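
As an illustration of the environment-variable convention described above, here is a minimal sketch of how a program might pick its configuration file. The function name resolve_config_path and the fallback to default.yaml are assumptions made for this example; they are not taken from loso's source.

```python
import os

# Assumed fallback for illustration only; loso's real default may differ.
DEFAULT_CONFIG = "default.yaml"


def resolve_config_path(environ=None):
    """Return the config path named by LOSO_CONFIG_FILE, or the fallback."""
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the assumed default.
    return environ.get("LOSO_CONFIG_FILE") or DEFAULT_CONFIG


if __name__ == "__main__":
    print(resolve_config_path())
```

Passing the environment as a parameter keeps the lookup easy to test without changing the real process environment.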

Use loso

Loso segments text according to its lexicon database, using an algorithm based on a Hidden Markov Model. The service therefore cannot be used until a lexicon database has been built.
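
To show why segmentation depends on the lexicon, here is a minimal sketch of lexicon-driven segmentation. It is not loso's algorithm: instead of a Hidden Markov Model it uses a simpler unigram model with dynamic programming, and the function name segment, the max_len limit, and the fallback count for unknown characters are all assumptions made for this example.

```python
import math


def segment(text, lexicon, max_len=4):
    """Split text into the most probable sequence of lexicon terms.

    lexicon maps term -> count. Unknown single characters receive a small
    fallback count so that any input can still be segmented.
    """
    total = float(sum(lexicon.values())) or 1.0
    # best[i] holds (score, start) for the best segmentation of text[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            term = text[start:end]
            count = lexicon.get(term)
            if count is None:
                if len(term) > 1:
                    continue  # unknown multi-character terms are skipped
                count = 0.5  # assumed fallback for unknown single characters
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the terms.
    terms = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        terms.append(text[start:end])
        end = start
    return terms[::-1]


if __name__ == "__main__":
    toy_lexicon = {"太空梭": 5, "太空": 3, "發射": 8, "影片": 6}
    print(" ".join(segment("太空梭發射影片", toy_lexicon)))
```

With an empty lexicon the sketch can only fall back to single characters, which mirrors the point above: useful segmentation requires a lexicon to be built first.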

To feed a text file into the database, run:

python setup.py feed -f /home/victorlin/plurk_src/realtime_search/word_segment/sample_data/sample_tr_ch
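
As a rough picture of what feeding text into a lexicon could involve, the sketch below counts character n-grams in runs of Chinese characters. This is an assumption about the general approach, not a description of loso's feed command: the name count_ngrams, the max_n limit, and the use of an in-memory Counter in place of Redis are all hypothetical.

```python
import re
from collections import Counter

# Runs of characters in the basic CJK Unified Ideographs range.
CJK_RUN = re.compile("[\u4e00-\u9fff]+")


def count_ngrams(text, max_n=4):
    """Count every n-gram (1 <= n <= max_n) inside each run of CJK text."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    counts = count_ngrams("發射影片，發射", max_n=2)
    for term, count in counts.most_common(3):
        print(term, count)
```

Punctuation such as the full-width comma breaks the text into separate runs, so no n-gram spans a punctuation mark.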

To clear the database, run:

python setup.py reset

To test term splitting interactively, run:

python setup.py interact

For example:

Text: 留下鉅細靡遺的太空梭發射影片，供世人回味
....
留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味

To run the segmentation service as an XML-RPC service, run:

python setup.py serve

The following simple Python program shows how to use it:

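# -*- coding: utf-8 -*-
try:
    import xmlrpclib  # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib  # Python 3

# Connect to the server started with "python setup.py serve"
proxy = xmlrpclib.ServerProxy("http://localhost:5566/")

# splitTerms returns the segmented terms as a list
terms = proxy.splitTerms(u'留下鉅細靡遺的太空梭發射影片，供世人回味')
print(' '.join(terms))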

The output should be:

留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味
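
If you want to exercise client code without a running loso server and Redis instance, one option is a throwaway stub server. The sketch below is only a stand-in: split_terms_stub splits on whitespace and performs no real segmentation, and the helper names are hypothetical. Only the method name splitTerms comes from the example above.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def split_terms_stub(text):
    # Stand-in only: splits on whitespace, no real segmentation happens.
    return text.split()


def start_stub_server():
    """Start a stub XML-RPC server on a free local port in a daemon thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(split_terms_stub, "splitTerms")
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server


def demo():
    server = start_stub_server()
    host, port = server.server_address
    proxy = ServerProxy("http://%s:%d/" % (host, port))
    try:
        return proxy.splitTerms("留下 鉅細靡遺 的 太空梭")
    finally:
        # Stop the serving loop and release the socket.
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(" ".join(demo()))
```

Binding to port 0 lets the operating system choose a free port, so the stub does not collide with a real service on port 5566.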

Owner

Fang-Pen Lin
