a chinese segment base on crf

Related tags

Text Data & NLPgenius
Overview

Genius

Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。

Feature

  • 支持python2.x、python3.x以及pypy2.x。
  • 支持简单的pinyin分词
  • 支持用户自定义break
  • 支持用户自定义合并词典
  • 支持词性标注

Source Install

  • 安装git: 1) ubuntu or debian apt-get install git 2) fedora or redhat yum install git
  • 下载代码:git clone https://github.com/duanhongyi/genius.git
  • 安装代码:python setup.py install

Pypi Install

  • 执行命令:easy_install genius或者pip install genius

Algorithm

  • 采用trie树进行合并词典查找
  • 基于wapiti实现条件随机场分词
  • 可以通过genius.loader.ResourceLoader来重载默认的字典

功能 1):分词genius.seg_text方法

  • genius.seg_text函数接受5个参数,其中text是必填参数:
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_combine代表是否使用字典进行词合并,默认值False
  • use_tagging代表是否进行词性标注,默认值True
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值True

代码示例( 全功能分词 )

#encoding=utf-8
import genius
text = u"""昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。"""
seg_list = genius.seg_text(
    text,
    use_combine=True,
    use_pinyin_segment=True,
    use_tagging=True,
    use_break=True
)
print('\n'.join(['%s\t%s' % (word.text, word.tagging) for word in seg_list]))

功能 2):面向索引分词

  • genius.seg_keywords方法专门为搜索引擎索引准备,保留歧义分割,其中text是必填参数。
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_tagging代表是否进行词性标注,默认值False
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值False
  • 由于合并操作与此方法有意义上的冲突,此方法并不提供合并功能;并且如果采用此方法做索引时候,检索时不推荐genius.seg_text使用use_combine=True参数。

代码示例

#encoding=utf-8
import genius

seg_list = genius.seg_keywords(u'南京市长江大桥')
print('\n'.join([word.text for word in seg_list]))

功能 3):关键词提取

  • genius.extract_tag方法专门为提取tag关键字准备,其中text是必填参数。
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_combine代表是否使用字典进行词合并,默认值False
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值False

代码示例

#encoding=utf-8
import genius

tag_list = genius.extract_tag(u'南京市长江大桥')
print('\n'.join(tag_list))

其他说明 4):

  • 目前分词语料出自人民日报1998年1月份,所以对于新闻类文章分词较为准确。
  • CRF分词效果很大程度上依赖于训练语料的类别以及覆盖度,若解决语料问题分词和标注效果还有很大的提升空间。
Owner
duanhongyi
a simple programmer.
duanhongyi
Global Rhythm Style Transfer Without Text Transcriptions

Global Prosody Style Transfer Without Text Transcriptions This repository provides a PyTorch implementation of AutoPST, which enables unsupervised glo

Kaizhi Qian 193 Dec 30, 2022
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Francesco Pham 94 Dec 25, 2022
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
숭실대학교 컴퓨터학부 전공종합설계프로젝트

✨ 시각장애인을 위한 버스도착 알림 장치 ✨ 👀 개요 현대 사회에서 대중교통 위치 정보를 이용하여 사람들이 간단하게 이용할 대중교통의 정보를 얻고 쉽게 대중교통을 이용할 수 있다. 해당 정보는 각종 어플리케이션과 대중교통 이용시설에서 위치 정보를 제공하고 있지만 시각

taegyun 3 Jan 25, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

wav2vec-toolkit A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models This repository accompanies the

Anton Lozhkov 29 Oct 23, 2022
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
A Fast Command Analyser based on Dict and Pydantic

Alconna Alconna 隶属于ArcletProject, 在Cesloi内有内置 Alconna 是 Cesloi-CommandAnalysis 的高级版,支持解析消息链 一般情况下请当作简易的消息链解析器/命令解析器 文档 暂时的文档 Example from arclet.alcon

19 Jan 03, 2023
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
A python package to fine-tune transformer-based models for named entity recognition (NER).

nerblackbox A python package to fine-tune transformer-based language models for named entity recognition (NER). Resources Source Code: https://github.

Felix Stollenwerk 13 Jul 30, 2022
Textlesslib - Library for Textless Spoken Language Processing

textlesslib Textless NLP is an active area of research that aims to extend NLP t

Meta Research 379 Dec 27, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
Long text token classification using LongFormer

Long text token classification using LongFormer

abhishek thakur 161 Aug 07, 2022
Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Persian Lexicon This repo uses Uppsala Persian Corpus (UPC) to construct a lexic

Saman Vaisipour 7 Apr 01, 2022
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022
Implementation for paper BLEU: a Method for Automatic Evaluation of Machine Translation

BLEU Score Implementation for paper: BLEU: a Method for Automatic Evaluation of Machine Translation Author: Ba Ngoc from ProtonX BLEU score is a popul

Ngoc Nguyen Ba 6 Oct 07, 2021
Official PyTorch implementation of Time-aware Large Kernel (TaLK) Convolutions (ICML 2020)

Time-aware Large Kernel (TaLK) Convolutions (Lioutas et al., 2020) This repository contains the source code, pre-trained models, as well as instructio

Vasileios Lioutas 28 Dec 07, 2022
Seonghwan Kim 24 Sep 11, 2022