A Chinese word segmenter based on CRF

Overview

Genius

Genius is an open-source Python Chinese word segmentation library built on the CRF (Conditional Random Field) algorithm.

Feature

  • Supports Python 2.x, Python 3.x, and PyPy 2.x.
  • Supports simple pinyin segmentation.
  • Supports user-defined breaks.
  • Supports user-defined merge dictionaries.
  • Supports part-of-speech tagging.

Source Install

  • Install git: 1) Ubuntu or Debian: apt-get install git 2) Fedora or Red Hat: yum install git
  • Download the code: git clone https://github.com/duanhongyi/genius.git
  • Install it: python setup.py install

Pypi Install

  • Run: easy_install genius or pip install genius

Algorithm

  • Uses a trie for merge-dictionary lookup (a conceptual sketch follows this list).
  • Implements CRF segmentation on top of wapiti.
  • The default dictionaries can be overridden via genius.loader.ResourceLoader.
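
The trie-based merge lookup can be illustrated with a short, self-contained sketch. This is only a conceptual illustration with hypothetical dictionary entries, not genius's internal implementation:

#encoding=utf-8
# Conceptual sketch: a trie that finds the longest merge-dictionary word
# starting at a given position (not genius's internal code).

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def longest_match(self, text, start):
        # Walk the trie, remembering the end of the longest word seen so far.
        node, best_end = self.root, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best_end = i + 1
        return text[start:best_end] if best_end else None

# Hypothetical merge dictionary.
trie = Trie([u'中国经济', u'企业家'])
print(trie.longest_match(u'中国经济当前', 0))  # prints: 中国经济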

Feature 1: segmentation with genius.seg_text

  • genius.seg_text takes five parameters, of which only text is required:
  • text: the first parameter, the string to segment
  • use_break: whether to apply break handling to the segmentation result; default True
  • use_combine: whether to merge words using the dictionary; default False
  • use_tagging: whether to apply part-of-speech tagging; default True
  • use_pinyin_segment: whether to segment pinyin; default True

Code example (full-featured segmentation)

#encoding=utf-8
import genius
text = u"""昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。"""
seg_list = genius.seg_text(
    text,
    use_combine=True,
    use_pinyin_segment=True,
    use_tagging=True,
    use_break=True
)
print('\n'.join(['%s\t%s' % (word.text, word.tagging) for word in seg_list]))
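
Each item returned by genius.seg_text exposes word.text and word.tagging, so the output can be post-processed, for example to keep only words with a given part of speech. The sketch below assumes the tag labels follow the 1998 People's Daily conventions, where noun tags start with 'n'; check the tags printed by the example above for the labels your model actually emits.

#encoding=utf-8
import genius

text = u'昨天,我和施瓦布先生一起与部分企业家进行了交流。'
seg_list = genius.seg_text(text, use_tagging=True)

# Keep only noun-like words. This assumes word.tagging is a string using
# People's Daily style labels ('n', 'nr', 'ns', ...); adjust to the tags
# your installation actually prints.
nouns = [word.text for word in seg_list if word.tagging and word.tagging.startswith('n')]
print(', '.join(nouns))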

Feature 2: index-oriented segmentation

  • genius.seg_keywords is designed for search-engine indexing and preserves ambiguous segmentations; only text is required (see the indexing sketch after the code example below).
  • text: the first parameter, the string to segment
  • use_break: whether to apply break handling to the segmentation result; default True
  • use_tagging: whether to apply part-of-speech tagging; default False
  • use_pinyin_segment: whether to segment pinyin; default False
  • Because word merging conflicts semantically with this method, no merge option is provided; if you build your index with this method, using genius.seg_text with use_combine=True at query time is not recommended.

Code example

#encoding=utf-8
import genius

seg_list = genius.seg_keywords(u'南京市长江大桥')
print('\n'.join([word.text for word in seg_list]))
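
Because seg_keywords preserves ambiguous splits, its output maps naturally onto an inverted index: each candidate token points to the documents containing it, and queries are segmented the same way. The toy index below is only an illustration; the documents and data structure are made up and are not part of genius:

#encoding=utf-8
from collections import defaultdict

import genius

# Toy inverted index: token -> set of document ids (illustration only).
documents = {
    1: u'南京市长江大桥',
    2: u'长江大桥位于南京',
}

index = defaultdict(set)
for doc_id, doc in documents.items():
    for word in genius.seg_keywords(doc):
        index[word.text].add(doc_id)

# Segment the query with seg_keywords as well, then intersect the postings.
hits = set(documents)
for word in genius.seg_keywords(u'长江大桥'):
    hits &= index.get(word.text, set())
print(sorted(hits))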

Feature 3: keyword extraction

  • genius.extract_tag is designed for extracting tag keywords; only text is required.
  • text: the first parameter, the string to segment
  • use_break: whether to apply break handling to the segmentation result; default True
  • use_combine: whether to merge words using the dictionary; default False
  • use_pinyin_segment: whether to segment pinyin; default False

Code example

#encoding=utf-8
import genius

tag_list = genius.extract_tag(u'南京市长江大桥')
print('\n'.join(tag_list))

Other notes

  • The current training corpus comes from the January 1998 People's Daily, so segmentation is most accurate on news-style text.
  • CRF segmentation quality depends heavily on the category and coverage of the training corpus; with a better corpus, both segmentation and tagging still have considerable room to improve.
Owner
duanhongyi
a simple programmer.