Japanese synonym library

Overview

chikkarpy

PyPi version test

chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar.

chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

単体でも同義語辞書の検索ツールとして利用できます。

利用方法 Usage

TL;DR

$ pip install chikkarpy

$ echo "閉店" | chikkarpy
閉店    クローズ,close,店仕舞い

Step 1. chikkarpyのインストール

$ pip install chikkarpy

Step 2. 使用方法

コマンドライン

$ echo "閉店" | chikkarpy
閉店    クローズ,close,店仕舞い

chikkarpyは入力された単語を見て一致する同義語のリストを返します。 同義語辞書内の曖昧性フラグが1の見出し語をトリガーにすることはできません。 出力はクエリ\t同義語リストの形式です。

$ chikkarpy search -h
usage: chikkarpy search [-h] [-d [file [file ...]]] [-ev] [-o file] [-v]
                        [file [file ...]]

Search synonyms

positional arguments:
  file                  text written in utf-8

optional arguments:
  -h, --help            show this help message and exit
  -d [file [file ...]]  synonym dictionary (default: system synonym
                        dictionary)
  -ev                   Enable verb and adjective synonyms.
  -o file               the output file
  -v, --version         print chikkarpy version

自分で用意したユーザー辞書を使いたい場合は-dで読み込むバイナリ辞書を指定できます。 (バイナリ辞書のビルドは辞書の作成を参照してください。) 複数辞書を読み込む場合は順番に注意してください。 以下の場合,user2 > user > system の順で同義語を検索して見つかった時点で検索結果を返します。

chikkarpy -d system.dic user.dic user2.dic

また、出力はデフォルトで体言のみです。 用言も出力したい場合は-evを有効にしてください。

$ echo "開放" | chikkarpy
開放	オープン,open
$ echo "開放" | chikkarpy -ev
開放	開け放す,開く,オープン,open

python ライブラリ

使用例

from chikkarpy import Chikkar
from chikkarpy.dictionarylib import Dictionary

chikkar = Chikkar()

system_dic = Dictionary("system.dic", False)
chikkar.add_dictionary(system_dic)

print(chikkar.find("閉店"))
# => ['クローズ', 'close', '店仕舞い']

print(chikkar.find("閉店", group_ids=[5])) # グループIDによる検索
# => ['クローズ', 'close', '店仕舞い']

print(chikkar.find("開放"))
# => ['オープン', 'open']

chikkar.enable_verb() # 用言の出力制御(デフォルトは体言のみ出力)
print(chikkar.find("開放"))
# => ['開け放す', '開く', 'オープン', 'open']

chikkar.add_dictionary()で複数の辞書を読み込ませる場合は順番に注意してください。 最後に読み込んだ辞書を優先して検索します。

辞書の作成 Build a dictionary

新しく辞書を追加する場合は、利用前にバイナリ形式辞書の作成が必要です。 Before using new dictionary, you need to create a binary format dictionary.

$ chikkarpy build -i synonym_dict.csv -o system.dic 
$ chikkarpy build -h
usage: chikkarpy build [-h] -i file [-o file] [-d string]

Build Synonym Dictionary

optional arguments:
  -h, --help  show this help message and exit
  -i file     dictionary file (csv)
  -o file     output file (default: synonym.dic)
  -d string   description comment to be embedded on dictionary

開発者向け

Code Format

scripts/lint.sh を実行して、コードが正しいフォーマットかを確認してください。

flake8 flake8-import-order flake8-builtins が必要です。

Test

scripts/test.sh を実行してテストしてください。

Contact

chikkarpyはWAP Tokushima Laboratory of AI and NLPによって開発されています。

開発者やユーザーの方々が質問したり議論するためのSlackワークスペースを用意しています。

You might also like...
Script to download some free japanese lessons in portuguse from NHK
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

A Japanese tokenizer based on recurrent neural networks
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!
Comments
  • pip install does not work under SudachiPy 0.6.x environment / SudachiPy 0.6.x の環境下で pip install が通らない

    pip install does not work under SudachiPy 0.6.x environment / SudachiPy 0.6.x の環境下で pip install が通らない

    temporary solution / 暫定的な解決方法

    Install SudachiPy 0.5.4, then chikkarpy, then reinstall the latest version of SudachiPy. SudachiPy 0.5.4 をインストールしてから、chikkarpy をインストールし、その後 SudachiPy 最新版を再インストールする。

    pip install sudachipy==0.5.4 --upgrade
    pip install sudachidict_core
    pip install chikkarpy
    pip install sudachipy --upgrade
    
    opened by Nishihara-Daiki 1
  • chikkarpy has no attribute 'dictionarylib' in certain cases

    chikkarpy has no attribute 'dictionarylib' in certain cases

    case 1: raised ERROR if call chikkarpy.dictionarylib

    $ pip install chikkarpy
    $ python
    >>> import chikkarpy
    >>> chikkarpy.dictionarylib
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: module 'chikkarpy' has no attribute 'dictionarylib'
    

    case 2: pass if use from chikkarpy import dictionarylib

    $ pip install chikkarpy
    $ python
    >>> from chikkarpy import dictionarylib
    >>> dictionarylib
    <module 'chikkarpy.dictionarylib' from '/usr/local/lib/python3.7/dist-packages/chikkarpy/dictionarylib/__init__.py'>
    

    case 3: pass if call chikkarpy.dictionarylib AFTER from chikkarpy import dictionarylib

    $ pip install chikkarpy
    $ python
    >>> import chikkarpy
    >>> from chikkarpy import dictionarylib
    >>> chikkarpy.dictionarylib
    <module 'chikkarpy.dictionarylib' from '/usr/local/lib/python3.7/dist-packages/chikkarpy/dictionarylib/__init__.py'>
    
    opened by Nishihara-Daiki 0
Releases(v0.1.1)
Owner
Works Applications
Works Applications
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
C.J. Hutto 3.8k Dec 30, 2022
A workshop with several modules to help learn Feast, an open-source feature store

Workshop: Learning Feast This workshop aims to teach users about Feast, an open-source feature store. We explain concepts & best practices by example,

Feast 52 Jan 05, 2023
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Converts text into a PDF of handwritten notes

Text To Handwritten Notes Converts text into a PDF of handwritten notes Explore the docs » · Report Bug · Request Feature · Steps: $ git clone https:/

UVSinghK 63 Oct 09, 2022
Code and data accompanying Natural Language Processing with PyTorch

Natural Language Processing with PyTorch Build Intelligent Language Applications Using Deep Learning By Delip Rao and Brian McMahan Welcome. This is a

Joostware 1.8k Jan 01, 2023
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
Text Classification Using LSTM

Text classification is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new ar

KrishArul26 3 Jan 03, 2023
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
Bnagla hand written document digiiztion

Bnagla hand written document digiiztion This repo addresses the problem of digiizing hand written documents in Bangla. Documents have definite fields

Mushfiqur Rahman 1 Dec 10, 2021
Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021.

capbot-siic Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021. Problem Inspiration A plethora

Aryan Kargwal 19 Feb 17, 2022
Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps

youtube-dl and ffmpeg Windows Explorer Integration Download videos from YouTube/Twitch/Twitter and more (any platform that is supported by youtube-dl)

Wolfgang 226 Dec 30, 2022
Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

Michael Gale 1 Jan 06, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation

Well-formed Limericks and Haikus with GPT2 📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation In collaboration with Matthew Korahais &

Bardia Shahrestani 2 May 26, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022