Analyse Japanese ebooks using MeCab to determine the difficulty level for Japanese learners

Last update: Jul 23, 2022

Overview

japanese-ebook-analysis

The aim of this project is to make analysing the contents of a Japanese ebook easy, and to streamline the process for non-technical users. You can analyse an ebook and see the following information:

The length of the book in words
The length of the book in characters
The number of unique words used in the book
The number of unique words that are only used once in the book
The percentage of unique words that are only used once
The number of unique characters used
The number of unique characters that are only used once
The percentage of unique characters that are only used once
A list of all the words used in the book as well as how often they are used
A list of all the characters used in the book as well as how often they are used

For text processing, we use MeCab.
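
The statistics listed above can be illustrated with a short sketch. This is not the project's actual code: the function name `frequency_stats` and the sample token list are hypothetical, and the input is assumed to be already split into words (for example by MeCab). Whether punctuation counts as a word is also an assumption here; the sketch simply counts whatever tokens it is given.

```python
from collections import Counter


def frequency_stats(items):
    """Summarise a sequence of words (or a string of characters).

    Hypothetical helper for illustration only; not part of the repo.
    """
    counts = Counter(items)
    unique = len(counts)
    # Items that occur exactly once in the whole text
    used_once = sum(1 for n in counts.values() if n == 1)
    return {
        "length": sum(counts.values()),
        "unique": unique,
        "used_once": used_once,
        "used_once_pct": 100 * used_once / unique if unique else 0.0,
        # (item, count) pairs, most frequent first
        "frequencies": counts.most_common(),
    }


# Hypothetical pre-tokenised sample; a real run would use tokens
# produced from the ebook's text instead.
words = ["猫", "は", "魚", "を", "食べる", "。",
         "犬", "は", "肉", "を", "食べる", "。"]

word_stats = frequency_stats(words)
# Iterating over a string yields its characters one by one
char_stats = frequency_stats("".join(words))

print(word_stats["length"], word_stats["unique"], word_stats["used_once"])
print(char_stats["length"], char_stats["unique"], char_stats["used_once"])
```

The same helper serves both the word and the character statistics, because a `Counter` treats a list of tokens and a string of characters alike.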

Usage

Currently, the project is not deployed anywhere, so to use the service you will need to follow the steps in the Development section below to get the server running.

Upload a .epub file containing Japanese text to the server.
The server will redirect you to a page showing information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of all the words used together with the number of occurrences of each word, and the same for the characters.

Development

Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
Make sure you have MeCab set up on your system. See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up. (Only required if you will actually upload ebooks or run the analyse_epub.py script, which you do not need to do to contribute to other parts of the app.)
Install Python dependencies: pip install -r requirements.txt
Install other dependencies (these all need to be in your system path):
- pandoc
Run ./app.py to start the Flask dev server
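
Before starting the server, it can help to confirm that the external tools are actually on your system path. The following is a minimal sketch, not part of the repo: the function name `missing_tools` is hypothetical, and it assumes the MeCab command-line program is installed under the name `mecab` (depending on how MeCab was installed, the Python bindings may work even without that executable on the path).

```python
import shutil


def missing_tools(tools=("pandoc", "mecab")):
    """Return the names in `tools` that cannot be found on the system path.

    Hypothetical helper for illustration only; the default tool names
    are assumptions about how the executables are called.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on the system path:", ", ".join(missing))
    else:
        print("All external tools found.")
```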

Contributing

I'm very happy for any contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work, have a look at the repo's Issues and its Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Owner

Christoffer Aakre
