File-based TF-IDF

Calculates keywords in a document, using a word corpus.

Why?

Because I found myself with hundreds of plain text files, with no way to know what each one contains. I then recalled this thing called TF-IDF from university, but found no utility that operates on files. Hence, here we are.

How?

Basically, each word in the current document gets a score. The score increases each time the word appears in this document, and decreases each time it appears in another document. The words with the highest scores will thus (theoretically) be the keywords.

Of course, this requires you to have many other documents (the corpus) to compare with. They should be written in roughly the same kind of language as the document. For example, it makes sense to split a book into chapters and use those as the corpus. Use your judgment.
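
One way to build such a corpus from a single book is sketched below. This is not part of tfidf.py; the function names, the output file prefix, and the assumption that every chapter starts with a line of the form "CHAPTER <number>" are all hypothetical and would need adjusting to the actual text.

```python
import re
from pathlib import Path


def split_into_chapters(text: str) -> list[str]:
    """Split a book at lines starting with 'CHAPTER <number>' (assumed heading format)."""
    # The lookahead keeps each heading at the start of its own chapter.
    parts = re.split(r"(?m)^(?=CHAPTER \d+)", text)
    # Drop front matter before the first heading and any empty pieces.
    return [p.strip() for p in parts if p.startswith("CHAPTER")]


def write_chapters(book_path: str, out_dir: str, prefix: str = "chapter") -> list[Path]:
    """Write each chapter to its own text file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(book_path).read_text(encoding="utf-8")
    paths = []
    for i, chapter in enumerate(split_into_chapters(text), start=1):
        path = out / f"{prefix}{i}.txt"
        path.write_text(chapter + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The resulting files can then be passed to tfidf as the corpus, with one of them as the input document.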

Installation

Copy tfidf.py to some location on $PATH

Usage

usage: tfidf [-h] [--json] [--min-df MIN_DF] [-n N | --all] --input-document INPUT_DOCUMENT [corpus ...]

Calculates keywords in a document, using a word corpus.

positional arguments:
  corpus                corpus files (optional but highly recommended)

options:
  -h, --help            show this help message and exit
  --json, -j            get output as json
  --min-df MIN_DF       if a word occurs less than this number of times in the corpus, it's not considered (default: 2)
  -n N                  limit output to this many words (default: 10)
  --all                 Don't limit the amount of words to output (default: false)
  --input-document INPUT_DOCUMENT, -i INPUT_DOCUMENT
                        document file to extract keywords from

Examples

To get the top 10 keywords for chapter 1 of Moby Dick:

# assume that *.txt matches all other chapters of mobydick
$ tfidf -n 10 -i mobydick_chapter1.txt *.txt

WORD             TF_IDF           TF               
passenger        0.003            0.002            
whenever         0.003            0.002            
money            0.003            0.002            
passengers       0.002            0.001            
purse            0.002            0.001            
me               0.002            0.011            
image            0.002            0.001            
hunks            0.002            0.001            
respectfully     0.002            0.001            
robust           0.002            0.001            
-----
num words in corpus: 208425

To get all keywords as JSON:

$ tfidf --all -j -i mobydick_chapter1.txt *.txt
[
    {
        "word": "lazarus",
        "tf_idf": 0.0052818627137794375,
        "tf": 0.0028169014084507044
    },
    {
        "word": "frost",
        "tf_idf": 0.004433890895007659,
        "tf": 0.0028169014084507044
    },
    {
        "word": "bedford",
        "tf_idf": 0.0037492766733561254,
        "tf": 0.0028169014084507044
    },
    ...
]
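
The JSON form is convenient for further processing. The sketch below is one possible way to consume it in Python; it assumes that the output of --json is a single JSON array of objects with the keys "word", "tf_idf" and "tf", as in the sample above. The helper name top_keywords and its parameters are hypothetical.

```python
import json


def top_keywords(json_text: str, n: int = 5, min_score: float = 0.0) -> list[str]:
    """Return up to n words with tf_idf >= min_score, highest score first."""
    entries = json.loads(json_text)
    kept = [e for e in entries if e["tf_idf"] >= min_score]
    kept.sort(key=lambda e: e["tf_idf"], reverse=True)
    return [e["word"] for e in kept[:n]]


# The three entries shown in the sample output above.
sample = """
[
    {"word": "lazarus", "tf_idf": 0.0052818627137794375, "tf": 0.0028169014084507044},
    {"word": "frost", "tf_idf": 0.004433890895007659, "tf": 0.0028169014084507044},
    {"word": "bedford", "tf_idf": 0.0037492766733561254, "tf": 0.0028169014084507044}
]
"""
print(top_keywords(sample, n=2))  # ['lazarus', 'frost']
```

In practice the text would come from redirecting the tool's output to a file and reading that file back.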

TF-IDF equations

t — term (word)
d — document (set of words)
corpus — (set of documents)
N — number of documents in corpus

tf(t,d) = count of t in d / number of words in d
df(t) = number of documents in the corpus that contain t
idf(t) = log10(N / df(t))

tf_idf(t, d) = tf(t, d) * idf(t)
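
As a rough illustration of these equations, here is a minimal Python sketch. It is not the code in tfidf.py. The tokenizer is a simplifying assumption, the document is counted as part of the corpus, --min-df is not modelled, and idf is taken as log10(N / df(t)), which is an assumption about the tool; it is consistent with the sample output above, where "me" has a TF_IDF below its TF.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(document: str, others: list[str]) -> dict[str, float]:
    """Score every word of `document` against the corpus `others + [document]`."""
    corpus = others + [document]
    n_docs = len(corpus)
    # df(t): number of corpus documents that contain t at least once.
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    words = tokenize(document)
    scores = {}
    for word, count in Counter(words).items():
        tf = count / len(words)
        idf = math.log10(n_docs / df[word])  # log base 10 is an assumption
        scores[word] = tf * idf
    return scores


scores = tf_idf("the whale the sea", ["the ship", "the harbor the sea"])
print(max(scores, key=scores.get))  # whale
```

Here "the" occurs in every document, so its idf and its score are 0, while "whale" occurs only in the input document and ranks first.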

Owner

Jakob Lindskog