File-based TF-IDF

Calculates keywords in a document, using a word corpus.

Why?

Because I found myself with hundreds of plain text files, with no way to know what each one contains. I then recalled this thing called TF-IDF from university, but found no utility that operates on files. Hence, here we are.

How?

Basically, each word in the current document gets a score. The score increases each time the word appears in this document, and decreases the more it appears in other documents. The words with the highest scores will thus (theoretically) be the keywords.

Of course, this requires you to have many other documents (the corpus) to compare with. They should use approximately the same language. For example, it makes sense to split a book into chapters and use those as the corpus. Use your judgment.
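If your book is one big text file, a minimal sketch like the following could split it into chapter files that then serve as the corpus. This is not part of tfidf.py: the function names, the heading pattern (lines starting with "CHAPTER") and the output file naming are all assumptions made for illustration, so adjust them to your text.

```python
import re
from pathlib import Path


def split_into_chapters(text, heading_pattern=r"(?m)^CHAPTER\s+\S+.*$"):
    """Split text into chapters, each starting at a heading line.

    Assumption: chapters begin with a line matching heading_pattern.
    Anything before the first heading (front matter) is dropped.
    """
    starts = [m.start() for m in re.finditer(heading_pattern, text)]
    if not starts:
        # No headings found: treat the whole text as a single chapter.
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def write_chapters(chapters, out_dir, prefix="chapter"):
    """Write each chapter to out_dir/<prefix><n>.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chapter in enumerate(chapters, start=1):
        path = out / f"{prefix}{i}.txt"
        path.write_text(chapter, encoding="utf-8")
        paths.append(path)
    return paths
```

With prefix="mobydick_chapter" this would produce file names like the ones used in the examples further down.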

Installation

Copy tfidf.py to some location on $PATH

Usage

usage: tfidf [-h] [--json] [--min-df MIN_DF] [-n N | --all] --input-document INPUT_DOCUMENT [corpus ...]

Calculates keywords in a document, using a word corpus.

positional arguments:
  corpus                corpus files (optional but highly recommended)

options:
  -h, --help            show this help message and exit
  --json, -j            get output as json
  --min-df MIN_DF       if a word occurs less than this number of times in the corpus, it's not considered (default: 2)
  -n N                  limit output to this many words (default: 10)
  --all                 Don't limit the amount of words to output (default: false)
  --input-document INPUT_DOCUMENT, -i INPUT_DOCUMENT
                        document file to extract keywords from
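To make the effect of --min-df, -n and --all more concrete, here is a rough sketch of how such filtering could work. It is only an illustration of the options as described in the help text, not the code from tfidf.py: the function name filter_and_limit and its inputs are hypothetical, and how the real tool counts occurrences in the corpus (total occurrences or number of documents) is an assumption here.

```python
from collections import Counter


def filter_and_limit(scores, corpus_counts, min_df=2, n=10):
    """Drop rare words, then keep the n best-scoring ones.

    scores:        mapping word -> score for the input document
    corpus_counts: mapping word -> how often the word occurs in the corpus
                   (assumed meaning of "occurs" in the --min-df help text)
    n:             maximum number of words, or None to keep all (like --all)
    """
    kept = {
        word: score
        for word, score in scores.items()
        if corpus_counts.get(word, 0) >= min_df
    }
    # Highest score first.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return ranked if n is None else ranked[:n]
```

In this sketch a word that shows up only once in the whole corpus is never reported, no matter how high its score is, which is presumably the point of the default of 2.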

Examples

To get the top 10 keywords for chapter 1 of Moby Dick:

# assumes that *.txt matches all the other chapters of Moby Dick
$ tfidf -n 10 -i mobydick_chapter1.txt *.txt

WORD             TF_IDF           TF               
passenger        0.003            0.002            
whenever         0.003            0.002            
money            0.003            0.002            
passengers       0.002            0.001            
purse            0.002            0.001            
me               0.002            0.011            
image            0.002            0.001            
hunks            0.002            0.001            
respectfully     0.002            0.001            
robust           0.002            0.001            
-----
num words in corpus: 208425

$ tfidf --all -j -i mobydick_chapter1.txt *.txt
[
    {
        "word": "lazarus",
        "tf_idf": 0.0052818627137794375,
        "tf": 0.0028169014084507044
    },
    {
        "word": "frost",
        "tf_idf": 0.004433890895007659,
        "tf": 0.0028169014084507044
    },
    {
        "word": "bedford",
        "tf_idf": 0.0037492766733561254,
        "tf": 0.0028169014084507044
    },
    ...
]
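The JSON output is handy when you want to process the keywords in another script. A minimal sketch, assuming the output has exactly the shape shown above (a list of objects with "word", "tf_idf" and "tf" keys); the function name load_keywords is made up for this example:

```python
import json


def load_keywords(json_text, k=None):
    """Parse tfidf's JSON output and return (word, tf_idf) pairs.

    The pairs are sorted by tf_idf, highest first. k limits the number
    of pairs returned; None returns all of them.
    """
    entries = json.loads(json_text)
    ranked = sorted(entries, key=lambda entry: entry["tf_idf"], reverse=True)
    return [(entry["word"], entry["tf_idf"]) for entry in ranked[:k]]
```

You could, for instance, redirect the output of the command above to a file and pass that file's contents to load_keywords.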

TF-IDF equations

t — term (word)
d — document (set of words)
corpus — (set of documents)
N — number of documents in corpus

tf(t,d) = count of t in d / number of words in d
df(t) = number of the N documents in which t occurs
idf(t) = N/df(t)

tf_idf(t, d) = tf(t, d) * idf(t)
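The equations above can be written down almost literally in Python. The following is a sketch of the formulas as stated here, not the actual code of tfidf.py: the function name tf_idf_scores is hypothetical, documents are assumed to be already split into lists of lowercase words, and df(t) is taken to be the number of corpus documents containing t. The real tool may tokenize, filter or scale differently.

```python
from collections import Counter


def tf_idf_scores(document, corpus):
    """Score every word of `document` against `corpus`.

    document: list of words (the document d)
    corpus:   list of documents, each a list of words (N = len(corpus))
    Returns a dict mapping word -> tf(t, d) * idf(t).
    """
    n_docs = len(corpus)
    # Sets make the "does this document contain t?" check cheap.
    corpus_sets = [set(doc) for doc in corpus]
    counts = Counter(document)
    total_words = len(document)

    scores = {}
    for term, count in counts.items():
        df = sum(1 for words in corpus_sets if term in words)
        if df == 0:
            # idf(t) = N / df(t) is undefined for words absent from the
            # corpus; this sketch simply skips them.
            continue
        tf = count / total_words
        idf = n_docs / df
        scores[term] = tf * idf
    return scores
```

For example, with a four-document corpus in which "whale" appears only in the input document and makes up half of its words, the score is 0.5 * 4 / 1 = 2.0, while a word that appears in every document gets an idf of 1 and so scores just its tf.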

Owner

Jakob Lindskog
