Vector space based Information Retrieval System for Text Processing - Information retrieval

Last update: Jan 01, 2022

Related tags

Text Processing BITS-IR-PROJECT

Overview

Information Retrieval: Text Processing

Group 13

Sequence of operations

Install the requirements.
Add the given Wikipedia files to the corpus directory.
Download the glove.6B.100d.txt dataset (skip this step if it is already present) and place it in the project root directory.
Run construct_index.py
Run construct_index.py --zoned_index True
Run trim_embeddings.py
Run test_queries.py
Run test_queries.py --score_title True
Run test_queries.py --expand_query True
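
The run steps above can be chained from a small driver script. The following is only a sketch: the script names and flags are the ones listed in this README, while `PIPELINE`, `build_commands` and `run_pipeline` are hypothetical names introduced for illustration. It assumes the scripts are started from the project root and that the requirements, corpus files and GloVe file are already in place.

```python
import subprocess
import sys

# Steps in the order given in "Sequence of operations".
# Script names and flags come from this README; nothing else is assumed.
PIPELINE = [
    ["construct_index.py"],
    ["construct_index.py", "--zoned_index", "True"],
    ["trim_embeddings.py"],
    ["test_queries.py"],
    ["test_queries.py", "--score_title", "True"],
    ["test_queries.py", "--expand_query", "True"],
]


def build_commands(python=sys.executable):
    """Return the full argument list for every step."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_commands():
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` would actually execute them in order.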

Installing Requirements:

   pip install -r requirements.txt

corpus

Contains the files to be indexed. Add files directly to this directory. Do not create subdirectories.
For this assignment, we used the following files from the AA folder of the Wikipedia files.
wiki_00
wiki_01
wiki_05
wiki_06
wiki_10
wiki_11
wiki_15
wiki_16
wiki_20
wiki_21
wiki_25
wiki_26
wiki_30
wiki_31

index_files

Contains the inverted indices constructed using construct_index.py.

construct_index.py

Constructs the inverted indices used for query evaluation.
Command Line Arguments:
--zoned_index: True if zoned indexing must be used. Set to False by default.
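
To illustrate the difference between a plain and a zoned inverted index, here is a minimal sketch. It is not the code in construct_index.py: the function names, the `(title, body)` document layout, the zone names and the simple tokenizer are all assumptions made for this example. Each posting maps a document id to the term frequency in that zone.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, zoned=False):
    """Build zone -> term -> {doc_id: term frequency}.

    docs maps doc_id -> (title, body). Without zoning, title and body are
    indexed together under a single zone named "text" (an assumption).
    """
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, (title, body) in docs.items():
        if zoned:
            zones = {"title": title, "body": body}
        else:
            zones = {"text": f"{title} {body}"}
        for zone, text in zones.items():
            for term in tokenize(text):
                postings = index[zone][term]
                postings[doc_id] = postings.get(doc_id, 0) + 1
    return index
```

With a zoned index, a match in the title can later be weighted differently from a match in the body, which is what a title-scoring option needs.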

trim_embeddings.py

Trims the GloVe embeddings so that they contain only terms that occur in the corpus. Download the glove.6B.100d.txt dataset before running this file.
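
The trimming idea can be sketched as follows. The GloVe text format stores one token per line followed by its space-separated vector components, so filtering on the first field is enough. The function names, the `vocabulary` set and the destination path are hypothetical; the actual script may work differently.

```python
def trim_embeddings(lines, vocabulary):
    """Yield only the embedding lines whose token is in the vocabulary."""
    for line in lines:
        # The token is everything before the first space on the line.
        word, _, _ = line.partition(" ")
        if word in vocabulary:
            yield line


def trim_file(src_path, dst_path, vocabulary):
    """Stream the source file so the full embedding table never sits in memory."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(trim_embeddings(src, vocabulary))
```

Here `vocabulary` would be the set of terms found in the corpus, for example the keys of the inverted index.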

test_queries.py

Evaluates queries and displays retrieved documents.
Command Line Arguments:
--score_title: True if the zoned index must be considered during evaluation. Set to False by default.
--expand_query: True if query expansion must be used. Set to False by default.
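
As a rough illustration of vector space retrieval with optional query expansion, the sketch below ranks documents by cosine similarity between tf-idf vectors and expands a query with the nearest neighbours of its terms in an embedding table. The weighting scheme, the function names and the data layout are assumptions chosen for this example, not a description of test_queries.py, and title scoring is not covered.

```python
import math
from collections import Counter


def tfidf_vector(term_counts, doc_freq, n_docs):
    """Weight terms by (1 + log tf) * log(N / df); skip terms not in the corpus."""
    return {
        term: (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])
        for term, tf in term_counts.items()
        if doc_freq.get(term)
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_cosine(a, b):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=1):
    """Append up to k nearest embedding neighbours of each original query term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        candidates = [w for w in embeddings if w not in expanded]
        candidates.sort(
            key=lambda w: dense_cosine(embeddings[term], embeddings[w]),
            reverse=True,
        )
        expanded.extend(candidates[:k])
    return expanded


def rank(query_terms, docs, top_k=10):
    """Rank docs (doc_id -> list of tokens) by cosine similarity to the query."""
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    q_vec = tfidf_vector(Counter(query_terms), doc_freq, n_docs)
    scores = {
        doc_id: cosine(q_vec, tfidf_vector(Counter(tokens), doc_freq, n_docs))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

A query would first be tokenized, optionally passed through `expand_query` with the trimmed embeddings, and then handed to `rank`.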

helper_module.py

Contains helper functions used by other files. Do not run this file.

document_list.txt

Contains the document ids and names used for evaluation.
