TFIDF-based QA system for AIO2 competition

Last update: Feb 19, 2022

Overview

AIO2 TF-IDF Baseline

This is a very simple question answering system, developed as a lightweight baseline for the AIO2 competition.

In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in the training dataset. In the inference stage, the model predicts the answer to an unseen question by computing dot-product scores between its TF-IDF features and those of every training question, then returning the answer of the most similar training question.

Therefore, in principle, the model cannot predict answers that do not appear in the training data.
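The idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in `train.py` or `predict.py`: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and all function names are assumptions made for illustration, and the toy questions are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization stands in for whatever
    # tokenizer the real training script uses.
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency over the training questions.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in training are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    # Assumption: vectors are L2-normalized before the dot product.
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def dot(a, b):
    # Dot product of two sparse vectors.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def train(questions, answers):
    token_lists = [tokenize(q) for q in questions]
    idf = fit_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    return {"idf": idf, "vectors": vectors, "answers": list(answers)}


def predict(model, question):
    # Return the answer attached to the most similar training question.
    query = tfidf_vector(tokenize(question), model["idf"])
    scores = [dot(query, v) for v in model["vectors"]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return model["answers"][best]


toy_model = train(
    [
        "what is the capital of japan",
        "who wrote the tale of genji",
        "what is the highest mountain in japan",
    ],
    ["Tokyo", "Murasaki Shikibu", "Mount Fuji"],
)
print(predict(toy_model, "which mountain is the highest in japan"))  # Mount Fuji
```

Because `predict` can only return an entry of `model["answers"]`, the sketch also makes the limitation concrete: any answer absent from the training set is unreachable.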

Steps to experiment with the model

Install requirements

$ pip install -r requirements.txt

Train

$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000
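The README does not document what the training options do. Judging only by their names, `--pos_list` plausibly keeps tokens with the listed parts of speech (here 名詞, nouns), `--stop_words` plausibly drops the listed tokens, and `--max_features` plausibly caps the vocabulary size. The sketch below illustrates that reading; the semantics, the function names, and the hand-tagged example tokens are all assumptions, not the behavior of `train.py`.

```python
from collections import Counter


def filter_tokens(tagged_tokens, pos_list, stop_words):
    # Keep tokens whose part of speech is allowed and that are not stop words.
    # tagged_tokens: list of (surface, part_of_speech) pairs (hypothetical format).
    allowed = set(pos_list)
    blocked = set(stop_words)
    return [surface for surface, pos in tagged_tokens
            if pos in allowed and surface not in blocked]


def build_vocabulary(token_lists, max_features):
    # Keep at most max_features terms, most frequent first.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [term for term, _ in counts.most_common(max_features)]


# Hand-tagged example tokens (illustrative only, not tokenizer output).
tagged = [
    ("日本", "名詞"), ("の", "助詞"), ("首都", "名詞"), ("は", "助詞"),
    ("どこ", "名詞"), ("でしょ", "助動詞"), ("う", "助動詞"),
]
kept = filter_tokens(tagged, pos_list=["名詞"], stop_words=["でしょ", "う"])
print(kept)  # ['日本', '首都', 'どこ']
```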

Predict

$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl

Building Docker image

$ docker build -t aio2-tfidf-baseline .

Test locally:

$ docker run --rm -v ":/app/input" -v ":/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl

Save the Docker image to a file:

$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz

License

The code in this repository is open-sourced under the MIT License.

Owner

Masatoshi Suzuki