Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

Overview

IMDB Sentiment Analysis

This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial Intelligence and Automation

Training

To train a model (CNN, LSTM, Transformer), simply run

python train.py --cfg <./model/xxx> --save <./save/>

You can change the configuration in config.

Model

LSTM

we follow the origin LSTM as possible

lstm

CNN

we adopt the methods mentioned in Effective Use of Word Order for Text Categorization with Convolutional Neural Networks

cnn

Transformer

We use the original Transformer Encoder as Attention is all you need and use the concept of CLS Token as BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

trans

Experiment result

Model Comparison

Model Accuracy
LSTM 89.02
Transformer 87.47
CNN 88.66
Fine-tuned BERT 93.43

LSTM

Batch size
Batch size Loss Accuracy
64 0.4293 0.8802
128 0.4298 0.8818
256 0.4304 0.8836
512 0.4380 0.8807
Embedding Size
Embedding size train Loss train Accuracy val loss val accuracy
32 0.4021 0.9127 0.4419 0.8707
64 0.3848 0.9306 0.4297 0.8832
128 0.3772 0.9385 0.4265 0.8871
256 0.3584 0.9582 0.4303 0.8825
512 0.3504 0.9668 0.4295 0.8838
Drop out
Drop out rate Train Loss Train Accuracy Test loss Test Accuracy
0.0 0.3554 0.9623 0.4428 0.8704
0.1 0.3475 0.9696 0.4353 0.8780
0.2 0.3516 0.9652 0.4312 0.8825
0.3 0.3577 0.9589 0.4292 0.8844
0.4 0.3587 0.9576 0.4272 0.8868
0.5 0.3621 0.9544 0.4269 0.8865
0.6 0.3906 0.9242 0.4272 0.8863
0.7 0.3789 0.9356 0.4303 0.8826
0.8 0.3939 0.9204 0.4311 0.8826
0.9 0.4211 0.8918 0.4526 0.8584
Weight decay
Weight decay train loss train accuracy test loss test accuracy
1.0e-8 0.3716 0.9436 0.4261 0.8876
1.0e-7 0.3803 0.9349 0.4281 0.8862
1.0e-6 0.3701 0.9456 0.4264 0.8878
1.0e-5 0.3698 0.9461 0.4283 0.8850
1.0e-4 0.3785 0.9377 0.4318 0.8806
Number layers

Number of LSTM blocks

Number layers train loss train accuracy test loss test accuracy
1 0.3786 0.9364 0.4291 0.8844
2 0.3701 0.9456 0.4264 0.8878
3 0.3707 0.9451 0.4243 0.8902
4 0.3713 0.9446 0.4279 0.8857

CNN

out channel size
out size train acc test acc
8 0.9679 0.8743
16 0.9791 0.8767
32 0.9824 0.8811
64 0.9891 0.8848
128 0.9915 0.8824
256 0.9909 0.8827
512 0.9920 0.8841
1024 0.9959 0.8833
multi scale filter
Number train acc test acc
1 [5] 0.9698 0.8748
2 [5, 11] 0.9852 0.8827
3 [5, 11, 17] 0.9890 0.8850
4 [5, 11, 17, 23] 0.9915 0.8848
5 [5, 11, 17, 23, 29] 0.9924 0.8842
6 [5, 11, 17, 23, 29, 35] 0.9930 0.8836
step train acc test acc
2 [5 7 9] 0.9878 0.8816
4 [5 9 11] 0.9890 0.8816
6 [5 11 17] 0.9919 0.8834
8 [5 13 21] 0.9884 0.8836
10[5 15 25] 0.9919 0.8848
12[5 17 29] 0.9898 0.8812
14[5 29 43] 0.9935 0.8809
Owner
Daniel
Daniel
TalkNet: Audio-visual active speaker detection Model

Is someone talking? TalkNet: Audio-visual active speaker detection Model This repository contains the code for our ACM MM 2021 paper, TalkNet, an acti

142 Dec 14, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

ASReview hackathon for Follow the Money 2 Nov 28, 2021
Simple text to phones converter for multiple languages

Phonemizer -- foʊnmaɪzɚ The phonemizer allows simple phonemization of words and texts in many languages. Provides both the phonemize command-line tool

CoML 762 Dec 29, 2022
SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Erre Quadro Srl 384 Dec 12, 2022
Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023
The proliferation of disinformation across social media has led the application of deep learning techniques to detect fake news.

Fake News Detection Overview The proliferation of disinformation across social media has led the application of deep learning techniques to detect fak

Kushal Shingote 1 Feb 08, 2022
Minimal GUI for accessing the Watson Text to Speech service.

Description Minimal graphical application for accessing the Watson Text to Speech service. Requirements Python 3 plus all dependencies listed in requi

Moritz Maxeiner 1 Oct 22, 2021
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

NL-Augmenter 🦎 → 🐍 The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformat

684 Jan 09, 2023
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
Python implementation of TextRank for phrase extraction and summarization of text documents

PyTextRank PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to: extract the top-ranked phrases from text document

derwen.ai 1.9k Jan 06, 2023
Open source code for AlphaFold.

AlphaFold This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP

DeepMind 9.7k Jan 02, 2023
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Maksim Zhdanov 7 Sep 20, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022