Code for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020)

SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Introduction

SentiLARE is a sentiment-aware pre-trained language model enhanced by linguistic knowledge. You can read our paper for more details. This project is a PyTorch implementation of our work.

Dependencies

  • Python 3
  • NumPy
  • Scikit-learn
  • PyTorch >= 1.3.0
  • PyTorch-Transformers (Huggingface) 1.2.0
  • TensorboardX
  • Sentence Transformers 0.2.6 (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
  • NLTK (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)

Quick Start for Fine-tuning

Datasets of Downstream Tasks

Our experiments cover sentence-level sentiment classification (e.g. SST / MR / IMDB / Yelp-2 / Yelp-5) and aspect-level sentiment analysis (e.g. Lap14 / Res14 / Res16). You can download the pre-processed datasets (Google Drive / Tsinghua Cloud) of the downstream tasks. A detailed description of the data formats is included with the datasets.

Fine-tuning

To quickly conduct the fine-tuning experiments, you can directly download the checkpoint (Google Drive / Tsinghua Cloud) of our pre-trained model. Here is an example of fine-tuning SentiLARE on SST:

cd finetune
CUDA_VISIBLE_DEVICES=0,1,2 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_train \
          --do_eval \
          --max_seq_length 256 \
          --per_gpu_train_batch_size 4 \
          --learning_rate 2e-5 \
          --num_train_epochs 3 \
          --output_dir sent_finetune/sst \
          --logging_steps 100 \
          --save_steps 100 \
          --warmup_steps 100 \
          --eval_all_checkpoints \
          --overwrite_output_dir

Note that data_dir is set to the directory of the pre-processed SST dataset, and model_name_or_path is set to the directory of the pre-trained model checkpoint. output_dir is the directory where the fine-tuning checkpoints are saved. Refer to the fine-tuning code for descriptions of the other hyper-parameters.

More details about fine-tuning SentiLARE on other datasets can be found in finetune/README.MD.

POS Tagging and Polarity Acquisition for Downstream Tasks

During pre-processing, we tokenize the original datasets with NLTK, tag the sentences with the Stanford Log-Linear Part-of-Speech Tagger, and obtain the sentiment polarity with Sentence-BERT. A rough sketch of this pipeline is shown below.
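
The exact scripts live in this repository; the following minimal Python sketch only illustrates the idea. NLTK's built-in tagger stands in for the Stanford tagger, and the Sentence-BERT model name ('bert-base-nli-mean-tokens') and the similarity-weighted polarity scheme are illustrative assumptions, not the repository's exact code.

# A rough sketch of linguistic knowledge acquisition: POS-tag a sentence,
# then score each word's polarity by weighting its SentiWordNet sense
# scores with Sentence-BERT gloss similarity.
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet'), nltk.download('sentiwordnet')
import nltk
from nltk.corpus import sentiwordnet as swn
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sbert = SentenceTransformer('bert-base-nli-mean-tokens')

def to_wordnet_pos(penn_tag):
    # Map Penn Treebank tags to the coarse POS tags used by (Senti)WordNet.
    for prefix, wn_pos in (('J', 'a'), ('V', 'v'), ('N', 'n'), ('R', 'r')):
        if penn_tag.startswith(prefix):
            return wn_pos
    return None

def word_polarity(word, wn_pos, sent_emb):
    # Weight each sense's polarity by the cosine similarity between its
    # SentiWordNet gloss and the sentence embedding (an assumed scheme).
    senses = list(swn.senti_synsets(word, wn_pos))
    if not senses:
        return 0.0
    gloss_embs = sbert.encode([s.synset.definition() for s in senses])
    sims = cosine_similarity([sent_emb], gloss_embs)[0]
    scores = [s.pos_score() - s.neg_score() for s in senses]
    return float(sum(w * sc for w, sc in zip(sims, scores)) / (sims.sum() + 1e-8))

sentence = "The movie is surprisingly good ."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # stand-in for the Stanford tagger
sent_emb = sbert.encode([sentence])[0]
for word, penn in tagged:
    wn_pos = to_wordnet_pos(penn)
    if wn_pos is not None:
        print(word, penn, round(word_polarity(word, wn_pos, sent_emb), 3))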

Pre-training

If you want to conduct pre-training by yourself instead of directly using the checkpoint we provide, this section describes how to pre-process the pre-training dataset and run the pre-training scripts.

Dataset

We use the Yelp Dataset Challenge 2019 as our pre-training dataset. According to the Terms of Use of the Yelp dataset, you should download the dataset on your own.

POS Tagging and Polarity Acquisition for Pre-training Dataset

As in fine-tuning, we conduct part-of-speech tagging and sentiment polarity acquisition on the pre-training dataset. Since this dataset is quite large, pre-processing may take a long time: Sentence-BERT has to compute representation vectors for every sentence in the corpus.
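
As a rough illustration (not the repository's actual script), the sketch below streams reviews from the Yelp dump and encodes their sentences with Sentence-BERT in batches; the file name, "text" field, and batch size are assumptions about the Yelp JSON layout.

# A minimal sketch, assuming review.json holds one JSON object per line
# whose "text" field contains the review. Requires nltk.download('punkt').
import json
import nltk
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer('bert-base-nli-mean-tokens')

def iter_sentences(path):
    # Stream review sentences without loading the whole dump into memory.
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield from nltk.sent_tokenize(json.loads(line)['text'])

sentences = list(iter_sentences('yelp_dataset/review.json'))
# encode() batches internally; a larger batch size amortizes GPU overhead.
embeddings = sbert.encode(sentences, batch_size=64)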

Pre-training

Refer to pretrain/README.MD for more implementation details about pre-training.

Citation

@inproceedings{ke-etal-2020-sentilare,
    title = "{S}enti{LARE}: Sentiment-Aware Language Representation Learning with Linguistic Knowledge",
    author = "Ke, Pei  and Ji, Haozhe  and Liu, Siyang  and Zhu, Xiaoyan  and Huang, Minlie",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6975--6988",
}

Please kindly cite our paper if the paper and the code are helpful.

Thanks

Many thanks to the GitHub repositories of Transformers and BERT-PT. Part of our code is modified from theirs.
