Weakly-supervised Text Classification Based on Keyword Graph

Overview

Weakly-supervised Text Classification Based on Keyword Graph

How to run?

Download data

Our dataset follows previous works. For long texts, we follow Conwea. For short texts, we follow LOTClass.
We transform all their data into unified json format.

  1. Download datasets from: https://drive.google.com/drive/folders/1D8E9T-vuBE-YdAd9OBy-yS4UW4AptA58?usp=sharing

    • Long text datasets(follow Conwea):

      • 20Newsgroup Fine(20NF)
      • 20Newsgroup Coarse(20NC)
      • NYT Fine(NYT_25)
      • NYT Coarse(NYT_5)
    • Short text datasets(follow LOTClass)

      • Agnews
      • dbpedia
      • imdb
      • amazon
  2. Unzip data into './data/processed'

Another way to obtain data (Not recommended):
You can download long text data from Conwea and short text data from LOTClass and transform data into json format using our code. The code is located at 'preprocess_data/process_long.py (process_short.py) You need to edit the preprocess code to change the dataset path to your downloaded path and change the taskname. The processed data is located in 'data/processed'. We alse provide preprocess code for X-class, which is 'process_x_class.py'.

Requirements

This project is based on python==3.8. The dependencies are as follow:

pytorch
DGL
yacs
visdom
transformers
scikit-learn
numpy
scipy

Train and Eval

  • Recommend to start visdom to show the results.
visdom -p 8888

Open the browser to the server_ip:8888 to show visdom panel.

  • Train:
    • First edit 'task/pipeline.py' to specify to config file and CUDA devices you used.
      Some configuration files are provided in the config folder.

    • Start training:

      python task/pipeline.py
      
    • Our code is based on multi GPUs, may be unable to run on single GPU currently.

Run on your custom dataset.

  1. provide datasets to dir data/processed.

    • keywords.json
      keywords for each class. type: dict. key: class_index. value: list containing all keywords for this class. See provided datasets for details.

    • unlabeled.json
      unlabeled sentences in our paper. type: list. item: list with 2 items([sentence_i,label_i]).
      In order to facilitate the evaluation, we are similar to Conwea's settings, where labels of sentences are provided. The labels are only used for evaluation.

  2. provide config to dir config. You can copy one of the existing config files and change some fields, like number_classes, classifier.type, data_dir_name etc.

  3. Specify the config file name in pipeline.py and run the pipeline code.

Citation

Please cite the following paper if you find our code helpful! Thank you very much.

Lu Zhang, Jiandong Ding, Yi Xu, Yingyao Liu and Shuigeng Zhou. "Weakly-supervised Text Classification Based on Keyword Graph". EMNLP 2021.

Owner
Hello_World
Computer Science at Fudan University.
Hello_World
Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

CLUST-consortium 43 Nov 28, 2022
RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network trained to work with different pairs (images, texts).

RuCLIPtiny Zero-shot image classification model for Russian language RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network

Shahmatov Arseniy 26 Sep 20, 2022
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
BiQE: Code and dataset for the BiQE paper

BiQE: Bidirectional Query Embedding This repository includes code for BiQE and the datasets introduced in Answering Complex Queries in Knowledge Graph

Bhushan Kotnis 1 Oct 20, 2021
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
My implementation of Safaricom Machine Learning Codility test. The code has bugs, logical I guess I made errors and any correction will be appreciated.

Safaricom_Codility Machine Learning 2022 The test entails two questions. Question 1 was on Machine Learning. Question 2 was on SQL I ran out of time.

Lawrence M. 1 Mar 03, 2022
NLP Core Library and Model Zoo based on PaddlePaddle 2.0

PaddleNLP 2.0拥有丰富的模型库、简洁易用的API与高性能的分布式训练的能力,旨在为飞桨开发者提升文本建模效率,并提供基于PaddlePaddle 2.0的NLP领域最佳实践。

6.9k Jan 01, 2023
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
HAN2HAN : Hangul Font Generation

HAN2HAN : Hangul Font Generation

Changwoo Lee 36 Dec 28, 2022
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
This repository serves as a place to document a toy attempt on how to create a generative text model in Catalan, based on GPT-2

GPT-2 Catalan playground and scripts to train a GPT-2 model either from scrath or from another pretrained model.

Laura 1 Jan 28, 2022
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
숭실대학교 컴퓨터학부 전공종합설계프로젝트

✨ 시각장애인을 위한 버스도착 알림 장치 ✨ 👀 개요 현대 사회에서 대중교통 위치 정보를 이용하여 사람들이 간단하게 이용할 대중교통의 정보를 얻고 쉽게 대중교통을 이용할 수 있다. 해당 정보는 각종 어플리케이션과 대중교통 이용시설에서 위치 정보를 제공하고 있지만 시각

taegyun 3 Jan 25, 2022
Sentence Embeddings with BERT & XLNet

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch This framework provides an easy method t

Ubiquitous Knowledge Processing Lab 9.1k Jan 02, 2023
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

Artifici Online Services inc. 74 Oct 07, 2022
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Microsoft 105 Jan 08, 2022
Exploration of BERT-based models on twitter sentiment classifications

twitter-sentiment-analysis Explore the relationship between twitter sentiment of Tesla and its stock price/return. Explore the effect of different BER

Sammy Cui 2 Oct 02, 2022