Hierarchical Metadata-Aware Document Categorization under Weak Supervision (WSDM'21)

Overview

Hierarchical Metadata-Aware Document Categorization under Weak Supervision

This project provides a weakly supervised framework for hierarchical metadata-aware document categorization.

Links

Installation

For training, a GPU is strongly recommended.

Keras

The code is based on Keras. You can find installation instructions here.

Dependency

The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you need to first download the datasets. Three datasets are used in our paper: GitHub, ArXiv, and Amazon. Once you unzip the downloaded file (i.e., data.zip), you can see three folders related to these three datasets, respectively.

Dataset #Documents #Layers #Classes (including ROOT) #Leaves Sample Classes
GitHub 1,596 2 18 14 Computer Vision (Layer-1), Image Generation (Layer-2)
ArXiv 26,400 2 94 88 cs (Layer-1), cs.AI (Layer-2)
Amazon 147,000 2 166 147 Automotive (Layer-1), Car Care (Layer-2)

You need to put these 3 folders under the main folder ./. Then the following running script can be used to run the model.

./test.sh

Level-1/Level-2/Overall Micro-F1/Macro-F1 scores will be shown in the last several lines of the output. The classification result can be found under your dataset folder. For example, if you are using the GitHub dataset, the output will be ./github/out.txt.

Data

In each of the three folders (i.e., github/, arxiv/, and amazon/), there is a json file, where each line represents one document with text and metadata information.

For GitHub, the json format is

{
  "id": "Natsu6767/DCGAN-PyTorch",  
  "user": [
    "Natsu6767"
  ],
  "text": "pytorch implementation of dcgan trained on the celeba dataset deep convolutional gan ...",
  "tags": [
    "pytorch",
    "dcgan",
    "gan",
    "implementation",
    "deeplearning",
    "computer-vision",
    "generative-model"
  ],
  "labels": [
    "$Computer-Vision",
    "$Image-Generation"
  ]
}

The "user" and "tags" fields are metadata.

For ArXiv, the json format is

{
  "id": "1001.0063",
  "authors": [
    "Alessandro Epasto",
    "Enrico Nardelli"
  ],
  "text": "on a model for integrated information in this paper we give a thorough presentation ...",
  "labels": [
    "cs",
    "cs.AI"
  ]
}

The "authors" field is metadata.

For Amazon, the json format is

{
  "user": [
    "A39IXH6I0WT6TK"
  ],
  "product": [
    "B004DLPXAO"
  ],
  "text": "works really great only had a problem when it was updated but they fixed it right away ...",
  "labels": [
    "Apps-for-Android",
    "Books-&-Comics"
  ]
}

The "user" and "product" fields are metadata.

NOTE 1: If you would like to run our code on your own dataset, when you prepare this json file, make sure that: (1) You list the labels in the top-down order. For example, if the label path of your repository is ROOT-A-B-C, then the "labels" field should be ["A", "B", "C"]. (2) For each document, its metadata field is always represented by a list. For example, the "user" field should be ["A39IXH6I0WT6TK"] instead of "A39IXH6I0WT6TK".

Running on New Datasets

In the Quick Start section, we include a pretrained embedding file in the downloaded folders. If you would like to re-train the embedding (or you have a new dataset), please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./github).

  2. Prepare four files:
    (1) ./${dataset}/label_hier.txt indicating the parent children relationships between classes. The first class of each line is the parent class, followed by all its children classes. Whitespace is used as the delimiter. The root class must be named as ROOT. Make sure your class names do not contain whitespace.
    (2) ./${dataset}/doc_id.txt containing labeled document ids for each class. Each line begins with the class name, and then document ids in the corpus (starting from 0) of the corresponding class separated by whitespace.
    (3) ./${dataset}/${json-name}.json. You can refer to the provided json format above. Make sure it has two fields "text" and "labels". You can add your own metadata fields in the json.
    (4) ./${dataset}/meta_dict.json indicating the names of your metadata fields. For example, for GitHub, it should be

{"metadata": ["user", "tags"]}

For ArXiv, it should be

{"metadata": ["authors"]}
  1. Install the dependencies GSL and Eigen. For Eigen, we already provide a zip file JointEmbedding/eigen-3.3.3.zip. You can directly unzip it in JointEmbedding/. For GSL, you can download it here.

  2. ./prep_emb.sh. Make sure you change the dataset/json names. The embedding file will be saved to ./${dataset}/embedding_sph.

After that, you can train the classifier as mentioned in Quick Start (i.e., ./test.sh). Please always refer to the example datasets when adapting the code for a new dataset.

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2021hierarchical,
  title={Hierarchical Metadata-Aware Document Categorization under Weak Supervision},
  author={Zhang, Yu and Chen, Xiusi and Meng, Yu and Han, Jiawei},
  booktitle={WSDM'21},
  pages={770--778},
  year={2021},
  organization={ACM}
}
Owner
Yu Zhang
CS Ph.D. student at UIUC; Data Mining
Yu Zhang
Use of Attention Gates in a Convolutional Neural Network / Medical Image Classification and Segmentation

Attention Gated Networks (Image Classification & Segmentation) Pytorch implementation of attention gates used in U-Net and VGG-16 models. The framewor

Ozan Oktay 1.6k Dec 30, 2022
Paper Code:A Self-adaptive Weighted Differential Evolution Approach for Large-scale Feature Selection

1. SaWDE.m is the main function 2. DataPartition.m is used to randomly partition the original data into training sets and test sets with a ratio of 7

wangxb 14 Dec 08, 2022
Feature extraction made simple with torchextractor

torchextractor: PyTorch Intermediate Feature Extraction Introduction Too many times some model definitions get remorselessly copy-pasted just because

Antoine Broyelle 89 Oct 31, 2022
Tooling for the Common Objects In 3D dataset.

CO3D: Common Objects In 3D This repository contains a set of tools for working with the Common Objects in 3D (CO3D) dataset. Download the dataset The

Facebook Research 724 Jan 06, 2023
masscan + nmap + Finger

说明 个人根据使用习惯修改masnmap而来的一个小工具。调用masscan做全端口扫描,再调用nmap做服务识别,最后调用Finger做Web指纹识别。工具使用场景适合风险探测排查、众测等。 使用方法 安装依赖 pip3 install -r requirements.txt -i https:/

Ryan 3 Mar 25, 2022
Vikrant Deshpande 1 Nov 17, 2022
STBP is a way to train SNN with datasets by Backward propagation.

Spiking neural network (SNN), compared with depth neural network (DNN), has faster processing speed, lower energy consumption and more biological interpretability, which is expected to approach Stron

Ling Zhang 18 Dec 09, 2022
Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch

Semantic Segmentation Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch Features Applicable to followin

sithu3 530 Jan 05, 2023
Numerai tournament example scripts using NN and optuna

numerai_NN_example Numerai tournament example scripts using pytorch NN, lightGBM and optuna https://numer.ai/tournament Performance of my model based

Takahiro Maeda 12 Oct 10, 2022
NIMA: Neural IMage Assessment

PyTorch NIMA: Neural IMage Assessment PyTorch implementation of Neural IMage Assessment by Hossein Talebi and Peyman Milanfar. You can learn more from

Kyryl Truskovskyi 293 Dec 30, 2022
MediaPipeのPythonパッケージのサンプルです。2020/12/11時点でPython実装のある4機能(Hands、Pose、Face Mesh、Holistic)について用意しています。

mediapipe-python-sample MediaPipeのPythonパッケージのサンプルです。 2020/12/11時点でPython実装のある以下4機能について用意しています。 Hands Pose Face Mesh Holistic Requirement mediapipe 0.

KazuhitoTakahashi 217 Dec 12, 2022
HAT: Hierarchical Aggregation Transformers for Person Re-identification

HAT: Hierarchical Aggregation Transformers for Person Re-identification

11 Sep 05, 2022
Robust Self-augmentation for NER with Meta-reweighting

Robust Self-augmentation for NER with Meta-reweighting

Lam chi 17 Nov 22, 2022
Semi Supervised Learning for Medical Image Segmentation, a collection of literature reviews and code implementations.

Semi-supervised-learning-for-medical-image-segmentation. Recently, semi-supervised image segmentation has become a hot topic in medical image computin

Healthcare Intelligence Laboratory 1.3k Jan 03, 2023
General neural ODE and DAE modules for power system dynamic modeling.

Py_PSNODE General neural ODE and DAE modules for power system dynamic modeling. The PyTorch-based ODE solver is developed based on torchdiffeq. Sample

14 Dec 31, 2022
The Official Repository for "Generalized OOD Detection: A Survey"

Generalized Out-of-Distribution Detection: A Survey 1. Overview This repository is with our survey paper: Title: Generalized Out-of-Distribution Detec

Jingkang Yang 338 Jan 03, 2023
Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

41 Jan 04, 2023
Implementation of Self-supervised Graph-level Representation Learning with Local and Global Structure (ICML 2021).

Self-supervised Graph-level Representation Learning with Local and Global Structure Introduction This project is an implementation of ``Self-supervise

MilaGraph 50 Dec 09, 2022
BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

Pre-trained checkpoint and bert config json file Location of checkpoint and bert config json file This MLCommons members Google Drive location contain

SAIT (Samsung Advanced Institute of Technology) 12 Apr 27, 2022
Neural implicit reconstruction experiments for the Vector Neuron paper

Neural Implicit Reconstruction with Vector Neurons This repository contains code for the neural implicit reconstruction experiments in the paper Vecto

Congyue Deng 35 Jan 02, 2023