Hierarchical Metadata-Aware Document Categorization under Weak Supervision (WSDM'21)

Overview

This project provides a weakly supervised framework for hierarchical metadata-aware document categorization.

Installation

For training, a GPU is strongly recommended.

Keras

The code is based on Keras. See the official Keras documentation for installation instructions.

Dependencies

The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, first download the datasets. The paper uses three datasets: GitHub, ArXiv, and Amazon. Unzipping the downloaded file (data.zip) yields three folders, one per dataset.

Dataset | #Documents | #Layers | #Classes (including ROOT) | #Leaves | Sample Classes
GitHub  | 1,596      | 2       | 18                        | 14      | Computer Vision (Layer-1), Image Generation (Layer-2)
ArXiv   | 26,400     | 2       | 94                        | 88      | cs (Layer-1), cs.AI (Layer-2)
Amazon  | 147,000    | 2       | 166                       | 147     | Automotive (Layer-1), Car Care (Layer-2)

Put these three folders under the main folder (./). Then run the model with the following script:

./test.sh

Level-1/Level-2/Overall Micro-F1/Macro-F1 scores will be shown in the last several lines of the output. The classification result can be found under your dataset folder. For example, if you are using the GitHub dataset, the output will be ./github/out.txt.
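
For readers unfamiliar with the two metrics: Micro-F1 pools true positives, false positives, and false negatives over all classes, while Macro-F1 averages the per-class F1 scores. The following is a minimal sketch, assuming exactly one gold label and one predicted label per document at a given level. The function name micro_macro_f1 and the toy labels are hypothetical; this is not the evaluation code shipped with the repository, which may average over a different class set.

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed
    # Assumption: macro-average over classes seen in gold or pred.
    classes = sorted(set(gold) | set(pred))

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# Toy Level-1 example (made-up labels, not real results).
gold = ["cs", "cs", "math", "physics"]
pred = ["cs", "math", "math", "physics"]
micro, macro = micro_macro_f1(gold, pred)
print("Micro-F1: %.4f, Macro-F1: %.4f" % (micro, macro))
```

In the single-label setting assumed here, Micro-F1 equals plain accuracy, which is why the two scores can diverge noticeably when small classes are misclassified.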

Data

In each of the three folders (i.e., github/, arxiv/, and amazon/), there is a JSON file in which each line represents one document with its text and metadata. The examples below are pretty-printed for readability.

For GitHub, the JSON format is

{
  "id": "Natsu6767/DCGAN-PyTorch",
  "user": [
    "Natsu6767"
  ],
  "text": "pytorch implementation of dcgan trained on the celeba dataset deep convolutional gan ...",
  "tags": [
    "pytorch",
    "dcgan",
    "gan",
    "implementation",
    "deeplearning",
    "computer-vision",
    "generative-model"
  ],
  "labels": [
    "$Computer-Vision",
    "$Image-Generation"
  ]
}

The "user" and "tags" fields are metadata.

For ArXiv, the JSON format is

{
  "id": "1001.0063",
  "authors": [
    "Alessandro Epasto",
    "Enrico Nardelli"
  ],
  "text": "on a model for integrated information in this paper we give a thorough presentation ...",
  "labels": [
    "cs",
    "cs.AI"
  ]
}

The "authors" field is metadata.

For Amazon, the JSON format is

{
  "user": [
    "A39IXH6I0WT6TK"
  ],
  "product": [
    "B004DLPXAO"
  ],
  "text": "works really great only had a problem when it was updated but they fixed it right away ...",
  "labels": [
    "Apps-for-Android",
    "Books-&-Comics"
  ]
}

The "user" and "product" fields are metadata.

NOTE: To run our code on your own dataset, make sure the JSON file you prepare satisfies two conditions. (1) Labels are listed in top-down order. For example, if the label path of a document is ROOT-A-B-C, then the "labels" field should be ["A", "B", "C"]. (2) Every metadata field is a list, even when it holds a single value. For example, the "user" field should be ["A39IXH6I0WT6TK"] instead of "A39IXH6I0WT6TK".
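
The list requirement in the note above can be checked mechanically before training. The following is a minimal sketch, assuming one JSON document per line as described earlier. The helper names check_document and check_lines are hypothetical and not part of the repository. The top-down label order cannot be verified without the class hierarchy, so this sketch only checks field presence and types.

```python
import json

def check_document(doc, metadata_fields):
    """Return a list of problems found in one parsed JSON document."""
    problems = []
    for field in ("text", "labels"):
        if field not in doc:
            problems.append("missing field: " + field)
    if not isinstance(doc.get("labels", []), list):
        problems.append("labels must be a list ordered from top layer to leaf")
    for field in metadata_fields:
        # A metadata field may be absent, but if present it must be a list.
        if field in doc and not isinstance(doc[field], list):
            problems.append("metadata field must be a list: " + field)
    return problems

def check_lines(lines, metadata_fields):
    """Check an iterable of JSON lines; return {line_number: problems}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        problems = check_document(json.loads(line), metadata_fields)
        if problems:
            report[number] = problems
    return report

# Two made-up lines: the second stores "user" as a string, not a list.
sample = [
    '{"user": ["A39IXH6I0WT6TK"], "text": "works great", "labels": ["A", "B"]}',
    '{"user": "A39IXH6I0WT6TK", "text": "works great", "labels": ["A", "B"]}',
]
report = check_lines(sample, ["user"])
print(report)
```

To check a real dataset, pass an open file handle for your JSON file in place of sample, together with the metadata field names of that dataset.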

Running on New Datasets

The folders downloaded in the Quick Start section include a pretrained embedding file. If you would like to re-train the embedding (or you have a new dataset), please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./github).

  2. Prepare four files:
    (1) ./${dataset}/label_hier.txt indicating the parent-child relationships between classes. The first class on each line is the parent class, followed by all of its child classes. Whitespace is used as the delimiter. The root class must be named ROOT. Make sure your class names do not contain whitespace.
    (2) ./${dataset}/doc_id.txt containing the labeled document ids for each class. Each line begins with the class name, followed by the ids (document ids in the corpus, starting from 0) of the labeled documents of that class, separated by whitespace.
    (3) ./${dataset}/${json-name}.json. You can refer to the JSON format provided above. It must contain the two fields "text" and "labels". You can add your own metadata fields as well.
    (4) ./${dataset}/meta_dict.json indicating the names of your metadata fields. For example, for GitHub, it should be

{"metadata": ["user", "tags"]}

For ArXiv, it should be

{"metadata": ["authors"]}

  3. Install the dependencies GSL and Eigen. For Eigen, we already provide a zip file JointEmbedding/eigen-3.3.3.zip, which you can unzip directly in JointEmbedding/. GSL (the GNU Scientific Library) must be downloaded separately.

  4. Run ./prep_emb.sh. Make sure you change the dataset and JSON names first. The embedding file will be saved to ./${dataset}/embedding_sph.

After that, you can train the classifier as mentioned in Quick Start (i.e., ./test.sh). Please always refer to the example datasets when adapting the code for a new dataset.
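
When the label paths are already present in the JSON file, the lines of label_hier.txt can be derived from them rather than written by hand. The following is a minimal sketch, assuming every "labels" list is a complete top-down path and that a single space is an acceptable delimiter. The function name hierarchy_lines and the sample paths are hypothetical. Whether leaf classes need lines of their own is not stated above, so compare the result with the provided example datasets.

```python
import json
from collections import OrderedDict

def hierarchy_lines(label_paths):
    """Build 'parent child1 child2 ...' lines from top-down label paths."""
    children = OrderedDict()
    for path in label_paths:
        parent = "ROOT"  # the root class must be named ROOT
        for label in path:
            if any(ch.isspace() for ch in label):
                raise ValueError("class name contains whitespace: %r" % label)
            siblings = children.setdefault(parent, [])
            if label not in siblings:
                siblings.append(label)
            parent = label
    return [" ".join([parent] + kids) for parent, kids in children.items()]

# Made-up label paths in the style of the ArXiv dataset.
paths = [["cs", "cs.AI"], ["cs", "cs.CL"], ["math", "math.CO"]]
lines = hierarchy_lines(paths)
print("\n".join(lines))

# meta_dict.json content for a dataset whose only metadata field is "authors".
meta_dict = json.dumps({"metadata": ["authors"]})
print(meta_dict)
```

Writing the joined lines to ./${dataset}/label_hier.txt and the JSON string to ./${dataset}/meta_dict.json gives two of the four required files; doc_id.txt still has to list the labeled documents you choose for each class.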

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2021hierarchical,
  title={Hierarchical Metadata-Aware Document Categorization under Weak Supervision},
  author={Zhang, Yu and Chen, Xiusi and Meng, Yu and Han, Jiawei},
  booktitle={WSDM'21},
  pages={770--778},
  year={2021},
  organization={ACM}
}
Owner
Yu Zhang
CS Ph.D. student at UIUC; Data Mining