Automatic library of congress classification, using word embeddings from book titles and synopses.

Overview

Automatic Library of Congress Classification

The Library of Congress Classification (LCC) is a comprehensive classification system that was first developed in the late nineteenth and early twentieth centuries to organize and arrange the book collections of the Library of Congress. The vast complexity of this system has made manual book classification for it quite challenging and time-consuming. This is what has motivated research in automating this process, as can be seen in Larson RR (1992), Frank and Paynter (2004), and Ávila-Argüelles et al. (2010).

In this work we propose the usage of word embeddings, made possible by recent advances in NLP, to take advantage of the fairly rich semantic information that they provide. Usage of word embeddings allows us to effectively use the information in the synposis of the books which contains a great deal of information about the record. We hypothesize that the usage of word embeddings and incorporating synopses would yield better performance over the classifcation task, while also freeing us from relying on Library of Congress Subject Headings (LCSH), which are expensive annotations that previous work has used.

To test out our hypotheses we designed Naive Bayes classifiers, Support Vector Machines, Multi-Layer Perceptrons, and LSTMs to predict 15 of 21 Library of Congress classes. The LSTM model with large BERT embeddings outperformed all other models and was able to classify documents with 76% accuracy when trained on a document’s title and synopsis. This is competitive with previous models that classified documents using their Library of Congress Subject Headings.

For a more detailed explanation of our work, please see our project report.


Dependencies

To run our code, you need the following packages:

scikit-learn=1.0.1
pytorch=1.10.0
python=3.9.7
numpy=1.21.4
notebook=6.4.6
matplotlib=3.5.0
gensim=4.1.2
tqdm=4.62.3
transformers=4.13.0
nltk=3.6.5
pandas=1.3.4
seaborn=0.11.2

Checklist

  1. Install the python packages listed above with requirements.txt
$ pip install -r requirements.txt

or any other package manager you would like.

  1. Set PYTHONPATH to the root of this folder by running the command below at the root directory of the project.
$ export PYTHONPATH=$(PWD)
  1. Download the data needed from this link and put it in the project root folder. Make sure the folder is called github_data.

For the features (tf_idf, w2v, and BERT), you can also use the runner python scripts in "runner" folder to create features.

Use the command below to build all the features. The whole features preparation steps take around 2.5 hours.

$ python runner/build_all_features.py

Due to its large memory consumption, the process might crash along the way. If that's the case, please try again by running the same command. The script is able to pick up on where it left of.

Build each feature separately

BERT embeddings

$ python runner/build_bert_embeddings.py --model_size=small  

W2V embeddings

For this one, you will need to run the generate_w2v_embedddings.ipynb notebook.

tf-idf features

$ python runner/build_tfidf_features.py

If the download still fails, then please download the data directly from our Google Drive [Link] (BERT small and large unavailable).

Running the training code for non-sequential model

Starting point
The main notebook for running all the models is in this notebook [Link].
Note that the training process required preprocessed embeddings data which lies in "github_data" folder.

Caching
Note that once each model finishes fitting to the data, the code also stored the result model as a pickle file in the "_cache" folder.

Training code for sequential model

These notebooks for LSTM on BERT and word2vec ware all located in the report/nnn folder. (e.g., [Link].

The rnn codes (LSTM, GRU) can also be found in iml_group_proj/model/bert_[lstm|gpu].py

Contributors (in no specific order)

  • Katie Warburton - Researched previous automatic LCC attempts and found the dataset. Wrote the introduction and helped to write the discussion. Researched and understood the MARC 21 bibliographic standard to parse through the dataset and extract documents with an LCC, title, and synopsis. Balanced the dataset and split it into a train and test set. Described data balancing and the dataset in the report. - katie-warburton

  • Yujie Chen - Trained and assessed the performance of SVM models and reported the SVM and general model development approaches and relevant results. - Yujie-C

  • Teerapat Chaiwachirasak - Wrote the code for generating tf-idf features and BERT embeddings. Trained Naive Bayes and MLP on tf-idf features and BERT embeddings. Wrote training pipelines that take ML models from the whole team and train them together in one same workflow with multiple data settings (title only, synopsis only, and title + synopsis) to get a summarized and unified result. Trained LSTM models on BERT embeddings on (Google Collab). - Teerapat12

  • Ahmad Pourihosseini - Wrote the code for generating word2vec embeddings and its corresponding preprocessing and the code for MLP and LSTM models on these embeddings. Came up with and implemented the idea of visualizing the averaged embeddings. Wrote the parts of the report corresponding to these sections. - ahmad-PH

Owner
Ahmad Pourihosseini
Ahmad Pourihosseini
The hippynn python package - a modular library for atomistic machine learning with pytorch.

The hippynn python package - a modular library for atomistic machine learning with pytorch. We aim to provide a powerful library for the training of a

Los Alamos National Laboratory 37 Dec 29, 2022
Python utility to generate filesystem content for Obsidian.

Security Vault Generator Quickly parse, format, and output common frameworks/content for Obsidian.md. There is a strong focus on MITRE ATT&CK because

Justin Angel 73 Dec 02, 2022
A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation (ICCV 2021)

A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation (ICCV 2021) This repository contains the official implemen

81 Dec 14, 2022
A web porting for NVlabs' StyleGAN2, to facilitate exploring all kinds characteristic of StyleGAN networks

This project is a web porting for NVlabs' StyleGAN2, to facilitate exploring all kinds characteristic of StyleGAN networks. Thanks for NVlabs' excelle

K.L. 150 Dec 15, 2022
An implementation of "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" (ICML 2019).

MixHop and N-GCN ⠀ A PyTorch implementation of "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" (ICML 2019)

Benedek Rozemberczki 393 Dec 13, 2022
[NeurIPS 2021] Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data (NeurIPS 2021) This repository will provide the official PyTorch implementa

Liming Jiang 238 Nov 25, 2022
Random Erasing Data Augmentation. Experiments on CIFAR10, CIFAR100 and Fashion-MNIST

Random Erasing Data Augmentation =============================================================== black white random This code has the source code for

Zhun Zhong 654 Dec 26, 2022
SparseML is a libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models

SparseML is a toolkit that includes APIs, CLIs, scripts and libraries that apply state-of-the-art sparsification algorithms such as pruning and quantization to any neural network. General, recipe-dri

Neural Magic 1.5k Dec 30, 2022
Microscopy Image Cytometry Toolkit

Cytokit Cytokit is a collection of tools for quantifying and analyzing properties of individual cells in large fluorescent microscopy datasets with a

Hammer Lab 106 Jan 06, 2023
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

DeepCam Shenzhen 1.4k Jan 07, 2023
dualPC.R contains the R code for the main functions.

dualPC.R contains the R code for the main functions. dualPC_sim.R contains an example run with the different PC versions; it calls dualPC_algs.R whic

3 May 30, 2022
JugLab 33 Dec 30, 2022
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning"

deepGCFX PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning" Pr

Thilini Cooray 4 Aug 11, 2022
Official Pytorch implementation of "DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network" (CVPR'21)

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network Pytorch implementation for our DivCo. We propose a simple ye

64 Nov 22, 2022
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer

CSAW-M This repository contains code for CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer. Source code for tr

Yue Liu 7 Oct 11, 2022
New AidForBlind - Various Libraries used like OpenCV and other mentioned in Requirements.txt

AidForBlind Recommended PyCharm IDE Various Libraries used like OpenCV and other

Aalhad Chandewar 1 Jan 13, 2022
Code for SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics (ACL'2020).

SentiBERT Code for SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics (ACL'2020). https://arxiv.org/abs/20

Da Yin 66 Aug 13, 2022
HAT: Hierarchical Aggregation Transformers for Person Re-identification

HAT: Hierarchical Aggregation Transformers for Person Re-identification

11 Sep 05, 2022
The Submission for SIMMC 2.0 Challenge 2021

The Submission for SIMMC 2.0 Challenge 2021 challenge website Requirements python 3.8.8 pytorch 1.8.1 transformers 4.8.2 apex for multi-gpu nltk Prepr

5 Jul 26, 2022