Automatic library of congress classification, using word embeddings from book titles and synopses.

Overview

Automatic Library of Congress Classification

The Library of Congress Classification (LCC) is a comprehensive classification system that was first developed in the late nineteenth and early twentieth centuries to organize and arrange the book collections of the Library of Congress. The vast complexity of this system has made manual book classification for it quite challenging and time-consuming. This is what has motivated research in automating this process, as can be seen in Larson RR (1992), Frank and Paynter (2004), and Ávila-Argüelles et al. (2010).

In this work we propose the usage of word embeddings, made possible by recent advances in NLP, to take advantage of the fairly rich semantic information that they provide. Usage of word embeddings allows us to effectively use the information in the synposis of the books which contains a great deal of information about the record. We hypothesize that the usage of word embeddings and incorporating synopses would yield better performance over the classifcation task, while also freeing us from relying on Library of Congress Subject Headings (LCSH), which are expensive annotations that previous work has used.

To test out our hypotheses we designed Naive Bayes classifiers, Support Vector Machines, Multi-Layer Perceptrons, and LSTMs to predict 15 of 21 Library of Congress classes. The LSTM model with large BERT embeddings outperformed all other models and was able to classify documents with 76% accuracy when trained on a document’s title and synopsis. This is competitive with previous models that classified documents using their Library of Congress Subject Headings.

For a more detailed explanation of our work, please see our project report.


Dependencies

To run our code, you need the following packages:

scikit-learn=1.0.1
pytorch=1.10.0
python=3.9.7
numpy=1.21.4
notebook=6.4.6
matplotlib=3.5.0
gensim=4.1.2
tqdm=4.62.3
transformers=4.13.0
nltk=3.6.5
pandas=1.3.4
seaborn=0.11.2

Checklist

  1. Install the python packages listed above with requirements.txt
$ pip install -r requirements.txt

or any other package manager you would like.

  1. Set PYTHONPATH to the root of this folder by running the command below at the root directory of the project.
$ export PYTHONPATH=$(PWD)
  1. Download the data needed from this link and put it in the project root folder. Make sure the folder is called github_data.

For the features (tf_idf, w2v, and BERT), you can also use the runner python scripts in "runner" folder to create features.

Use the command below to build all the features. The whole features preparation steps take around 2.5 hours.

$ python runner/build_all_features.py

Due to its large memory consumption, the process might crash along the way. If that's the case, please try again by running the same command. The script is able to pick up on where it left of.

Build each feature separately

BERT embeddings

$ python runner/build_bert_embeddings.py --model_size=small  

W2V embeddings

For this one, you will need to run the generate_w2v_embedddings.ipynb notebook.

tf-idf features

$ python runner/build_tfidf_features.py

If the download still fails, then please download the data directly from our Google Drive [Link] (BERT small and large unavailable).

Running the training code for non-sequential model

Starting point
The main notebook for running all the models is in this notebook [Link].
Note that the training process required preprocessed embeddings data which lies in "github_data" folder.

Caching
Note that once each model finishes fitting to the data, the code also stored the result model as a pickle file in the "_cache" folder.

Training code for sequential model

These notebooks for LSTM on BERT and word2vec ware all located in the report/nnn folder. (e.g., [Link].

The rnn codes (LSTM, GRU) can also be found in iml_group_proj/model/bert_[lstm|gpu].py

Contributors (in no specific order)

  • Katie Warburton - Researched previous automatic LCC attempts and found the dataset. Wrote the introduction and helped to write the discussion. Researched and understood the MARC 21 bibliographic standard to parse through the dataset and extract documents with an LCC, title, and synopsis. Balanced the dataset and split it into a train and test set. Described data balancing and the dataset in the report. - katie-warburton

  • Yujie Chen - Trained and assessed the performance of SVM models and reported the SVM and general model development approaches and relevant results. - Yujie-C

  • Teerapat Chaiwachirasak - Wrote the code for generating tf-idf features and BERT embeddings. Trained Naive Bayes and MLP on tf-idf features and BERT embeddings. Wrote training pipelines that take ML models from the whole team and train them together in one same workflow with multiple data settings (title only, synopsis only, and title + synopsis) to get a summarized and unified result. Trained LSTM models on BERT embeddings on (Google Collab). - Teerapat12

  • Ahmad Pourihosseini - Wrote the code for generating word2vec embeddings and its corresponding preprocessing and the code for MLP and LSTM models on these embeddings. Came up with and implemented the idea of visualizing the averaged embeddings. Wrote the parts of the report corresponding to these sections. - ahmad-PH

Owner
Ahmad Pourihosseini
Ahmad Pourihosseini
A Python type explainer!

typesplainer A Python typehint explainer! Available as a cli, as a website, as a vscode extension, as a vim extension Usage First, install the package

Typesplainer 79 Dec 01, 2022
Asterisk is a framework to generate high-quality training datasets at scale

Asterisk is a framework to generate high-quality training datasets at scale

Mona Nashaat 44 Apr 25, 2022
Pretty Tensor - Fluent Neural Networks in TensorFlow

Pretty Tensor provides a high level builder API for TensorFlow. It provides thin wrappers on Tensors so that you can easily build multi-layer neural networks.

Google 1.2k Dec 29, 2022
Evaluation toolkit of the informative tracking benchmark comprising 9 scenarios, 180 diverse videos, and new challenges.

Informative-tracking-benchmark Informative tracking benchmark (ITB) higher diversity. It contains 9 representative scenarios and 180 diverse videos. m

Xin Li 15 Nov 26, 2022
paper: Hyperspectral Remote Sensing Image Classification Using Deep Convolutional Capsule Network

DC-CapsNet This is a tensorflow and keras based implementation of DC-CapsNet for HSI in the Remote Sensing Letters R. Lei et al., "Hyperspectral Remot

LEI 7 Nov 29, 2022
METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)

Nautilus-OCR The National Library of Luxembourg (BnL) started its first initiative in digitizing newspapers, with layout recognition and OCR on articl

National Library of Luxembourg 36 Dec 05, 2022
Gym environment for FLIPIT: The Game of "Stealthy Takeover"

gym-flipit Gym environment for FLIPIT: The Game of "Stealthy Takeover" invented by Marten van Dijk, Ari Juels, Alina Oprea, and Ronald L. Rivest. Desi

Lisa Oakley 2 Dec 15, 2021
This repository contains code for the paper "Decoupling Representation and Classifier for Long-Tailed Recognition", published at ICLR 2020

Classifier-Balancing This repository contains code for the paper: Decoupling Representation and Classifier for Long-Tailed Recognition Bingyi Kang, Sa

Facebook Research 820 Dec 26, 2022
Automatic 2D-to-3D Video Conversion with CNNs

Deep3D: Automatic 2D-to-3D Video Conversion with CNNs How To Run To run this code. Please install MXNet following the official document. Deep3D requir

Eric Junyuan Xie 1.2k Dec 30, 2022
ChainerRL is a deep reinforcement learning library built on top of Chainer.

ChainerRL and PFRL ChainerRL (this repository) is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement al

Chainer 1.1k Jan 01, 2023
Evaluating AlexNet features at various depths

Linear Separability Evaluation This repo provides the scripts to test a learned AlexNet's feature representation performance at the five different con

Yuki M. Asano 32 Dec 30, 2022
A GUI for Face Recognition, based upon Docker, Tkinter, GPU and a camera device.

Face Recognition GUI This repository is a GUI version of Face Recognition by Adam Geitgey, where e.g. Docker and Tkinter are utilized. All the materia

Kasper Henriksen 6 Dec 05, 2022
On Size-Oriented Long-Tailed Graph Classification of Graph Neural Networks

On Size-Oriented Long-Tailed Graph Classification of Graph Neural Networks We provide the code (in PyTorch) and datasets for our paper "On Size-Orient

Zemin Liu 4 Jun 18, 2022
PointPillars inference with TensorRT

A project demonstrating how to use CUDA-PointPillars to deal with cloud points data from lidar.

NVIDIA AI IOT 315 Dec 31, 2022
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

ASAPP Research 2.1k Jan 01, 2023
Code repository for the paper "Tracking People with 3D Representations"

Tracking People with 3D Representations Code repository for the paper "Tracking People with 3D Representations" (paper link) (project site). Jathushan

Jathushan Rajasegaran 77 Dec 03, 2022
Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

An Image is Worth 16x16 Words, What is a Video Worth? paper Official PyTorch Implementation Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor DAMO Academy, Al

213 Nov 12, 2022
Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"

This is the official repository of my book "Deep Learning with PyTorch Step-by-Step". Here you will find one Jupyter notebook for every chapter in the book.

Daniel Voigt Godoy 340 Jan 01, 2023
PyTorch3D is FAIR's library of reusable components for deep learning with 3D data

Introduction PyTorch3D provides efficient, reusable components for 3D Computer Vision research with PyTorch. Key features include: Data structure for

Facebook Research 6.8k Jan 01, 2023