An assignment on creating a minimalist neural network toolkit for CS11-747

Overview

minnn

by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik

This is an exercise in developing a minimalist neural network toolkit for NLP, part of Carnegie Mellon University's CS11-747: Neural Networks for NLP.

The most important files it contains are the following:

  1. minnn.py: This is the file you need to complete. It is a very minimalist version of a dynamic neural network toolkit (like PyTorch or DyNet). Some code is provided, but important functionality is missing.
  2. classifier.py: Training code for a Deep Averaging Network for text classification using minnn. Feel free to modify it to build a better model, but the original version of classifier.py must also run with your minnn.py implementation.
  3. setup.py: This is blank, but if your classifier implementation needs to download data (e.g. pre-trained word embeddings), you can implement that here. It will be run before your implementation of classifier.py.
  4. data/: Two datasets, one from the Stanford Sentiment Treebank with tree info removed and another from IMDb reviews.
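
For readers unfamiliar with the model named above, a Deep Averaging Network averages the word embeddings of a sentence and feeds the average through feed-forward layers to get class scores. The following is a minimal sketch in plain numpy, not the code in classifier.py and not the minnn API: the function name `dan_forward`, the single tanh hidden layer, and all sizes and weights are assumptions chosen for illustration.

```python
import numpy as np

def dan_forward(token_ids, emb, w_hid, b_hid, w_out, b_out):
    """Forward pass of a one-hidden-layer Deep Averaging Network (sketch)."""
    avg = emb[token_ids].mean(axis=0)       # average the word embeddings
    hidden = np.tanh(w_hid @ avg + b_hid)   # nonlinear hidden layer (assumed tanh)
    logits = w_out @ hidden + b_out         # one score per class
    exp = np.exp(logits - logits.max())     # subtract max for numerical stability
    return exp / exp.sum()                  # softmax probabilities

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
vocab_size, emb_dim, hid_dim, n_classes = 50, 8, 16, 5
emb = rng.normal(size=(vocab_size, emb_dim))
w_hid, b_hid = rng.normal(size=(hid_dim, emb_dim)), np.zeros(hid_dim)
w_out, b_out = rng.normal(size=(n_classes, hid_dim)), np.zeros(n_classes)

probs = dan_forward([3, 17, 42], emb, w_hid, b_hid, w_out, b_out)
print(probs.shape, probs.sum())
```

Because the sentence is reduced to a single averaged vector, word order is ignored, which keeps the model small enough to train with a hand-written toolkit.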

Assignment Details

Important Notes:

  • There is a detailed description of the code structure in structure.md, including a description of which parts you will need to implement.
  • The only external libraries allowed are numpy and cupy.
  • We will run your code with the following commands, so make sure that your best results are reproducible using them (replace ANDREWID with your Andrew ID):
    • mkdir -p ANDREWID
    • python classifier.py --train=data/sst-train.txt --dev=data/sst-dev.txt --test=data/sst-test.txt --dev_out=ANDREWID/sst-dev-output.txt --test_out=ANDREWID/sst-test-output.txt
    • python classifier.py --train=data/cfimdb-train.txt --dev=data/cfimdb-dev.txt --test=data/cfimdb-test.txt --dev_out=ANDREWID/cfimdb-dev-output.txt --test_out=ANDREWID/cfimdb-test-output.txt
  • Reference accuracies: with our implementation and the default hyper-parameters, the mean (std) of accuracies over 10 different random seeds is dev=0.4045 (0.0070), test=0.4069 (0.0105) on sst, and dev=0.8792 (0.0084) on cfimdb. If you implement everything exactly as we did and use the default random seed and the same environment (python 3.8 + numpy 1.18 or 1.19), you may get accuracies of dev=0.4114, test=0.4253 on sst, and dev=0.8857 on cfimdb.
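
To compare your own runs against the reference numbers, it can help to compute accuracy and a mean (std) summary yourself. The sketch below is an illustration only: the helper names `accuracy` and `summarize` are hypothetical, and it assumes you already have gold and predicted labels as equal-length sequences. It also assumes the population standard deviation (numpy's default); the reference numbers may have been computed differently.

```python
import numpy as np

def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    if gold.shape != pred.shape:
        raise ValueError("gold and pred must have the same length")
    return float((gold == pred).mean())

def summarize(accs):
    """Format accuracies from several runs as 'mean(std)' with four decimals."""
    accs = np.asarray(accs, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0).
    return f"{accs.mean():.4f}({accs.std():.4f})"

# Toy inputs, not real results.
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
print(summarize([0.5, 0.7]))                 # 0.6000(0.1000)
```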

The submission file should be a zip file with the following structure (assuming your Andrew ID is ANDREWID):

  • ANDREWID/
  • ANDREWID/minnn.py # completed minnn.py
  • ANDREWID/classifier.py # completed classifier.py with any of your modifications
  • ANDREWID/sst-dev-output.txt # output of the dev set for SST data
  • ANDREWID/sst-test-output.txt # output of the test set for SST data
  • ANDREWID/cfimdb-dev-output.txt # output of the dev set for CFIMDB data
  • ANDREWID/cfimdb-test-output.txt # output of the test set for CFIMDB data
  • ANDREWID/report.pdf # (optional) report; here you can describe anything particularly new or interesting that you did
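
One way to assemble an archive with the layout listed above is Python's standard zipfile module. This is a sketch under stated assumptions, not an official packaging script: the function name `build_submission` is hypothetical, it assumes minnn.py, classifier.py and an optional report.pdf sit in the repository root, that the four output files were written to the ANDREWID/ directory, and that the archive is named ANDREWID.zip (the assignment text does not specify a name).

```python
import zipfile
from pathlib import Path

REQUIRED_OUTPUTS = [
    "sst-dev-output.txt", "sst-test-output.txt",
    "cfimdb-dev-output.txt", "cfimdb-test-output.txt",
]

def build_submission(andrew_id, repo_dir=".", zip_path=None):
    """Zip the required files under an ANDREWID/ prefix and return the zip path."""
    repo = Path(repo_dir)
    zip_path = Path(zip_path) if zip_path else repo / f"{andrew_id}.zip"
    # Pairs of (file on disk, name inside the archive).
    members = [(repo / n, f"{andrew_id}/{n}") for n in ("minnn.py", "classifier.py")]
    members += [(repo / andrew_id / n, f"{andrew_id}/{n}") for n in REQUIRED_OUTPUTS]
    missing = [str(src) for src, _ in members if not src.is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    report = repo / "report.pdf"            # optional, added only if present
    if report.is_file():
        members.append((report, f"{andrew_id}/report.pdf"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src, arcname in members:
            zf.write(src, arcname)
    return zip_path
```

Calling `build_submission("ANDREWID")` from the repository root after running the two classifier commands would then produce the archive, and raise an error if a required file is missing.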

Grading information:

  • A+: Submissions that implement something new and achieve particularly large accuracy improvements (e.g. 2% over the baseline on SST)
  • A: You additionally implement something else on top of the missing pieces. Some examples include:
    • Implementing another optimizer such as Adam
    • Incorporating pre-trained word embeddings, such as those from fasttext
    • Changing the model architecture significantly
  • A-: You implement all the missing pieces and the original classifier.py code achieves comparable accuracy to our reference implementation (about 41% on SST)
  • B+: All missing pieces are implemented, but accuracy is not comparable to the reference.
  • B or below: Some parts of the missing pieces are not implemented.
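
As an illustration of the "another optimizer such as Adam" option above, here is a minimal sketch of the standard Adam update rule in plain numpy. It is not tied to the minnn API: the class name `AdamSketch`, the list-of-arrays interface, and the toy objective are all assumptions, and a real implementation would need to follow whatever optimizer interface minnn.py defines.

```python
import numpy as np

class AdamSketch:
    """Adam update rule over a list of numpy parameter arrays (illustrative)."""

    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]  # first-moment estimates
        self.v = [np.zeros_like(p) for p in params]  # second-moment estimates
        self.t = 0                                   # number of steps taken

    def step(self, grads):
        """Update every parameter in place from its gradient."""
        self.t += 1
        for p, g, m, v in zip(self.params, grads, self.m, self.v):
            m *= self.beta1
            m += (1 - self.beta1) * g
            v *= self.beta2
            v += (1 - self.beta2) * g * g
            m_hat = m / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = v / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy problem: minimise f(x) = x^2, whose gradient is 2x.
x = np.array([5.0])
opt = AdamSketch([x], lr=0.1)
for _ in range(500):
    opt.step([2 * x])
print(abs(x[0]))
```

Because of the bias correction, the very first step moves each parameter by roughly `lr` in the direction opposite its gradient, regardless of the gradient's magnitude.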

References

Stanford Sentiment Treebank: https://www.aclweb.org/anthology/D13-1170.pdf

IMDb Reviews: https://openreview.net/pdf?id=Sklgs0NFvr
