Toward Model Interpretability in Medical NLP

Overview

Toward Model Interpretability in Medical NLP

LING380: Topics in Computational Linguistics Final Project James Cross ([email protected]) and Daniel Kim ([email protected]), December 2021

Code Organization

  • data: contains medical report data [LINK TO THAT REPO] used in model fine-tuning and analysis, clinical stop words, and saved accuracy and entropy metrics during evaluation
  • models: checkpoints of the best performing BERT and BioBERT models after hyperparameter optimization
  • notebooks:
  • model_training.ipynb: code to train and fine-tune BERT and BioBERT
  • model_evaluation.ipynb: code to run various model evaluations, visualize word importances, perform post-training clinical stopword masking, and other analyses
  • scripts: same functionality as in the notebooks, in executable python scripts / functions

Dependencies

All packages needed to run the code are available in the default Google Colab environment (see documentation for full list), with the exception of huggingface (transformers), used for loading transformer models, and captum.ai (captum), which provides access for a variety of model interpretation tools.

How to run code

Two options available to run the code; on Google colab and/or locally on your machine.

Option 1) Google Colab

Model training notebook: [https://colab.research.google.com/drive/1uPIi-OVchs_8A-SNcQtLfwelr0ccsz19?usp=sharing] Model evaluation/analysis notebook: [https://colab.research.google.com/drive/1Hfy58JvyPbx55lKKhQAzzrhJIbN_Io0j?usp=sharing]

Option 2) Local Machine

Notebooks: You can run the model_training.ipynb or model_evaluation.ipynb notebooks as is, changing directory paths when needed.

Owner
Junior @ Yale. Research Assistant in Turk-Browne Lab.
Leon is an open-source personal assistant who can live on your server.

Leon Your open-source personal assistant. Website :: Documentation :: Roadmap :: Contributing :: Story đź‘‹ Introduction Leon is an open-source personal

Leon AI 11.7k Dec 30, 2022
🏆 • 5050 most frequent words in 109 languages

🏆 Most Common Words Multilingual 5000 most frequent words in 109 languages. Uses wordfrequency.info as a source. 🔗 License source code license data

14 Nov 24, 2022
What are the best Systems? New Perspectives on NLP Benchmarking

What are the best Systems? New Perspectives on NLP Benchmarking In Machine Learning, a benchmark refers to an ensemble of datasets associated with one

Pierre Colombo 12 Nov 03, 2022
Curso práctico: NLP de cero a cien 🤗

Curso Práctico: NLP de cero a cien Comprende todos los conceptos y arquitecturas clave del estado del arte del NLP y aplícalos a casos prácticos utili

Somos NLP 147 Jan 06, 2023
đź’› Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

Kenneth Enevoldsen 71 Jan 06, 2023
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023
An open-source NLP library: fast text cleaning and preprocessing.

An open-source NLP library: fast text cleaning and preprocessing

Iaroslav 21 Mar 18, 2022
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Contains links to publicly available datasets for modeling health outcomes using speech and language.

speech-nlp-datasets Contains links to publicly available datasets for modeling various health outcomes using speech and language. Speech-based Corpora

Tuka Alhanai 77 Dec 07, 2022
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 02, 2023
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022