Text-Based Ideal Points

Source code for the paper: Text-Based Ideal Points by Keyon Vafa, Suresh Naidu, and David Blei (ACL 2020).

Update (June 29, 2020): We have added interactive visualizations of topics learned by our model.

Update (May 25, 2020): We have added a PyTorch implementation of the text-based ideal point model.

Update (May 11, 2020): See our Colab notebook to run the model online. The GitHub code is more complete and can be used to reproduce all of our experiments. However, the TBIP is fastest on GPU, so if you do not have access to one, you can use Colab's GPUs for free.

Installation for GPU

Configure a virtual environment using Python 3.6+ (instructions here). Inside the virtual environment, use pip to install the required packages:

(venv)$ pip install -r requirements.txt

The main dependencies are TensorFlow (1.14.0) and TensorFlow Probability (0.7.0).

Installation for CPU

To run on CPU, install a version of TensorFlow that does not use the GPU. In requirements.txt, comment out the line tensorflow-gpu==1.14.0 and uncomment the line tensorflow==1.14.0. Note: training will be noticeably slower on CPU.

Data

Preprocessed Senate speech data for the 114th Congress is included in data/senate-speeches-114. The original data is from [1]. Preprocessed 2020 Democratic presidential candidate tweet data is included in data/candidate-tweets-2020.

To use a custom data set, first create the directory data/{dataset_name}/clean/. The following four files must be inside this directory:

  • counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document.
  • author_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_authors - 1}, indicating the author of the corresponding document in counts.npz.
  • vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
  • author_map.txt: a [num_authors]-length file where each line denotes the name of an author in the corpus.

See data/senate-speeches-114/clean for an example of what the four files look like for Senate speeches. The script setup/senate_speeches_to_bag_of_words.py contains example code for creating the four files from unprocessed data.
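
As an illustration of the file layout described above, here is a minimal sketch that writes the four files for a toy corpus and checks that their sizes agree. It assumes that counts.npz is in the format written by scipy.sparse.save_npz and that float32 counts are acceptable; the helper save_tbip_inputs, the toy vocabulary, and the author names are hypothetical and not part of the repository.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_tbip_inputs(save_dir, counts, author_indices, vocabulary, author_map):
    """Write the four input files into save_dir after checking they agree."""
    # Assumption: counts.npz is a CSR matrix stored with scipy.sparse.save_npz.
    counts = sparse.csr_matrix(np.asarray(counts), dtype=np.float32)
    author_indices = np.asarray(author_indices, dtype=np.int32)
    num_documents, num_words = counts.shape
    if author_indices.shape != (num_documents,):
        raise ValueError("author_indices needs one entry per document")
    if len(vocabulary) != num_words:
        raise ValueError("vocabulary needs one line per column of counts")
    if author_indices.min() < 0 or author_indices.max() >= len(author_map):
        raise ValueError("author index outside {0, ..., num_authors - 1}")
    os.makedirs(save_dir, exist_ok=True)
    sparse.save_npz(os.path.join(save_dir, "counts.npz"), counts)
    np.save(os.path.join(save_dir, "author_indices.npy"), author_indices)
    with open(os.path.join(save_dir, "vocabulary.txt"), "w") as f:
        f.write("\n".join(vocabulary) + "\n")
    with open(os.path.join(save_dir, "author_map.txt"), "w") as f:
        f.write("\n".join(author_map) + "\n")


# Toy corpus: 3 documents, 3 words, 2 authors (all values are made up).
toy_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0]])
toy_dir = os.path.join(tempfile.mkdtemp(), "data", "toy-corpus", "clean")
save_tbip_inputs(toy_dir, toy_counts, [0, 1, 0],
                 ["tax", "health", "jobs"], ["Author A", "Author B"])
reloaded = sparse.load_npz(os.path.join(toy_dir, "counts.npz"))
print(reloaded.shape, sorted(os.listdir(toy_dir)))
```

The checks mirror the shape constraints listed above: one author index per document, one vocabulary line per column of the count matrix, and author indices that stay within the author map.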

Learning text-based ideal points

Run tbip.py to produce ideal points. For the Senate speech data, use the command:

(venv)$ python tbip.py  --data=senate-speeches-114  --batch_size=512  --max_steps=100000

You can open TensorBoard during training to view training summaries (including the learned ideal points and ideological topics). To run TensorBoard, use the command:

(venv)$ tensorboard  --logdir=data/senate-speeches-114/tbip-fits/  --port=6006

The command should output a link where you can view the TensorBoard results in real time. The fitted parameters will be stored in data/senate-speeches-114/tbip-fits/params. To perform the above analyses for the 2020 Democratic candidate tweets, replace senate-speeches-114 with candidate-tweets-2020.
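
To inspect the fitted parameters after training, something like the following sketch may help. It assumes the params directory contains NumPy .npy files, which this README does not state, and it makes no assumption about the individual file names. The function summarize_params and the demo file example_param.npy are hypothetical.

```python
import glob
import os
import tempfile

import numpy as np


def summarize_params(param_dir):
    """Return {file name: array shape} for every .npy file in param_dir."""
    shapes = {}
    for path in sorted(glob.glob(os.path.join(param_dir, "*.npy"))):
        shapes[os.path.basename(path)] = np.load(path).shape
    return shapes


# Demo on a temporary directory with a made-up parameter file.
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "example_param.npy"), np.zeros(4))
print(summarize_params(demo_dir))
```

For a real run, pass the directory named above, for example summarize_params("data/senate-speeches-114/tbip-fits/params").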

For custom data, we recommend fitting Poisson factorization before running the TBIP script, since the TBIP is initialized from its results by default. If you have custom data stored in data/{dataset_name}/clean/, you can run

(venv)$ python setup/poisson_factorization.py  --data={dataset_name}

The default number of topics is 50. To use a different number of topics, e.g. 100, use the flag --num_topics=100. After Poisson factorization finishes, use the following command to run the TBIP:

(venv)$ python tbip.py  --data={dataset_name}

You can adjust the batch size, learning rate, number of topics, and number of steps by using the flags --batch_size, --learning_rate, --num_topics, and --max_steps, respectively. To run the TBIP without initializing from Poisson factorization, use the flag --pre_initialize_parameters=False. To view the results in TensorBoard, run

(venv)$ tensorboard  --logdir=data/{dataset_name}/tbip-fits/  --port=6006

Again, the learned parameters will be stored in data/{dataset_name}/tbip-fits/params.
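
The flags described in this section can be assembled programmatically, for instance when trying several settings. The sketch below only builds the command lines and does not run them. It assumes that the same --num_topics value should be passed to both scripts so that the Poisson factorization initialization matches the TBIP; the helper tbip_commands and the data set name my-corpus are hypothetical.

```python
def tbip_commands(dataset_name, num_topics=50, pre_initialize=True,
                  batch_size=None, learning_rate=None, max_steps=None):
    """Return the command lines for one TBIP run as lists of arguments."""
    commands = []
    if pre_initialize:
        # Fit Poisson factorization first so the TBIP can initialize from it.
        commands.append(["python", "setup/poisson_factorization.py",
                         f"--data={dataset_name}",
                         f"--num_topics={num_topics}"])
    tbip = ["python", "tbip.py", f"--data={dataset_name}",
            f"--num_topics={num_topics}"]
    optional = {"batch_size": batch_size,
                "learning_rate": learning_rate,
                "max_steps": max_steps}
    for flag, value in optional.items():
        if value is not None:
            tbip.append(f"--{flag}={value}")
    if not pre_initialize:
        tbip.append("--pre_initialize_parameters=False")
    commands.append(tbip)
    return commands


for command in tbip_commands("my-corpus", num_topics=100, batch_size=256):
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) from the repository root, inside the virtual environment.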

Reproducing Paper Results

NOTE: Since the publication of our paper, we have made small changes to the code that speed up inference. A byproduct of these changes is that the TensorFlow graph has changed, so the same random seed no longer produces the same results as before, even though the data, model, and inference are all the same. To reproduce the exact paper results, check out the version of our repository from before these changes:

(venv)$ git checkout 31d161e

The commands below will reproduce all of the paper results. The following data is required before running the commands:

  • Senate votes: The original raw data can be found at [2]. The paper includes experiments for Senate sessions 111-114. For each Senate session, we need three files: one for votes, one for members, and one for rollcalls. For example, for Senate session 114, we would use the files S114_votes.csv, S114_members.csv, and S114_rollcalls.csv. Create a directory data/senate-votes and store these three files in data/senate-votes/114/raw/. Repeat for Senate sessions 111-113.
  • Senate speeches: The original raw data can be found at [1]. Specifically, we use the hein-daily data for the 114th Senate session. The files needed are speeches_114.txt, descr_114.txt, and 114_SpeakerMap.txt. Make sure the relevant files are stored in data/senate-speeches-114/raw/.
  • Senator tweets: The data was provided to us by Voxgov [3].
  • Senate speech comparisons: We use a separate data set for the Senate speech comparisons because speech debates must be labeled for Wordshoal. The raw data can be found at [4]. The paper includes experiments for Senate sessions 111-113. We need the files speaker_senator_link_file.csv, speeches_Senate_111.tab, speeches_Senate_112.tab, and speeches_Senate_113.tab. These files should all be stored in data/senate-speech-comparisons/raw/.
  • Democratic presidential candidate tweets: Download the raw tweets here and store tweets.csv in the folder data/candidate-tweets-2020/raw/.
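
Before running the commands, it can be useful to check that the raw files listed above are in place. The following is a minimal sketch using only the paths and file names given in the list. It assumes that sessions 111-113 follow the same naming pattern as session 114 for the vote files, and it skips the Senator tweets because no file name is given for them. The function names are hypothetical.

```python
from pathlib import Path


def required_raw_files(data_root="data"):
    """List the raw input files named in the data requirements."""
    root = Path(data_root)
    files = []
    # Assumption: sessions 111-113 mirror the S114_* naming pattern.
    for session in (111, 112, 113, 114):
        raw = root / "senate-votes" / str(session) / "raw"
        for kind in ("votes", "members", "rollcalls"):
            files.append(raw / f"S{session}_{kind}.csv")
    speeches = root / "senate-speeches-114" / "raw"
    for name in ("speeches_114.txt", "descr_114.txt", "114_SpeakerMap.txt"):
        files.append(speeches / name)
    comparisons = root / "senate-speech-comparisons" / "raw"
    for name in ("speaker_senator_link_file.csv", "speeches_Senate_111.tab",
                 "speeches_Senate_112.tab", "speeches_Senate_113.tab"):
        files.append(comparisons / name)
    files.append(root / "candidate-tweets-2020" / "raw" / "tweets.csv")
    return files


def missing_raw_files(data_root="data"):
    """Return the required files that do not exist under data_root."""
    return [path for path in required_raw_files(data_root)
            if not path.is_file()]


for path in missing_raw_files():
    print("missing:", path)
```

Run it from the repository root. If nothing is printed, every file in the list was found.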

Preprocess, run the vote ideal point model, and perform analysis for Senate votes

(venv)$ python setup/preprocess_senate_votes.py  --senate_session=111
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=112
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=113
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=114
(venv)$ python setup/vote_ideal_points.py  --senate_session=111
(venv)$ python setup/vote_ideal_points.py  --senate_session=112
(venv)$ python setup/vote_ideal_points.py  --senate_session=113
(venv)$ python setup/vote_ideal_points.py  --senate_session=114
(venv)$ python analysis/analyze_vote_ideal_points.py
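
The nine commands above repeat the same two scripts for each session, so they can also be generated in a loop. This sketch reproduces that command list in the same order and prints it. It only executes the commands when dry_run=False is passed. The function names are hypothetical.

```python
import subprocess

SESSIONS = (111, 112, 113, 114)


def vote_pipeline_commands(sessions=SESSIONS):
    """Build the preprocessing, model, and analysis commands in order."""
    commands = []
    for script in ("setup/preprocess_senate_votes.py",
                   "setup/vote_ideal_points.py"):
        for session in sessions:
            commands.append(["python", script, f"--senate_session={session}"])
    commands.append(["python", "analysis/analyze_vote_ideal_points.py"])
    return commands


def run_all(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


run_all(vote_pipeline_commands())
```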

Preprocess, run the TBIP, and perform analysis for Senate speeches for the 114th Senate

(venv)$ python setup/senate_speeches_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-speeches-114
(venv)$ python tbip.py  --data=senate-speeches-114  --counts_transformation=log  --batch_size=512  --max_steps=150000
(venv)$ python analysis/analyze_senate_speeches.py

Preprocess, run the TBIP and Wordfish, and perform analysis for tweets from senators during the 114th Senate

(venv)$ python setup/senate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-tweets-114
(venv)$ python tbip.py  --data=senate-tweets-114  --batch_size=1024  --max_steps=100000
(venv)$ python model_comparison/wordfish.py  --data=senate-tweets-114  --max_steps=50000
(venv)$ python analysis/analyze_senate_tweets.py

Preprocess and run the TBIP for Senate speech comparisons

(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=111
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=112
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=113
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=111
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=112
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=113
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=111  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=112  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=113  --batch_size=128

Run Wordfish for Senate speech comparisons

(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=111
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=112
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=113

Run Wordshoal for Senate speech comparisons

(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=111  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=112  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=113  --batch_size=1024

Analyze results for Senate speech comparisons

(venv)$ python analysis/compare_tbip_wordfish_wordshoal.py

Preprocess, run the TBIP, and perform analysis for Democratic candidate tweets

(venv)$ python setup/candidate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=candidate-tweets-2020
(venv)$ python tbip.py  --data=candidate-tweets-2020  --batch_size=1024  --max_steps=100000
(venv)$ python analysis/analyze_candidate_tweets.py

Make figures

(venv)$ python analysis/make_figures.py

References

[1] Matthew Gentzkow, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

[2] Jeffrey B. Lewis, Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2020). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/

[3] VoxGovFEDERAL, U.S. Senators tweets from the 114th Congress. 2020. https://voxgov.com

[4] Benjamin E. Lauderdale and Alexander Herzog. Replication Data for: Measuring Political Positions from Legislative Speech. In Harvard Dataverse, 2016. https://doi.org/10.7910/DVN/RQMIV3
