Text-Based Ideal Points

Related tags

Deep Learningtbip
Overview

Text-Based Ideal Points

Source code for the paper: Text-Based Ideal Points by Keyon Vafa, Suresh Naidu, and David Blei (ACL 2020).

Update (June 29, 2020): We have added interactive visualizations of topics learned by our model.

Update (May 25, 2020): We have added a PyTorch implementation of the text-based ideal point model.

Update (May 11, 2020): See our Colab notebook to run the model online. Our Github code is more complete, and it can be used to reproduce all of our experiments. However, the TBIP is fastest on GPU, so if you do not have access to a GPU you can use Colab's GPUs for free.

Installation for GPU

Configure a virtual environment using Python 3.6+ (instructions here). Inside the virtual environment, use pip to install the required packages:

(venv)$ pip install -r requirements.txt

The main dependencies are Tensorflow (1.14.0) and Tensorflow Probability (0.7.0).

Installation for CPU

To run on CPU, a version of Tensorflow that does not use GPU must be installed. In requirements.txt, comment out the line that says tensorflow-gpu==1.14.0 and uncomment the line that says tensorflow==1.14.0. Note: the script will be noticeably slower on CPU.

Data

Preprocessed Senate speech data for the 114th Congress is included in data/senate-speeches-114. The original data is from [1]. Preprocessed 2020 Democratic presidential candidate tweet data is included in data/candidate-tweets-2020.

To include a customized data set, first create a repo data/{dataset_name}/clean/. The following four files must be inside this folder:

  • counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document.
  • author_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_authors - 1}, indicating the author of the corresponding document in counts.npz.
  • vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
  • author_map.txt: a [num_authors]-length file where each line denotes the name of an author in the corpus.

See data/senate-speeches-114/clean for an example of what the four files look like for Senate speeches. The script setup/senate_speeches_to_bag_of_words.py contains example code for creating the four files from unprocessed data.

Learning text-based ideal points

Run tbip.py to produce ideal points. For the Senate speech data, use the command:

(venv)$ python tbip.py  --data=senate-speeches-114  --batch_size=512  --max_steps=100000

You can view Tensorboard while training to see summaries of training (including the learned ideal points and ideological topics). To run Tensorboard, use the command:

(venv)$ tensorboard  --logdir=data/senate-speeches-114/tbip-fits/  --port=6006

The command should output a link where you can view the Tensorboard results in real time. The fitted parameters will be stored in data/senate-speeches-114/tbip-fits/params. To perform the above analyses for the 2020 Democratic candidate tweets, replace senate-speeches-114 with candidate-tweets-2020.

To run custom data, we recommend training Poisson factorization before running the TBIP script for best results. If you have custom data stored in data/{dataset_name}/clean/, you can run

(venv)$ python setup/poisson_factorization.py  --data={dataset_name}

The default number of topics is 50. To use a different number of topics, e.g. 100, use the flag --num_topics=100. After Poisson factorization finishes, use the following command to run the TBIP:

(venv)$ python tbip.py  --data={dataset_name}

You can adjust the batch size, learning rate, number of topics, and number of steps by using the flags --batch_size, --learning_rate, --num_topics, and --max_steps, respectively. To run the TBIP without initializing from Poisson factorization, use the flag --pre_initialize_parameters=False. To view the results in Tensorboard, run

(venv)$ tensorboard  --logdir=data/{dataset_name}/tbip-fits/  --port=6006

Again, the learned parameters will be stored in data/{dataset_name}/tbip-fits/params.

Reproducing Paper Results

NOTE: Since the publication of our paper, we have made small changes to the code that have sped up inference. A byproduct of these changes is that the Tensorflow graph has changed, so its random seed does not produce the same results as before the changes, even though the data, model, and inference are all the same. To reproduce the exact paper results, one must git checkout to a version of our repository from before these changes:

(venv)$ git checkout 31d161e

The commands below will reproduce all of the paper results. The following data is required before running the commands:

  • Senate votes: The original raw data can be found at [2]. The paper includes experiments for Senate sessions 111-114. For each Senate session, we need three files: one for votes, one for members, and one for rollcalls. For example, for Senate session 114, we would use the files: S114_votes.csv, S114_members.csv, S114_rollcalls.csv. Make a repo data/senate-votes and store these three files in data/senate-votes/114/raw/. Repeat for Senate sessions 111-113.
  • Senate speeches: The original raw data can be found at [1]. Specifically, we use the hein-daily data for the 114th Senate session. The files needed are speeches_114.txt, descr_114.txt, and 114_SpeakerMap.txt. Make sure the relevant files are stored in data/senate-speeches-114/raw/.
  • Senator tweets: The data was provided to us by Voxgov [3].
  • Senate speech comparisons: We use a separate data set for the Senate speech comparisons because speech debates must be labeled for Wordshoal. The raw data can be found at [4]. The paper includes experiments for Senate sessions 111-113. We need the files speaker_senator_link_file.csv, speeches_Senate_111.tab, speeches_Senate_112.tab, and speeches_Senate_113.tab. These files should all be stored in data/senate-speech-comparisons/raw/.
  • Democratic presidential candidate tweets: Download the raw tweets here and store tweets.csv in the folder data/candidate-tweets-2020/raw/.

Preprocess, run vote ideal point model, and perform analysis for Senate votes

(venv)$ python setup/preprocess_senate_votes.py  --senate_session=111
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=112
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=113
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=114
(venv)$ python setup/vote_ideal_points.py  --senate_session=111
(venv)$ python setup/vote_ideal_points.py  --senate_session=112
(venv)$ python setup/vote_ideal_points.py  --senate_session=113
(venv)$ python setup/vote_ideal_points.py  --senate_session=114
(venv)$ python analysis/analyze_vote_ideal_points.py

Preprocess, run the TBIP, and perform analysis for Senate speeches for the 114th Senate

(venv)$ python setup/senate_speeches_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-speeches-114
(venv)$ python tbip.py  --data=senate-speeches-114  --counts_transformation=log  --batch_size=512  --max_steps=150000
(venv)$ python analysis/analyze_senate_speeches.py

Preprocess, run the TBIP and Wordfish, and perform analysis for tweets from senators during the 114th Senate

(venv)$ python setup/senate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-tweets-114
(venv)$ python tbip.py  --data=senate-tweets-114  --batch_size=1024  --max_steps=100000
(venv)$ python model_comparison/wordfish.py  --data=senate-tweets-114  --max_steps=50000
(venv)$ python analysis/analyze_senate_tweets.py

Preprocess and run the TBIP for Senate speech comparisons

(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=111
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=112
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=113
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=111
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=112
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=113
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=111  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=112  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=113  --batch_size=128

Run Wordfish for Senate speech comparisons

(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=111
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=112 
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=113

Run Wordshoal for Senate speech comparisons

(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=111  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=112  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=113  --batch_size=1024

Analyze results for Senate speech comparisons

(venv)$ python analysis/compare_tbip_wordfish_wordshoal.py

Preprocess, run the TBIP, and perform analysis for Democratic candidate tweets

(venv)$ python setup/candidate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=candidate-tweets-2020
(venv)$ python tbip.py  --data=candidate-tweets-2020  --batch_size=1024  --max_steps=100000
(venv)$ python analysis/analyze_candidate_tweets.py

Make figures

(venv)$ python analysis/make_figures.py

References

[1] Matthew Gentzkow, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

[2] Jeffrey B. Lewis, Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2020). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/

[3] VoxGovFEDERAL, U.S. Senators tweets from the 114th Congress. 2020. https://voxgov.com

[4] Benjamin E. Lauderdale and Alexander Herzog. Replication Data for: Measuring Political Positions from Legislative Speech. In Harvard Dataverse, 2016. https://doi.org/10.7910/DVN/RQMIV3

Owner
Keyon Vafa
Keyon Vafa
Self-supervised spatio-spectro-temporal represenation learning for EEG analysis

EEG-Oriented Self-Supervised Learning and Cluster-Aware Adaptation This repository provides a tensorflow implementation of a submitted paper: EEG-Orie

Wonjun Ko 4 Jun 09, 2022
Exploit ILP to learn symmetry breaking constraints of ASP programs.

ILP Symmetry Breaking Overview This project aims to exploit inductive logic programming to lift symmetry breaking constraints of ASP programs. Given a

Research Group Production Systems 1 Apr 13, 2022
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Dec 30, 2022
🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥

face.evoLVe: High-Performance Face Recognition Library based on PaddlePaddle & PyTorch Evolve to be more comprehensive, effective and efficient for fa

Zhao Jian 3.1k Jan 04, 2023
code for the ICLR'22 paper: On Robust Prefix-Tuning for Text Classification

On Robust Prefix-Tuning for Text Classification Prefix-tuning has drawed much attention as it is a parameter-efficient and modular alternative to adap

Zonghan Yang 12 Nov 30, 2022
Building blocks for uncertainty-aware cycle consistency presented at NeurIPS'21.

UncertaintyAwareCycleConsistency This repository provides the building blocks and the API for the work presented in the NeurIPS'21 paper Robustness vi

EML Tübingen 19 Dec 12, 2022
RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation

RIFE RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation Ported from https://github.com/hzwer/arXiv2020-RIFE Dependencies NumPy

49 Jan 07, 2023
A PyTorch-based R-YOLOv4 implementation which combines YOLOv4 model and loss function from R3Det for arbitrary oriented object detection.

R-YOLOv4 This is a PyTorch-based R-YOLOv4 implementation which combines YOLOv4 model and loss function from R3Det for arbitrary oriented object detect

94 Dec 03, 2022
Code for "Steerable Pyramid Transform Enables Robust Left Ventricle Quantification"

Code for "Steerable Pyramid Transform Enables Robust Left Ventricle Quantification" This is an end-to-end framework for accurate and robust left ventr

2 Jul 09, 2022
Implementation of ToeplitzLDA for spatiotemporal stationary time series data.

Code for the ToeplitzLDA classifier proposed in here. The classifier conforms sklearn and can be used as a drop-in replacement for other LDA classifiers. For in-depth usage refer to the learning from

Jan Sosulski 5 Nov 07, 2022
Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information"

Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information" Notes I probabl

Berkeley Expert System Technologies Lab 0 Jul 01, 2021
We present a regularized self-labeling approach to improve the generalization and robustness properties of fine-tuning.

Overview This repository provides the implementation for the paper "Improved Regularization and Robustness for Fine-tuning in Neural Networks", which

NEU-StatsML-Research 21 Sep 08, 2022
Trading Gym is an open source project for the development of reinforcement learning algorithms in the context of trading.

Trading Gym Trading Gym is an open-source project for the development of reinforcement learning algorithms in the context of trading. It is currently

Dimitry Foures 535 Nov 15, 2022
Code for "LoRA: Low-Rank Adaptation of Large Language Models"

LoRA: Low-Rank Adaptation of Large Language Models This repo contains the implementation of LoRA in GPT-2 and steps to replicate the results in our re

Microsoft 394 Jan 08, 2023
PyTorch implementation of the cross-modality generative model that synthesizes dance from music.

Dancing to Music PyTorch implementation of the cross-modality generative model that synthesizes dance from music. Paper Hsin-Ying Lee, Xiaodong Yang,

NVIDIA Research Projects 485 Dec 26, 2022
Official repository for CVPR21 paper "Deep Stable Learning for Out-Of-Distribution Generalization".

StableNet StableNet is a deep stable learning method for out-of-distribution generalization. This is the official repo for CVPR21 paper "Deep Stable L

120 Dec 28, 2022
Adversarial Adaptation with Distillation for BERT Unsupervised Domain Adaptation

Knowledge Distillation for BERT Unsupervised Domain Adaptation Official PyTorch implementation | Paper Abstract A pre-trained language model, BERT, ha

Minho Ryu 29 Nov 30, 2022
ManipulaTHOR, a framework that facilitates visual manipulation of objects using a robotic arm

ManipulaTHOR: A Framework for Visual Object Manipulation Kiana Ehsani, Winson Han, Alvaro Herrasti, Eli VanderBilt, Luca Weihs, Eric Kolve, Aniruddha

AI2 65 Dec 30, 2022
Joint parameterization and fitting of stroke clusters

StrokeStrip: Joint Parameterization and Fitting of Stroke Clusters Dave Pagurek van Mossel1, Chenxi Liu1, Nicholas Vining1,2, Mikhail Bessmeltsev3, Al

Dave Pagurek 44 Dec 01, 2022
Dynamic View Synthesis from Dynamic Monocular Video

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer This repository contains code to compute depth from a

Intelligent Systems Lab Org 2.3k Jan 01, 2023