Using machine learning to predict and analyze high and low reader engagement for New York Times articles posted to Facebook.

Overview

How The New York Times can increase Engagement on Facebook

Using machine learning to understand characteristics of news content that garners "high" Facebook engagement


Author: Jessica Miles - [email protected]


In this repository, I used machine learning to understand the characteristics of Facebook content posted by The New York Times that lead to high user engagement. The analysis includes Natural Language Processing (NLP) of the post and article metadata as well as categorical features from both the posts and original articles.

My model was ultimately not accurate enough for me to recommend using it as a "black box" deciding which articles to post to Facebook. Instead, I used its coefficients to determine the most important keywords and topics among high versus low engagement posts, then grouped similar topics and keywords into themes to form recommendations. I believe this approach produced generalized results that are more likely to remain useful over time than a focus on specific high engagement topics that occur only once, such as "the 2016 Presidential Election."

Business Problem

Modern Americans consume news in multiple formats: in print, browsing and searching websites online, and on social media. To remain relevant in modern times, news organizations need to be able to engage users on social media platforms such as Facebook as well as using traditional print and web methods.

However, news outlets may produce far more content than can reasonably be posted to such platforms, so they need a methodology to decide what content formats and topics will be most successful. The type of content Facebook users engage with most may differ from what is run on the front page of the printed paper, so it's prudent to analyze user engagement with Facebook content as a standalone exercise.

One criterion Facebook's News Feed algorithm uses to prioritize content's visibility to users is the amount of initial engagement (shares, comments, and likes) on a given post. Higher prioritization in News Feed may help content be disseminated to a wider audience, some of whom may decide to become subscribers.

It's important to note that while this analysis focuses on increasing engagement, I do not advocate for engagement level to be the only consideration in deciding what to post. Just because people enjoy sharing recipes and reading pieces about animals doesn't mean The Times should unequivocally prioritize those topics over pieces related to general politics or the economy. However, understanding the patterns behind high Facebook engagement could be included as one factor of many in the ultimate editorial decision.

Data

I started with a found dataset of about 48,000 Facebook posts from The New York Times' account covering the time period from late 2012 to late 2016. Data used in the analysis included the text in the post, when it was posted, and post type (link, video, or photo).

I also used the NYT API to pull all article metadata from this time period, and went through several steps to match the content in the Facebook posts to their original articles and multimedia features. The data I retrieved from the NYT API included article headline and abstract as well as metadata such as the news desk the article came from, topical subjects and other entities mentioned in the articles, word count, and format (written versus multimedia).

Of the original 48,000 Facebook posts, I was only able to match about 43,000 to articles. There were some challenges in performing the matching due to differences between the Facebook post text and links and the current article abstracts and links; the links in the Facebook posts were often shortened, and even once expanded did not always match directly to a current article. Therefore, I modeled the features derived from all Facebook posts as a separate data set from the features derived from matched articles.

Methods

Engagement metrics included number of comments, shares, and likes/loves. Rather than focus on each of these separately, I created a single engagement metric.

  • First, I calculated the percentile for each separate metric for each post
  • Then, I calculated the mean of percentiles across the three metrics to act as a single engagement metric for that post
  • For my binary classification problem, posts with a mean percentile above the 75th were considered "high engagement" and those at or below the 75th were considered "low engagement".
  • I also engineered a multi-class target using the same criteria for "high", but splitting the rest into "low" (below 25th percentile) and "moderate" (25th to 75th percentile).
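
The steps above can be sketched with pandas as follows. This is a minimal illustration rather than the notebook's actual code: the toy data and column names are hypothetical, the cutoff is assumed to apply directly to the mean percentile score, and the moderate band is taken as the 25th to 75th percentile, matching the Model Performance discussion.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not the dataset's schema.
posts = pd.DataFrame({
    "comments": [5, 40, 12, 300, 8, 75, 20, 150],
    "shares":   [2, 60, 10, 500, 1, 90, 15, 200],
    "likes":    [30, 400, 90, 5000, 20, 800, 120, 1500],
})
metrics = ["comments", "shares", "likes"]

# Percentile rank (between 0 and 1) of each post within each metric.
percentiles = posts[metrics].rank(pct=True)

# The mean of the three percentile ranks acts as the single engagement score.
posts["engagement"] = percentiles.mean(axis=1)

# Binary target: assumed here to mean a score strictly above 0.75.
posts["high"] = posts["engagement"] > 0.75

# Multi-class target: low (up to 0.25), moderate (0.25 to 0.75), high (above 0.75).
posts["engagement_class"] = pd.cut(
    posts["engagement"],
    bins=[0.0, 0.25, 0.75, 1.0],
    labels=["low", "moderate", "high"],
    include_lowest=True,
)

print(posts[["engagement", "high", "engagement_class"]])
```

Averaging percentile ranks rather than raw counts keeps any one metric with very large values, such as likes, from dominating the combined score.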

Below are the distributions for each engagement metric. Note: Histograms don't include outliers for visibility, but percentiles are calculated with outliers included.

Comments Distribution (All Posts)

Shares Distribution (All Posts)

Likes Distribution (All Posts)

The distributions of all Facebook posts and the smaller subset of posts matched to articles were quite similar. Although I modeled them separately, I used results from both sets of models in my final recommendations.

There appeared to be a slight uptick in engagement when posts were made on weekends, so I engineered a categorical variable for that.

I also engineered a categorical variable for time of day the post was made, as posts made in the morning and evening appeared to get more engagement.
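
A minimal sketch of these two engineered features is shown below. The function names and bin edges are assumptions for illustration; only the 7 PM to 11 PM evening window appears in this README, and timestamps are assumed to already be in the time zone of interest.

```python
from datetime import datetime


def is_weekend(ts: datetime) -> bool:
    """Return True for Saturday or Sunday (Monday is weekday 0)."""
    return ts.weekday() >= 5


def time_of_day(ts: datetime) -> str:
    """Bucket a timestamp into a coarse time-of-day category.

    The evening window (7 PM to 11 PM) follows the recommendations section;
    the remaining bin edges are hypothetical.
    """
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 19:
        return "afternoon"
    if 19 <= hour < 23:
        return "evening"
    return "night"


posted = datetime(2016, 11, 5, 20, 30)  # a Saturday evening
print(is_weekend(posted), time_of_day(posted))
```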

When I visualized the most frequent words for high versus low engagement, I saw that words related to the 2016 presidential candidates were more common among high engagement posts than on average. Otherwise, the two sets of words looked quite similar.

Word Frequencies for Low Engagement

Word Frequencies for High Engagement

I also noticed that some high engagement posts posed questions to users and asked for their opinions. To keep those signals in the text, I created custom stopword lists that omit permutations of the word "you" and a custom punctuation list that omits the question mark, then tested their performance against the full stopword and punctuation lists.
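
The idea can be sketched in plain Python as below. The short base stopword list and the `tokenize` helper are hypothetical stand-ins (the analysis itself used NLTK's stopwords); the point is only that dropping the "you" forms and '?' from the removal lists lets them survive as features.

```python
import string

# Tiny illustrative base list; an assumption standing in for NLTK's English stopwords.
BASE_STOPWORDS = {
    "a", "an", "the", "is", "of", "to", "do", "what", "this", "about",
    "you", "your", "yours", "yourself", "yourselves",
}
YOU_FORMS = {"you", "your", "yours", "yourself", "yourselves"}

# Custom lists: keep the "you" forms and the question mark in the text.
custom_stopwords = BASE_STOPWORDS - YOU_FORMS
custom_punctuation = set(string.punctuation) - {"?"}


def tokenize(text, stopwords, punctuation):
    """Lowercase, split punctuation into its own tokens, then filter."""
    spaced = "".join(
        f" {ch} " if ch in string.punctuation else ch for ch in text.lower()
    )
    return [
        tok for tok in spaced.split()
        if tok not in stopwords and tok not in punctuation
    ]


post = "What do you think of this?"
full = tokenize(post, BASE_STOPWORDS, set(string.punctuation))
custom = tokenize(post, custom_stopwords, custom_punctuation)
print(full)
print(custom)
```

With the full lists only "think" survives, while the custom lists also retain "you" and "?" for the vectorizer to count.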

I used sklearn's GridSearchCV combined with pipelines and other transformer classes to iterate over several different NLP preprocessing approaches, as well as Logistic Regression and Naive Bayes model hyperparameters. The preprocessing steps I tested included:

  • Text vectorization strategy: count, binary, or count + TF-IDF normalization
  • Removing different sets of stopwords and punctuation
  • Uni-grams, bi-grams, or both
  • Applying lemmatization (with POS tags) or not
  • Varying number of max_features to be used in modeling
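
A search of this kind might be set up roughly as follows. This is a sketch under stated assumptions, not the grid used in the notebooks: the toy corpus, step names, parameter values, and the choice of recall as the scoring metric are all illustrative, and lemmatization is omitted to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = high engagement, 0 = low engagement.
texts = [
    "Breaking news: president speaks to Americans tonight",
    "Breaking news: election results announced for president",
    "What do you think about the election debate?",
    "Americans react to breaking election news",
    "President addresses the nation after breaking news",
    "Do you agree with the president on this?",
    "New York Today: subway delays and weather",
    "Quotation of the day from city council",
    "Daily briefing: what you need to know",
    "Watch this video about local city parks",
    "New York Today: weekend events in the city",
    "Daily briefing: markets and weather update",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Preprocessing options shared by both model families.
shared = {
    "vect__binary": [False, True],                    # count vs. binary vectors
    "vect__ngram_range": [(1, 1), (2, 2), (1, 2)],    # uni-grams, bi-grams, both
    "vect__stop_words": [None, "english"],            # stopword sets to compare
    "vect__max_features": [50, 200],                  # vocabulary size cap
    "tfidf": [TfidfTransformer(), "passthrough"],     # with or without TF-IDF
}

# One grid per model family so each gets its own hyperparameters.
param_grid = [
    {**shared, "clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
    {**shared, "clf": [MultinomialNB()], "clf__alpha": [0.5, 1.0]},
]

search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means the vocabulary and TF-IDF weights are re-fit on each training fold, so the cross-validated score is not inflated by information from the validation fold.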

Model Performance

I modeled both a binary and multi-class problem.

The binary performed slightly better on High Engagement, but the multi-class was interesting for what it showed about how the model tended to get confused.

The best binary classification model was able to identify about 62% of high engagement posts correctly (score is cross-validated, and on unseen test data).

Preprocessing and model parameters for the best model were as follows:

  • Removed NLTK stopwords and punctuation, with the exception of '?' and permutations of the word 'your'
  • TF-IDF normalization on word vectors consisting of both uni-grams and bi-grams
  • Maximum 2000 features (both word vectors and topics)
  • Logistic Regression model using L2 regularization, no intercept fitted

Binary Confusion Matrix on Test Data

Multi-Class Confusion Matrix on Test Data

Distributions of all three engagement metrics were very right-skewed: they had many outliers on the high end and tapered off very smoothly. I ultimately chose the 75th percentile as the cutoff for "high engagement", but there was no obvious cutoff point. Earlier in my analysis, I considered only outliers (using IQR * 1.5) to be high engagement, but those models did not perform as well. I believe it's natural that the model would be confused about posts towards the middle of the distribution regardless of where the cutoff point is drawn. The multi-class model confirms this, as it performed most poorly on the Moderate Engagement middle class, which represents the 25th through 75th percentiles.
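
To illustrate how the two candidate cutoffs differ on a right-skewed metric, the sketch below compares them on synthetic data. The log-normal sample is an assumption standing in for a real metric such as shares, and the outlier rule is assumed to be the conventional upper fence of the third quartile plus 1.5 times the IQR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for one engagement metric.
shares = rng.lognormal(mean=4.0, sigma=1.2, size=10_000)

q1, q3 = np.percentile(shares, [25, 75])
iqr = q3 - q1

# Conventional upper outlier fence (assumed form of the "IQR * 1.5" rule).
outlier_cutoff = q3 + 1.5 * iqr

high_by_percentile = shares > q3
high_by_outlier = shares > outlier_cutoff

print(f"75th percentile cutoff: {q3:.1f}")
print(f"Outlier cutoff:         {outlier_cutoff:.1f}")
print(f"Share of posts labeled high (percentile): {high_by_percentile.mean():.3f}")
print(f"Share of posts labeled high (outlier):    {high_by_outlier.mean():.3f}")
```

The outlier rule labels a noticeably smaller fraction of posts as high engagement, which leaves the classifier with a more imbalanced positive class to learn from.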

Recommendations:

To select features to factor into recommendations, I examined the predictors that had the greatest odds ratios for High and Low engagement. After using a held-out test set to evaluate performance on unseen data, I trained the model on 10 random splits (without replacement), each consisting of 90% of the entire data set, and calculated the mean odds ratio for each predictor. TF-IDF scores and the top 2000 uni- and bi-grams were re-calculated in each split. Feature importances represented as odds ratios were selected based on the aggregated mean across all 10 splits, and the standard error is shown on all graphs of odds ratios.

To group features into categories, I reviewed the top 300 predictors of both High and Low for Facebook post uni-grams and bi-grams as well as original article subjects, for a total of about 1,200 words and topics that I categorized. I made several passes through the list of features, assigning logical categories to words and subjects that seemed sufficiently unambiguous in their meaning. Several passes allowed me to refine the categories. Not all words and subjects were categorized; only those I recognized as common. Categories that tended more towards either high or low engagement were considered in the running for final recommendations.

1. Prioritize Breaking News over Recurring Content

  • Breaking News was one of the top predictors of high engagement
  • In contrast, posts that represented daily or recurring features tended to have lower engagement. Examples include: Quotation of the Day, New York Today, Daily Briefing ("Here's what you need to know to start your day"), What You Should Watch This Week

2. Focus on the Current President and Election over General Politics

  • Topics related to the candidates in the 2016 presidential election, and the current president and first lady at the time, were highly engaging.
  • However, topics related to general politics and government were less engaging.

3. Prioritize U.S. National content over U.S. Local and Foreign

  • Posts that mention 'America', 'Americans' and 'American', as well as patriotic themes such as the national anthem and flags, are highly engaging.
  • Posts with words that seem more local to certain places are less engaging, as is most foreign coverage.

4. Post More Multimedia Content Outside Subscriber Paywall

  • Video and Photo post types are those where the photos or videos were uploaded to Facebook itself, so they sit outside the paywall. These are more engaging.
  • Most posts containing the words "video" or "watch" are actually posted as links to content, which is frequently behind the paywall. These are less engaging.

5. Post on Evenings and Weekends, when appropriate

  • Content posted from 7 PM to 11 PM or on a weekend day has slightly increased odds of high engagement compared to posts added at other times
  • This is likely due to News Feed algorithm prioritizing recently posted content, and these being popular times to engage with Facebook

6. Focus on Additional Highly Engaging Topics

  • Opinion and Editorial content (though not Op-Eds and Ethics)
  • Obituaries
  • Recipes and Cooking (though not Food section)
  • Parenting and Children
  • Mental Health
  • Beauty and Self Care
  • Exercise
  • Marriage and Relationships
  • Religion

See the Appendix in my presentation for additional charts showing the odds ratios for high engagement on these topics.

Caveats and Limitations

  • Facebook's own News Feed algorithm is very important to driving engagement, and is based partly on user-centric preferences which we can't model
  • The cutoff point for "High engagement" is somewhat arbitrary
  • Tastes change, so results from 2016 may not be applicable to present day. Facebook's algorithm also may have changed.
  • These recommendations assume high engagement is the primary goal: they should be considered in the context of The Times' values and mission statement.

Potential Next Steps

  • Review sentiment of articles to see whether that affects engagement
  • Compare engagement on Facebook to comment counts on the New York Times website, to see if there is a difference in what drives engagement there
  • Create an interactive dashboard so engagement of certain words and subjects can be reviewed

For further information

Please review the narrative of my analysis in my introductory Jupyter notebook, my modeling and analysis notebook, and my presentation.

For any additional questions, please contact [email protected]

Repository Structure:

├── README.md                <- The top-level README for reviewers of this project.
├── data_gathering.ipynb     <- 1. Notebook used to gather data from NYT API and match it to posts
├── intro_eda.ipynb          <- 2. Project introduction and data cleaning and exploration
├── model_analysis.ipynb     <- 3. Modeling and analysis of model results to form recommendations
├── presentation.pdf         <- PDF version of project presentation
└── images
    └── images               <- images of visualizations
└── data
    └── data                 <- found and generated during analysis
└── models
    └── models               <- exported copies of best model pipelines, as well as notebook used to model in Google colab
