Materials to reproduce our findings in our stories, "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up"

Overview

Amazon Brands and Exclusives

This repository contains code to reproduce the findings featured in our story "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up" from our series Amazon's Advantage.

Our methodology is described in "How We Analyzed Amazon’s Treatment of Its Brands in Search Results".

Data that we collected and analyzed is in the data folder.
To use the full input dataset (which is not hosted here), please refer to Download data.

Jupyter notebooks used for data preprocessing and analysis are available in the notebooks folder.
Descriptions for each notebook are outlined in the Notebooks section below.

Installation

Python

Make sure you have Python 3.6+ installed. We used Miniconda to create a Python 3.8 virtual environment.

Then install the Python packages:
pip install -r requirements.txt

Notebooks

These notebooks are intended to be run sequentially, but they are not dependent on one another. If you want a quick overview of the methodology, you only need to concern yourself with the notebooks with an asterisk(*).

0-data-preprocessing.ipynb

This notebook parses Amazon search results and Amazon product pages, and produces the intermediary datasets (data/output/datasets/) used in ranking analysis and random forest classifiers.

1-data-analysis-search-results.ipynb *

Bulk of the ranking analysis and stats in the data analysis.

2-random-forest-analysis.ipynb *

Feature engineering training set, finding optimal hyperparameters, and performing the ablation study on a random forest model. The most predictive feature is verified using three separate methods.

3-survey-results.ipynb

Visualizing the survey results from our national panel of 1,000 adults.

4-limiations-product-page-changes.ipynb

Analysis of how often the Buy Box's default shipper and seller change between Amazon and a third party.

utils.py

Contains convenient functions used in the notebooks.

parsers.py

Contains parsers for search results and product pages.

Data

This directory is where inputs, intermediaries, and outputs are saved.

data
├── output
│   ├── figures
│   ├── tables
│   └── datasets
│       ├── amazon_private_label.csv.xz
│       ├── products.csv.xz
│       ├── searches.csv.xz
│       ├── training_set.csv.gz
│       ├── pairwise_training_set.csv.gz
│       └── trademarks
└── input
    ├── combined_queries_with_source.csv
    ├── best_sellers
    ├── generic_search_terms
    ├── search-private-label
    ├── search-selenium
    ├── search-selenium-our-brands-filter_
    ├── selenium-products
    ├── seller_central
    └── spotcheck

data/output/ contains tables, figures, and datasets used in our methodology.

data/output/datasets/amazon_private_label.csv.xz is our dataset of Amazon brands, exclusives, and proprietary electronics (N=137,428 products). We use each product's unique ID (called an ASIN) to identify Amazon's own products in our methodology.

data/output/datasets/trademarks contains a dataset of trademarked brands registered by Amazon. The data was collected from USPTO.gov and Amazon. We included an additional README with the exact steps we took to build this dataset in the directory.

data/output/datasets/searches.csv.xz parsed search result pages from top and generic searches (N=187,534 product positions). You can filter this by search_term for each of these subsets from data/input/combined_queries_with_source.csv.

data/output/datasets/products.csv.xz parsed product pages from the searches above (N=157,405 product pages).

data/output/training_set.csv.gz metadata used to train and evaluate the random forest. Additionally, feature engineering is conducted in notebooks/2-random-forest-analysis.ipynb, which produces pairwise_training_set.csv.gz.

Every file in data/input except combined_queries_with_source.csv is stored in AWS s3. Those are not hosted in this repository.

Download Data

You can find the raw inputs in data/input in s3://markup-public-data/amazon-brands/.

If you trust us, you can download the HTML and JSON files in data/input using this script: sh data/download_input_data.sh

Note this is not necessary to run notebooks and see full results.

data/input/search-selenium/ (12 GB uncompressed)

First page of search results collected in January 2021. Download the HTML files search-selenium.tar.xz (238 MB compressed) here.

data/input/selenium-products/ (220 GB uncompressed)

Product pages collected in February 2021. Download the HTML files selenium-products.tar.xz (9 GB compressed) here.

data/input/search-selenium-our-brands-filter_/ (35 GB uncompressed)

Search results filtered by "our brands". Contains every page of search results. Download search-selenium-our-brands-filter_.tar.xz (403 MB compressed) here.

data/input/search-private-label/ (25 GB uncompressed)

API responses for search results filtered down to products Amazon identifies as "our brands". Contains paginated API results. Download the JSON files search-private-label.tar.xz (402 MB uncompressed) here.

data/input/seller_central/ (105 MB)

Seller central data for Q4 2020. Download the CSV file All_Q4_2020.csv.xz (105 MB compressioned) here.

data/input/best_sellers/ (4 GB)

Amazon's best sellers under the category "Amazon Devices & Accessories". Download the HTML files best_sellers.tar.xz (60MB compressed) here.

data/input/spotcheck/ (4 GB)

A sub-sample of product pages for spot-checking Buy Box changes. Download the HTML files spotcheck.tar.xz (159 MB compressed) here.

Data and a Twitter bot for the EPA's DOCUMERICA (1972-1977) program.

documerica This repository holds JSON(L) artifacts and a few scripts related to managing archival data from the EPA's DOCUMERICA program. Contents: Ma

William Woodruff 2 Oct 27, 2021
Nflmetrics - Johns Hopkins Spring 2022 Sports Analytics research project about NFL Draft Metrics

nflmetrics GitHub repo for Johns Hopkins Spring 2022 Sports Analytics research p

Anish Kulkarni 4 Feb 24, 2022
WikipediaBot from mohirdev.uz

wiki-bot WikipediaBot from mohirdev.uz Requirements wikipedia aiogram Installing wiki/aiogram pip install wikipedia pip install aiogram

Muhammad Ali 5 Sep 28, 2022
A python library for creating Slack slash commands using AWS Lambda Functions

slashbot Slashbot makes it easy to create slash commands using AWS Lambda functions. These can be handy for creating a secure way to execute automated

Eric Brassell 17 Oct 21, 2022
This repository will be a draft of a package about the latest total marine fish production in Indonesia. Data will be collected from PIPP (Pusat Informasi Pelabuhan Perikanan).

indomarinefish This package will give us information about the latest total marine fish production in Indonesia. The Name of the fish is written in In

1 Oct 13, 2021
The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO.

OneSnipe The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO. Documentation View Documentation Features • Mojang & Micros

1 Jan 14, 2022
Program that automates the bump of the Disboard Bot. Done 100% in Python with PyAutoGUI library

Auto-Discord-Bump Program that automates the bump of the Disboard Bot done 100% in python with PyAutoGUI How to configue You will need 3 things before

Mateus 1 Dec 19, 2021
asyncio client for Deta Cloud

aiodeta Unofficial client for Deta Clound Install pip install aiodeta Supported functionality Deta Base Deta Drive Decorator for cron tasks Examples i

Andrii Leitsius 19 Feb 14, 2022
A Discord bot that controls Pico-8.

Pico-8 Discord Bot Synopsis: A Discord bot that controls Pico-8. Please let me know if you make any games with this tool! I will simplify the discord.

Camden 1 Jan 28, 2022
A telegram bot which can show you the status of telegram bot

BotStatus-Ts-Bot An open source telegram Bot Status bot For demo you can check here The status is updated in every 1 hour About Bot This is a Bot stat

Ts_Bots 8 Nov 17, 2022
SystemSix is an e-Ink "desk accessory" running on a Raspberry Pi. It is a bit of nostalgia that can function as a calendar, display the weather

SystemSix is an e-Ink "desk accessory" running on a Raspberry Pi. It is a bit of nostalgia that can function as a calendar, display the weather, the c

John Calhoun 372 Jan 02, 2023
A FORKED AND Modded version of TL:GD for 🅱️3R0K🧲support

for support join here working example group Leech Here For Any Issues/Imrovements or Discussions go here or here Please Leave A star And Fork this Rep

XcodersHub 165 Mar 12, 2022
Google translator bot using pyTelegramBotAPI

iTranslator-bot Super google translator bot using pyTelegramBotAPI A bot is a professional bot that automatically detects a language in texts or capti

Abdulatif 6 Nov 22, 2022
A modular telegram Python bot running on python3 with an sqlalchemy database.

Saber A modular telegram Python bot running on python3 with an sqlalchemy database. Originally a marie fork - Saber has evolved further and was built

ZERO • アクバル . 4 Nov 09, 2021
EzilaX Music ❤ is the best and only Telegram VC player with playlists, Multi Playback, Channel play and more POWERD By SDBOTs

EzilaX-Music 🎵 A bot that can play music on Telegram Group and Channel Voice Chats Available on telegram as @EzilaXMBot Features 🔥 Thumbnail Support

Sadew Jayasekara 9 Oct 24, 2021
You can submit any PR and have SWAGS. Happy Hacktoberfest !

Excluded project Repository 🔴 🔴 🔴 - PR limit is reached. Please use another Repository Hacktoberfest 2021 🎉 🗣 Hacktoberfest encourages participat

Hansajith 63 Oct 21, 2022
This bot will delete messages containing blacklisted words in your telegram groups.

Profanity Detector Bot This bot will delete messages containing blacklisted words in your telegram groups. Made using ProfanityDetector.

Aditya 17 Oct 08, 2022
A unified API wrapper for YouTube and Twitch chat bots.

Chatto A unified API wrapper for YouTube and Twitch chat bots. Contributing Chatto is open to contributions. To find out where to get started, have a

Ethan Henderson 5 Aug 01, 2022
A Simple Telegram Inline Torrent Search Bot by @infotechIT

Torrent-Search-RoBot A Simple Telegram Inline Torrent Search Bot by @infotechIT. Torrent API Using api.infotech.wtf API Host Bot Deploy to Heroku Clic

InfoTech 0 May 05, 2022
Python version of PlaceNL's headless bot with automatic access token refresh

Reddit /r/place 2022 headless bot This headless Python bot will automatically login to reddit, obtain access tokens (and refreshes them when they expi

19 May 21, 2022