Topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using its API

Overview

NLP Space News Topic Modeling

Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com

License: MIT · Code style: black · Pull requests welcome

Table of Contents

  1. Project Idea
  2. Data acquisition
  3. Analysis
  4. Usage
  5. Project Organization

Project Idea

This project aims to learn the topics covered in Space news articles from the Guardian (UK) news publication [1].

[1] Articles were also retrieved from the blog Space.com (by web scraping), the New York Times (space news from the science section) and the Hubble Telescope news archive, but these data sources were not used in the analysis.

Data acquisition

Primary data source

News articles are retrieved using the official API provided by the Guardian.
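As a rough illustration of this step, the sketch below builds a request against the Guardian content API search endpoint and pulls the URL and basic metadata out of a decoded response. It is not the code from the notebooks: the `science/space` tag filter, the page size and the helper names (`build_search_url`, `parse_results`, `fetch_page`) are assumptions made for this example, and a real API key is needed to fetch anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Search endpoint of the Guardian content API.
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(api_key, page=1, page_size=50, tag="science/space"):
    """Build the search URL for one page of results.

    The tag value is an assumption; the notebooks may filter differently.
    """
    params = {
        "api-key": api_key,
        "tag": tag,
        "page": page,
        "page-size": page_size,
        "order-by": "newest",
    }
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


def parse_results(payload):
    """Extract the article URL and metadata from a decoded search response."""
    return [
        {
            "url": item["webUrl"],
            "headline": item["webTitle"],
            "published": item["webPublicationDate"],
            "section": item["sectionName"],
        }
        for item in payload["response"]["results"]
    ]


def fetch_page(api_key, page=1):
    """Fetch and parse one page of results (needs network access and a key)."""
    with urlopen(build_search_url(api_key, page=page)) as response:
        return parse_results(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without an API key.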

Supplementary data sources

Articles were also acquired from the Hubble Telescope news archive, the New York Times (US) and the blog Space.com.

Although these articles were acquired, they were not used in the analysis.

Data file creation

  1. Use 1_get_list_of_urls.ipynb
    • programmatically retrieves URLs from the publication's API or archive
    • retrieves metadata such as date and time, section, sub-section, headline/abstract/short summary, etc.
  2. Use 2_scrape_urls.ipynb
    • scrapes the news article text from each publication URL
  3. Use 3_merge_scraped_and_filter.ipynb
    • merges metadata (1_get_list_of_urls.ipynb) with scraped article text (2_scrape_urls.ipynb)
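The merge in step 3 can be pictured as a join on the article URL. The following is a dependency-free sketch under that assumption; the field names (`url`, `headline`, `text`) and the function name `merge_on_url` are hypothetical, and the notebook itself may use a dataframe library instead.

```python
def merge_on_url(metadata_rows, scraped_rows):
    """Inner-join metadata records with scraped article text on the URL."""
    text_by_url = {row["url"]: row["text"] for row in scraped_rows}
    merged = []
    for meta in metadata_rows:
        text = text_by_url.get(meta["url"])
        if text is None:
            continue  # URL was listed but no text was scraped for it
        merged.append({**meta, "text": text})
    return merged


# Illustrative records only; real rows come from notebooks 1 and 2.
metadata = [
    {"url": "https://example.org/a", "headline": "Launch delayed"},
    {"url": "https://example.org/b", "headline": "New telescope image"},
]
scraped = [{"url": "https://example.org/a", "text": "Full article text ..."}]

articles = merge_on_url(metadata, scraped)
```

Only URLs present in both inputs survive the join, so articles whose scrape failed are dropped rather than kept with empty text.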

Analysis

Analysis is performed using an unsupervised learning model. Details are included in the 8_gensim_coherence_nlp_trials_v3.ipynb notebook in the root directory.

Usage

  1. Clone this repository
    $ git clone
  2. Create Python virtual environment, install packages and launch interactive Python platform
    $ make build
  3. Run notebooks in the following order
    • 3_merge_scraped_and_filter.ipynb (covers data from the Hubble news feed, New York Times and Space.com)
      • merges multiple files of article text data retrieved from a news publication's API or archive
      • filters out articles of fewer than 500 words
      • exports to a *.csv file for use in unsupervised machine learning models
    • 8_gensim_coherence_nlp_trials_v3.ipynb (does not cover data from the Hubble news feed, New York Times and Space.com)
      • experiments in selecting the number of topics using
        • the coherence score from Gensim's built-in coherence model to score Gensim's NMF
        • sklearn's implementation of TF-IDF + NMF, using the best number of topics found with Gensim's NMF
      • manual reading of articles that NMF associates with each topic
    • 9_nlp_workflow.ipynb
      • code-only version of 8_gensim_coherence_nlp_trials_v3.ipynb, with necessary considerations for deployment of the topic model
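The filtering and export listed for 3_merge_scraped_and_filter.ipynb above can be sketched with the standard library alone. This assumes that words are counted by splitting on whitespace and that each article is a dictionary with a `text` field; both are assumptions for illustration, as are the function names.

```python
import csv

MIN_WORDS = 500  # threshold stated in the notebook description above


def filter_short_articles(rows, min_words=MIN_WORDS):
    """Keep articles whose text has at least min_words whitespace-separated words."""
    return [row for row in rows if len(row["text"].split()) >= min_words]


def export_csv(rows, path):
    """Write the rows to a CSV file, using the first row's keys as the header."""
    if not rows:
        return  # nothing to write
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `export_csv(filter_short_articles(articles), "articles.csv")` would then produce the file consumed by the modeling notebooks; the variable and file name here are placeholders.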

Project Organization

├── .pre-commit-config.yaml       <- configuration file for pre-commit hooks
├── .github
│   └── workflows
│       └── integrate.yml         <- configuration file for GitHub Actions
├── LICENSE
├── environment.yml               <- configuration file to create environment to run project on Binder
├── Makefile                      <- Makefile with commands like `make lint` or `make build`
├── README.md                     <- The top-level README for developers using this project.
├── app
│   ├── data                      <- data exported from training topic modeler, for use with API
│   └── tests                     <- Source code for use in API tests
│       ├── test-logs             <- Reports from running unit tests on API
│       ├── testing_utils         <- Source code for use in unit tests
│       │   └── *.py              <- Scripts to use in testing API routes
│       ├── __init__.py           <- Allows Python modules to be imported from testing_utils
│       └── test_api.py           <- Unit tests for API
├── api.py                        <- Defines API routes
├── pytest.ini                    <- Test configuration
├── requirements.txt              <- Packages required to run and test API
├── s*,t*.py                      <- Scripts to use in defining API routes
├── data
│   ├── raw                       <- raw data retrieved from news publication
│   └── processed                 <- merged and filtered data
├── executed-notebooks            <- Notebooks with output.
├── *.ipynb                       <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                    and a short `_` delimited description
├── requirements.txt              <- packages required to execute all Jupyter notebooks interactively (not from CI)
├── setup.py                      <- makes project pip installable (pip install -e .) so `src` can be imported
├── src                           <- Source code for use in this project.
│   ├── __init__.py               <- Makes src a Python module
│   └── *.py                      <- Scripts to use in analysis for pre-processing, training, etc.
├── papermill_runner.py           <- Python functions that execute system shell commands.
└── tox.ini                       <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
edesz