NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we use topic modeling and clustering to determine what each article is about, first based on its content and then based only on its title.

The process would be: Load the data. Keep only what we need from the data. Clean the text (e.g., remove stopwords).

Build the bag of words for all documents. Build the bag of words for each document.

Vectorize the data. Run the LDA model. Run the model on all data and save the output to a DataFrame.

Run the clustering algorithm. Save the data to CSV. Make the charts.
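The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the proposal does not say which LDA implementation or clustering algorithm is used, so scikit-learn's `LatentDirichletAllocation` and `KMeans` are assumptions here, as are the toy documents and the choice of two topics and two clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the 'content' column (hypothetical data).
docs = [
    "The government passed a new budget bill in parliament",
    "Parliament debated the election law and the budget",
    "The team won the football match in the final minute",
    "The striker scored twice as the team won the league match",
]

# Bag of words: lowercases, tokenizes, and drops English stopwords.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs)

# LDA: each row of doc_topics is one document's topic distribution.
# n_components=2 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(bow)

# Cluster the documents on their topic distributions.
# KMeans with 2 clusters is an assumption, not the proposal's stated method.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topics)

# Collect the output in a DataFrame and serialize it as CSV.
results = pd.DataFrame(doc_topics, columns=["topic_0", "topic_1"])
results["cluster"] = labels
results["text"] = docs

# With no path argument, to_csv returns the CSV text;
# pass a file name instead to write it to disk.
csv_text = results.to_csv(index=False)
```

For the title-only variant, the same sketch would apply with the `title` column passed in place of `docs`.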

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.

The 'title' is the headline of the news item, article, or essay. The 'content' is the body text of the news item, article, or essay itself.
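Keeping only the two features of interest could look like the sketch below. The in-memory CSV is a hypothetical stand-in for the downloaded file (its rows are invented for illustration), and dropping rows with a missing title or content is an assumption, since the proposal does not say how missing values are handled.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the downloaded CSV (hypothetical rows),
# using the 12 column names listed above.
raw_csv = io.StringIO(
    "id,title,author,date,content,year,month,publication,category,digital,section,url\n"
    "1,Budget passes,A. Writer,2017-01-05,The budget bill passed.,2017,1,"
    "Example Times,politics,1,news,http://example.com/1\n"
    "2,,B. Writer,2017-02-10,Article with no title.,2017,2,"
    "Example Post,sports,1,sport,http://example.com/2\n"
)

# usecols loads only the two features the project uses;
# the other ten columns are never read into memory.
df = pd.read_csv(raw_csv, usecols=["title", "content"])

# Assumption: rows missing either field are dropped.
df = df.dropna(subset=["title", "content"]).reset_index(drop=True)
```

With the real dataset, `raw_csv` would be replaced by the path to the downloaded file.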

TOOLS

pandas, NumPy, scikit-learn, Matplotlib, Seaborn, NLTK, Gensim
