NLP-topic-mdel-LDA

1. Dataset

The dataset was gathered from the Energy section of the New York Times website (nytimes.com). The website organizes its articles by category, and I used the energy category. Before crawling, I checked the structure of the website: it is HTML based and has four main frames. I built the crawler with Python and the Selenium Chrome WebDriver. The first step opens the URL; I pointed it directly at the energy section to avoid an extra navigation step. Because I only wanted articles about renewable energy, I entered that search term with Selenium's send_keys function. I then set the sorting option to newest, locating the control by an XPath found with the Chrome inspector. Finally, Selenium scrolls down the page, and the date, title and headline of each article are downloaded and saved as a CSV file.
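The crawling steps above can be sketched roughly as follows. This is not the original crawler: the function names `crawl` and `save_rows`, the `selectors` dictionary and its keys, the scroll count and the wait time are all hypothetical, and the real XPaths have to be looked up in the Chrome inspector. It also assumes the sort control can be activated with a plain click.

```python
import csv
import time


def save_rows(rows, path):
    """Write a list of {date, title, headline} dicts to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "title", "headline"])
        writer.writeheader()
        writer.writerows(rows)


def crawl(url, selectors, query="renewable energy", scrolls=5):
    """Collect articles with Selenium. Every XPath comes from the caller."""
    # Imported here so save_rows() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Type the search term into the search box and submit it.
        box = driver.find_element(By.XPATH, selectors["search_box"])
        box.send_keys(query, Keys.ENTER)
        # Switch the sort order to newest (assumes a clickable control).
        driver.find_element(By.XPATH, selectors["sort_newest"]).click()
        # Scroll to the bottom repeatedly so that more results load.
        for _ in range(scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        rows = []
        for item in driver.find_elements(By.XPATH, selectors["article"]):
            rows.append({
                "date": item.find_element(By.XPATH, selectors["date"]).text,
                "title": item.find_element(By.XPATH, selectors["title"]).text,
                "headline": item.find_element(By.XPATH, selectors["headline"]).text,
            })
        return rows
    finally:
        # Always close the browser, even if a lookup fails.
        driver.quit()
```

A run would then look like `save_rows(crawl(url, selectors), "news.csv")`, where `url` is the energy section address and `selectors` holds the XPaths for that page.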

This dataset contains the date, title and headline of articles related to renewable energy from Dec 11, 2020 to Feb 26, 2021. It has 110 rows in total and no missing values. The ‘news’ column is the combination of the ‘title’ column and the ‘headline’ column. Topic modeling mostly uses the ‘news’ column.
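As a small illustration of how the ‘news’ column can be built and the missing-value check done, here is a sketch with pandas. The two rows are made up for the example, and joining the columns with a single space is an assumption, since the separator is not stated.

```python
import pandas as pd

# Two made-up rows standing in for the crawled CSV file.
df = pd.DataFrame({
    "date": ["2021-02-26", "2021-02-25"],
    "title": ["Solar Grows", "Wind Leases Open"],
    "headline": ["New farms come online.", "Offshore bids are due."],
})

# Join title and headline with a space (the separator is an assumption).
df["news"] = df["title"] + " " + df["headline"]

# Count missing values per column; all zeros means nothing is missing.
print(df.isna().sum())
print(df["news"].tolist())
```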

2. Text pre-processing

The first step removes special characters, numbers and punctuation marks. This uses the Python replace function: every character other than the English alphabet (a-zA-Z) is replaced with a blank (“ “).
The second step removes short words. In this project, words with fewer than 3 alphabetic characters, for example “if”, “it”, “of” and “at”, are assumed to carry no useful information. This uses a for loop and an if statement.
The third step converts capital letters to lowercase, which reduces the total number of distinct words. This uses the apply function.
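The three steps can be sketched as one plain-Python function. This is an illustration, not the project's code: the name `clean_text` is hypothetical, and `re.sub` stands in for the replace call mentioned above.

```python
import re


def clean_text(text):
    """Apply the three pre-processing steps to one string."""
    # Step 1: replace every character that is not a-z or A-Z with a blank.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Step 2: keep only words with at least 3 characters.
    kept = []
    for word in letters_only.split():
        if len(word) >= 3:
            kept.append(word)
    # Step 3: convert capital letters to lowercase.
    return " ".join(kept).lower()


print(clean_text("Solar power is up 20% in the U.S., at last!"))
# prints: solar power the last
```

With a pandas DataFrame, the same function could be run over the whole column with `df["news"].apply(clean_text)`.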

3. LDA

LDA (Latent Dirichlet Allocation) is an unsupervised machine learning model that finds topics in a collection of documents, and it is one of the representative algorithms of topic modeling. In this code, the gensim library is used to build the model.

4. Visualization

The LDA model is visualized with the pyLDAvis package. The distance between circles shows how different the topics are from each other. If two circles overlap, the two topics are similar.

Clicking a circle shows the term frequencies of its words as a bar chart. The blue bar indicates the overall term frequency and the red bar indicates the estimated term frequency within the selected topic. The bar chart is sorted by the red bar.
