Web-Scraper-for-a-news-website

This is a web scraper for a specific website (Economic Times). It is tuned to extract that site's headlines. With minor adjustments, it can extract other parts of the site as well.

Installation

Install the following:

Selenium: Follow the instructions at https://selenium-python.readthedocs.io/installation.html to install Selenium.
Chromedriver: Check your Chrome browser's version (Menu -> Help -> About Google Chrome) and download the matching Chromedriver from https://sites.google.com/chromium.org/driver/home
TQDM: https://pypi.org/project/tqdm/
BeautifulSoup4: https://pypi.org/project/beautifulsoup4/
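
Once everything is installed, a quick check like the one below can confirm that the Python packages are importable. This is only a sketch and not part of the repository: the helper name `missing_packages` is hypothetical, and the check covers the Python packages only (Chromedriver is a separate binary and is not checked here).

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names for Selenium, TQDM, and BeautifulSoup4 respectively.
REQUIRED = ["selenium", "tqdm", "bs4"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Python dependencies are importable.")
```

Note that BeautifulSoup4 is installed as `beautifulsoup4` but imported as `bs4`.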

Using the webscraper

The scripts depend on each other's output, so they must be run in the following order:

ET_Archive_Links.py: Run this script first, since everything that follows depends on its output. It collects the initial links from the Archive page of the website.
ET_All_Links_Inside_Archive.py: This script takes the output (a CSV file) of the previous script. It produces a new file that contains the URLs of all the archived news on the website since 2002.
ET_Content.py: Finally, this script scrapes the headlines along with their dates. (If you want to scrape any other part of the website, this is the script to edit.)
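
To give a rough idea of what adjusting the last step might involve, here is a minimal sketch of reading URLs from a CSV file and pulling headline text out of a page with BeautifulSoup. It is not the code in ET_Content.py: the function names, the sample markup, the `ul.content a` selector, and the assumption that URLs sit in the first CSV column are all hypothetical, and the real Economic Times pages will need their own selectors.

```python
import csv
import io

from bs4 import BeautifulSoup


def read_urls(csv_file):
    """Return the first column of each non-empty row (assumed to be a URL)."""
    return [row[0] for row in csv.reader(csv_file) if row]


def extract_headlines(html, selector="a"):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


# Hypothetical markup standing in for one archive page.
SAMPLE_HTML = """
<ul class="content">
  <li><a href="/a1"> First headline </a></li>
  <li><a href="/a2">Second headline</a></li>
</ul>
"""

# An in-memory file stands in for the CSV written by the previous script.
urls = read_urls(io.StringIO("https://example.com/archive/1\n"))
headlines = extract_headlines(SAMPLE_HTML, "ul.content a")
print(urls)
print(headlines)
```

To scrape a different part of a page, changing the selector passed to `extract_headlines` is usually the first thing to try.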

Dataset

I also used the scraper on another news website, "Businessline". Its dataset is available on Kaggle (https://www.kaggle.com/rsiyanwal/20182019-businessline-headlines).

Owner

Rahul Siyanwal
