Complete pipeline for crawling online newspaper articles.

Last update: May 27, 2022

Overview

NewsPipe

This repository contains the complete pipeline for collecting online newspaper articles. The articles are stored in MongoDB. The whole pipeline is dockerized, so the user does not need to worry about dependencies. A docker-compose setup is also provided to improve usability.

Requirements

To use this system, you need to create a .env file in which the MongoDB information is available:

MONGO_ROOT_USER=devroot
MONGO_ROOT_PASSWORD=devroot
MONGOEXPRESS_LOGIN=dev
MONGOEXPRESS_PASSWORD=dev
MONGO_CHART_USERNAME=dev
MONGO_CHART_PASSWORD=dev
POSTGRES_USER=airflow
POSTGRES_PASS=airflow
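
Before starting the containers, it can help to confirm that the .env file defines every variable listed above. The following is a minimal sketch and not part of NewsPipe: the helper names `parse_env` and `missing_keys` are hypothetical, and the parser only handles simple KEY=VALUE lines.

```python
from pathlib import Path

# Variable names taken from the .env example above.
REQUIRED_KEYS = [
    "MONGO_ROOT_USER",
    "MONGO_ROOT_PASSWORD",
    "MONGOEXPRESS_LOGIN",
    "MONGOEXPRESS_PASSWORD",
    "MONGO_CHART_USERNAME",
    "MONGO_CHART_PASSWORD",
    "POSTGRES_USER",
    "POSTGRES_PASS",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit next to the compose file
    if env_path.exists():
        print("Missing keys:", missing_keys(env_path.read_text()))
    else:
        print("No .env file found in the current directory")
```

An empty list means all eight variables are present with non-empty values.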

To change the number of threads, open airflow-newspipe-docker and adjust the sed command in airflow-docker/Dockerfile. For example, for 4 threads per process:

&& sed -i'.orig' 's/max_threads = 2/max_threads = 4/g' ${AIRFLOW_HOME}/airflow.cfg \

Additionally, you can also specify the number of processes (2 processes in this case):

&& sed -i'.orig' 's/parallelism = 32/parallelism = 2/g' ${AIRFLOW_HOME}/airflow.cfg \
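
To see what these two sed expressions do, the same substitutions can be sketched in Python on a sample string. The sample text is an assumption based on the defaults named in the sed patterns (max_threads = 2 and parallelism = 32); the real airflow.cfg contains many more settings, and `apply_overrides` is a hypothetical helper.

```python
def apply_overrides(cfg_text, max_threads=4, parallelism=2):
    """Mimic the two sed substitutions with literal string replacement."""
    cfg_text = cfg_text.replace("max_threads = 2", f"max_threads = {max_threads}")
    cfg_text = cfg_text.replace("parallelism = 32", f"parallelism = {parallelism}")
    return cfg_text


if __name__ == "__main__":
    # Assumed excerpt of airflow.cfg with the defaults the sed patterns expect.
    sample = "max_threads = 2\nparallelism = 32\n"
    print(apply_overrides(sample))
```

Like sed, this only changes lines that match the expected default exactly. If the default in your airflow.cfg differs, the text is left unchanged and the pattern has to be adapted.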

Getting Started

To start this application, run:

docker-compose up

To browse the database collections, use mongo-express, which is available on localhost:8081. MongoDB itself is available on port 27017.
The Airflow application should be available on localhost:8083. You will see the Airflow dashboard with the default examples.
For the mongo chart dashboard, open localhost
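
Once the containers are up, a quick way to check that the services are listening is to try a TCP connection to each port. This is a minimal sketch using only the standard library; it covers the three ports named above and assumes the services run on localhost. The name `is_port_open` is hypothetical.

```python
import socket

# Ports as stated in this README.
SERVICES = {"mongo-express": 8081, "MongoDB": 27017, "Airflow": 8083}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

A reachable port only shows that something accepts connections there. It does not confirm that the application behind it has finished starting.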

Adding article sources

Each crawler is defined as a DAG in the dags folder. To add a data source, add a new DAG file to that folder. A DAG is a Python script that contains the settings for an entire crawling pipeline. Use the default example below as a template.

import os
import datetime

from dag_factory import create_dag

url = "taz.de" # url of newspaper source

# Defining the crawling intervals
airflow_config = {'schedule_interval': '@hourly', # set an interval for continuous crawling
                  'start_date': datetime.datetime(2020, 6, 4, 21), # date on which the DAG starts running
                  'end_date': datetime.datetime(2020, 6, 5, 6), # optional, set only if needed
                  }

# Create crawling DAG
DAG = create_dag(url=url,
                 airflow_config=airflow_config,
                 name=os.path.basename(__file__))

Options for schedule_interval:

preset	meaning	cron
`@once`	Schedule once and only once
`@hourly`	Run once an hour at the beginning of the hour	`0 * * * *`
`@daily`	Run once a day at midnight	`0 0 * * *`
`@weekly`	Run once a week at midnight on Sunday morning	`0 0 * * 0`
`@monthly`	Run once a month at midnight of the first day of the month	`0 0 1 * *`
`@quarterly`	Run once a quarter at midnight on the first day	`0 0 1 */3 *`
`@yearly`	Run once a year at midnight of January 1	`0 0 1 1 *`
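
Airflow also accepts a plain cron expression as schedule_interval, so the presets can be read as shorthand for the cron strings in the table. The sketch below captures that mapping in Python; `to_cron` is a hypothetical helper, and whether create_dag passes a raw cron string through unchanged is an assumption.

```python
# Airflow schedule presets and their cron equivalents.
# "@once" has no cron equivalent, so it maps to None.
PRESET_TO_CRON = {
    "@once": None,
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@quarterly": "0 0 1 */3 *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule_interval):
    """Translate a preset to its cron string; return other values unchanged."""
    if schedule_interval in PRESET_TO_CRON:
        return PRESET_TO_CRON[schedule_interval]
    return schedule_interval


if __name__ == "__main__":
    for preset in PRESET_TO_CRON:
        print(preset, "->", to_cron(preset))
```

For intervals that have no preset, such as every 15 minutes, a cron string like `*/15 * * * *` can be used in place of the preset.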

Mongo Charts

MongoDB Charts is a data visualization tool that is integrated into the MongoDB ecosystem. By default, no visualizations are shipped with NewsPipe, so you have to create dashboards that fit your needs. This involves the following 3 steps:

Set up a data source
Data aggregation
Dashboard creation

These steps are well documented on docs.mongodb.com.

Credentials:

The credentials for mongo charts are:

E-Mail: [email protected]
Password: MONGO_CHART_PASSWORD

Connection URI

URI: mongodb://MONGO_ROOT_USER:[email protected]:27017
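
The URI follows the usual mongodb://user:password@host:port pattern. As a minimal sketch, it can be assembled from the values in the .env file as shown below. The function name `build_mongo_uri` is hypothetical, and the host name `mongo-host.example` is a placeholder assumption that must be replaced with the host of your MongoDB container.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port=27017):
    """Build a MongoDB connection URI, percent-encoding the credentials."""
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"


if __name__ == "__main__":
    # Falls back to the example values from the .env section if unset.
    user = os.environ.get("MONGO_ROOT_USER", "devroot")
    password = os.environ.get("MONGO_ROOT_PASSWORD", "devroot")
    # Placeholder host, not taken from the NewsPipe configuration.
    print(build_mongo_uri(user, password, "mongo-host.example"))
```

Percent-encoding matters when the password contains characters such as `@` or `:`, which would otherwise be read as URI delimiters.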


Comments

Airflow Web UI crashing

The web UI crashes after some time. The problem seems to lie with Postgres. This is not a serious issue, because the pipeline continues to work without the UI.

opened by steven-mi 1
docs: Fix a few typos
There are small typos in:

README.md

Fixes:

Should read usability rather than useability.

Should read optional rather than optinal.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
opened by timgates42 0

Releases(1.1)

1.1(Jan 2, 2021)
New Features:

MongoDB Chart for data visualization

Refactored DAG factory interface

No data cleaning in DAG. Data is cleaned by the crawler (less code duplication, better testability of cleaning module)

1.0(Oct 12, 2020)
New features:

DAG for updating old MongoDB documents

Add goose for text extraction to make it more stable, e.g. Spiegel works now!

Code refactor for DAG factory

0.1.1-distributed(Sep 7, 2020)

This is the first release for NewsPipe, which is able to run on a celery cluster with multiple threads for every node
0.1-parallel(Sep 7, 2020)

This is the first release for NewsPipe, which only runs on one node with multiple threads
0.1-local(Sep 7, 2020)

This is the first release for NewsPipe, which only runs on one node

Owner

newspipe
