A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Overview

Parallel web scraping

The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Goal

The script extracts names and prices of the Top-100 crypto coins and stores the data into a db.

Disclaimer

The task is quite contrived and serves mainly for study purpose. There are innumerous of mature sources containing both real-time and historical cryptocurrency data.

Solved problems within the project

  1. Multiple pages with one level nesting have been scraped. The propagation has been implemented by gathering internal links from the main page followed by looping on them.
  2. To avoid getting banned from the remote server, a mechanism dealing with proxy servers was implemented.
  3. A free public proxy server is commonly assumed as unreliable in terms of availability. To overcome this issue:
    • another scraping script extracts a list of free public proxy servers from a web site.
    • with each launch of the script, the list of 10 proxy servers gets updated by currently available proxy servers.
    • during the script execution, some proxy servers get unavailable. Thus, each scraping query goes through this list and searches for an alive proxy server to execute a query.
  4. To speed up the scraping of the total 101 web pages multithreading is involved. The work is divided among 4 threads running almost simultaneously.
  5. The extracted data is being written directly to a DataBase.
Owner
Kushal Shingote
Android Developer📱📱 iOS Apps📱📱 Swift | Xcode | SwiftUI iOS Swift development📱 Kotlin Application📱📱 iOS📱 Artificial Intelligence 💻 Data science
Kushal Shingote
Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. 🔧 Installation Clone the repo locally.

Rokas 3 Jan 02, 2022
Web scrapper para cotizar articulos

WebScrapper Este web scrapper esta desarrollado en python 3.10.0 para buscar en la pagina de cyber puerta articulos dentro del catalogo. El programa t

Jordan Gaona 1 Oct 27, 2021
crypto currency scraping

SCRYPTO What ? Crypto currencies scraping (At the moment, only bitcoin and ethereum crypto currencies are supported) How ? A python script is running

15 Sep 01, 2022
This project was created using Python technology and flask tools to scrape a music site

python-scrapping This project was created using Python technology and flask tools to scrape a music site You need to install the following packages to

hosein moradi 1 Dec 07, 2021
Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine .

TwitterScraper Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine . Screenshot Data Users Only

Remax Alghamdi 19 Nov 17, 2022
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
Introduction to WebScraping Workshop - Semcomp 24 Beta

Extrair informações da internet de forma automatizada. Existem diversas maneiras de fazer isso, nesse tutorial vamos ver algumas delas, por meio de bibliotecas de python.

Luísa Moura 19 Sep 11, 2022
This code will be able to scrape movies from a movie website and also provide download links to newly uploaded movies.

Movies-Scraper You are probably tired of navigating through a movie website to get the right movie you'd want to watch during the weekend. There may e

1 Jan 31, 2022
A simple python script to fetch the latest covid info

covid-tracker-script A simple python script to fetch the latest covid info How it works First, get the current date in MM-DD-YYYY format. Check if the

Dot 0 Dec 15, 2021
This repo has the source code for the crawler and data crawled from auto-data.net

This repo contains the source code for crawler and crawled data of cars specifications from autodata. The data has roughly 45k cars

Tô Đức Anh 5 Nov 22, 2022
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

crawler_to_visual_gmane Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library. This is a set of tools that al

Saim Zafar 1 Dec 20, 2021
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
京东茅台抢购

截止 2021/2/1 日,该项目已无法使用! 京东:约满即止,仅限京东实名认证用户APP端抢购,2月1日10:00开始预约,2月1日12:00开始抢购(京东APP需升级至8.5.6版本及以上) 写在前面 本项目来自 huanghyw - jd_seckill,作者的项目地址我找不到了,找到了再贴上

abee 73 Dec 03, 2022
Proxy scraper. Format: IP | PORT | COUNTRY | TYPE

proxy scraper 🔎 Installation: git clone https://github.com/ebankoff/proxy_scraper Required pip libraries (pip install library name): lxml beautifulso

Eban'ko 19 Dec 07, 2022
Python scrapper scrapping torrent website and download new movies Automatically.

torrent-scrapper Python scrapper scrapping torrent website and download new movies Automatically. If you like it Put a ⭐ on this repo 😇 Run this git

Fazil vk 1 Jan 08, 2022
A dead simple crawler to get books information from Douban.

Introduction A dead simple crawler to get books information from Douban. Pre-requesites Python 3 Install dependencies from requirements.txt (Optional)

Yun Wang 1 Jan 10, 2022
Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation This repository provides two web crawlers to label domain nam

1 Nov 05, 2021
京东秒杀商品抢购Python脚本

Jd_Seckill 非常感谢原作者 https://github.com/zhou-xiaojun/jd_mask 提供的代码 也非常感谢 https://github.com/wlwwu/jd_maotai 进行的优化 主要功能 登陆京东商城(www.jd.com) cookies登录 (需要自

Andy Zou 1.5k Jan 03, 2023
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021