Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

A web crawling framework for everyone, written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
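
Since uvloop is only installed on Linux here, a script that wants to use it where available can fall back to the standard asyncio loop elsewhere. The following is a minimal sketch of that idea, not a description of what gain does internally; the helper name new_loop and the ping coroutine are made up for illustration.

```python
import asyncio


def new_loop():
    """Return a uvloop event loop if uvloop is installed, else a standard one."""
    try:
        import uvloop  # optional dependency, may be missing outside Linux
    except ImportError:
        return asyncio.new_event_loop()
    return uvloop.new_event_loop()


async def ping():
    # Trivial coroutine used only to show that the chosen loop works.
    await asyncio.sleep(0)
    return 'pong'


loop = new_loop()
try:
    result = loop.run_until_complete(ping())
finally:
    loop.close()
print(result)
```

The same script then runs unchanged on machines with and without uvloop.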

Usage

Write spider.py:

import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    # Fields are declared with CSS selectors.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file using async file I/O.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d from being read as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
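
To see which URLs the two patterns in parsers would select, they can be tried with the standard re module. This is only a sketch: the sample URLs are hypothetical, the classify helper is made up for illustration, and using fullmatch is an assumption, since gain may apply the patterns differently.

```python
import re

# Patterns copied from the spider above, written as raw strings.
LISTING = re.compile(r'https://blog.scrapinghub.com/page/\d+/')
POST = re.compile(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')

# Hypothetical URLs, invented for this illustration.
urls = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2016/10/27/an-example-post/',
    'https://blog.scrapinghub.com/about/',
]


def classify(url):
    """Label a URL by the first pattern that matches it completely."""
    if POST.fullmatch(url):
        return 'post'
    if LISTING.fullmatch(url):
        return 'listing'
    return 'ignored'


labels = [classify(url) for url in urls]
for url, label in zip(urls, labels):
    print(label, url)
```

Testing patterns this way before a crawl makes it easier to notice a rule that is too broad or too narrow.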

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        # Extracted values are read from self.results, as in the first example.
        print(self.results['title'])


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    # Each XPath expression selects href attributes of links on the page.
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = 'https://localhost:1234'


MySpider.run()

You can add a proxy setting to the spider, as shown above.
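
Both spiders above also set concurrency = 5. One common way to cap the number of requests in flight with asyncio is a semaphore. The sketch below illustrates that general technique under stated assumptions: it is not gain's implementation, fake_fetch and crawl are hypothetical names, the example.com URLs are placeholders, and no network access takes place. It uses asyncio.run, which needs Python 3.7 or later.

```python
import asyncio

CONCURRENCY = 5  # mirrors `concurrency = 5` in the spiders above


async def fake_fetch(url, semaphore, stats):
    """Pretend to download `url`; a short sleep stands in for the HTTP request."""
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)
        stats['active'] -= 1
    return url


async def crawl(urls):
    # The semaphore admits at most CONCURRENCY fetches at any one time.
    semaphore = asyncio.Semaphore(CONCURRENCY)
    stats = {'active': 0, 'peak': 0}
    pages = await asyncio.gather(*(fake_fetch(u, semaphore, stats) for u in urls))
    return pages, stats['peak']


urls = ['https://example.com/page/%d/' % n for n in range(20)]
pages, peak = asyncio.run(crawl(urls))
print(len(pages), 'pages fetched, peak concurrency', peak)
```

All twenty tasks are scheduled at once, but the recorded peak never exceeds the limit.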

Run the spider:

python spider.py

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.

Owner

Jiuli Gao

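Automates tasks in the China Unicom mobile app: completing tasks, daily check-ins, collecting data allowances, collecting points, and so on.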

A way to scrape sports streams for use with Jellyfin.

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Demonstration of how to use async Python to control multiple Playwright browsers for web scraping

Python web scraper

Pseudo API for Google Trends

Web Scraping Practice with Python

Telegram Group Scraper

Scraping the data from each page of biocides listed on the BAUA website into a CSV file

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

Meme-videos - Scrapes memes and turns them into video compilations

Python scraper to check for earlier appointments in Clalit Health Services

A simple Python script to fetch the latest COVID info

Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Scrapy-based cyber security news finder

This is a web crawler that works on email data from gmane.org and visualizes it in different ways.

Web scraper built using Python.

Get-web-images - Python code that gets images from any site

A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.