A high-performance, lightweight and human-friendly serving engine for Scrapy

Last update: Mar 01, 2022

Overview

scrapy-x (X)

A distributed, scalable and lightweight environment for deploying and running Scrapy spiders/projects without hassle on commodity hardware. It is also compatible with the scrapyd /schedule.json and /daemonstatus.json endpoints.
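Because the scrapyd-style /schedule.json endpoint is supported, an existing scrapyd client call should carry over. The sketch below only builds the request and does not send it. It assumes the server is reachable at localhost:6800 (scrapyd's usual port), uses a hypothetical spider name `example_spider`, and follows scrapyd's own form fields (`project`, `spider`, plus extra spider arguments); whether X needs the `project` field is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL; adjust to wherever your X server listens.
BASE_URL = "http://localhost:6800"


def build_schedule_request(project: str, spider: str, **spider_args: str) -> Request:
    """Build (but do not send) a scrapyd-style POST to /schedule.json."""
    # scrapyd expects form-encoded fields; extra fields become spider arguments.
    form = {"project": project, "spider": spider, **spider_args}
    body = urlencode(form).encode("utf-8")
    return Request(
        f"{BASE_URL}/schedule.json",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# "example_spider" and "category" are hypothetical names for illustration.
req = build_schedule_request("TestCrawler", "example_spider", category="books")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually send it, pass `req` to `urllib.request.urlopen` while the server is running.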

Installation

$ pip install -U git+https://github.com/speakol-ads/scrapy-x.git

Usage

Let's assume that you have a Scrapy project called TestCrawler:

cd into TestCrawler
run scrapy x
That is all!

Default Settings

X reads its configuration from your project's default settings.py file. The available settings and their default values are:

import os

# whether to enable debug mode or not
X_DEBUG = True

# the default queue name
# it is used as a prefix for the internal queues;
# currently there is only one queue, `X_QUEUE_NAME + '.BACKLOG'`,
# which holds all jobs that should be crawled.
X_QUEUE_NAME = 'SCRAPY_X_QUEUE'

# the number of queue workers
# defaults to the cpu core count
# adjust it based on your resources & needs
X_QUEUE_WORKERS_COUNT = os.cpu_count()

# the number of webserver workers
# that uvicorn should spawn
# defaults to the cpu core count
# adjust it based on your resources & needs
X_SERVER_WORKERS_COUNT = os.cpu_count()

# the port the http server should listen on
X_SERVER_LISTEN_PORT = 6800

# the host used by the http server to listen on
X_SERVER_LISTEN_HOST = '0.0.0.0'

# whether to enable access log or not
X_ENABLE_ACCESS_LOG = True

# redis host
X_REDIS_HOST = 'localhost'

# redis port
X_REDIS_PORT = 6379

# redis db
X_REDIS_DB = 0

# redis password
X_REDIS_PASSWORD = ''

# the maximum allowed wait time for a running task
# it will be killed after that time.
X_TASK_TIMEOUT = 25
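Since these values live in the project's settings.py, overriding a default should only require redefining the name there. The excerpt below is a sketch for a small, local-only setup; the chosen values are illustrative assumptions, not recommendations from the project.

```python
# settings.py (excerpt) -- hypothetical overrides, values chosen for illustration

# quieter output outside of development
X_DEBUG = False
X_ENABLE_ACCESS_LOG = False

# fewer workers than the cpu-count default, for a small machine
X_QUEUE_WORKERS_COUNT = 2
X_SERVER_WORKERS_COUNT = 1

# listen on the loopback interface only instead of '0.0.0.0'
X_SERVER_LISTEN_HOST = '127.0.0.1'
```

Any setting not redefined keeps the default listed above.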

Available Endpoints

In addition to the scrapyd core endpoints (schedule.json, daemonstatus.json), the following endpoints are available:

GET /

Returns information about the engine, such as the available spiders and the backlog queue length.

GET|POST /run/{spider_name}

Executes the spider named {spider_name} and waits for it to return its result. P.S: any query parameters and JSON POST data are passed to the spider as arguments, as with -a key=value.
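As a rough illustration of how spider arguments map onto this endpoint, the sketch below builds a /run URL with arguments as query parameters. The base URL assumes a local server on the default port 6800, and `example_spider`, `category` and `page` are hypothetical names.

```python
from urllib.parse import quote, urlencode

# Assumed base URL: local server on the default X_SERVER_LISTEN_PORT.
BASE_URL = "http://localhost:6800"


def run_url(spider_name: str, **spider_args: str) -> str:
    """Return the /run/{spider_name} URL with spider arguments as query params."""
    url = f"{BASE_URL}/run/{quote(spider_name, safe='')}"
    if spider_args:
        # each key=value pair plays the role of a `-a key=value` argument
        url += "?" + urlencode(spider_args)
    return url


url = run_url("example_spider", category="books", page="2")
print(url)  # http://localhost:6800/run/example_spider?category=books&page=2
```

Fetching this URL (for example with `urllib.request.urlopen`) blocks until the spider returns, so a client-side timeout is worth setting.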

GET|POST /enqueue/{spider_name}

Adds the spider named {spider_name} to the backlog to be executed later. P.S: any query parameters and JSON POST data are used as spider arguments.
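Enqueueing with a JSON body could look like the following sketch, which builds the POST request without sending it. The base URL, the spider name `example_spider` and the `category` argument are assumptions for illustration; the response format is not described here, so it is not shown.

```python
import json
from urllib.request import Request

# Assumed base URL: local server on the default X_SERVER_LISTEN_PORT.
BASE_URL = "http://localhost:6800"


def build_enqueue_request(spider_name: str, spider_args: dict) -> Request:
    """Build (but do not send) a JSON POST to /enqueue/{spider_name}."""
    body = json.dumps(spider_args).encode("utf-8")
    return Request(
        f"{BASE_URL}/enqueue/{spider_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The JSON keys are meant to become spider arguments once the job is picked up.
enqueue_req = build_enqueue_request("example_spider", {"category": "books"})
print(enqueue_req.get_method(), enqueue_req.full_url)
print(enqueue_req.data.decode("utf-8"))
```

Unlike /run, this call is expected to return without waiting for the crawl, since the job only joins the backlog.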

Author

I'm Mohamed, a software engineer who enjoys writing code in his free time. I speak Python, PHP, Go, Rust and JS.

P.S: star the project if you liked it ^_^

Owner

Speakol Ads