🐞 Douban Movie / Douban Book Scrapy

Last update: Dec 03, 2022

Overview

ScrapyDouban

A Python 3 Scrapy crawler for Douban Movie and Douban Book that downloads covers, crawls data, and stores comments in the database.

I maintain this project to share some of my experience with Scrapy. It covers about 80% of what I know about Scrapy, and I hope it helps others who are learning it. Note that the project currently targets Scrapy 2.5.0.

Docker

The project consists of three containers: douban_scrapyd, douban_db, and douban_adminer.

The douban_scrapyd container is based on python:3.9-slim-buster. The Python 3 libraries installed by default are scrapy, scrapyd, pymysql, pillow, and arrow. Port 6800 is mapped as 6800:6800 by default, so the scrapyd management interface is reachable at <host IP>:6800. Login credentials: username scrapyd, password public.

The douban_db container is based on mysql:8. The root password is public, and the default initialization imports the docker/mysql/douban.sql file into the douban database.

The douban_adminer container is based on adminer:4. Port 8080 is mapped as 8080:8080 by default, so the database management interface is reachable at <host IP>:8080. Login parameters: server mysql, username root, password public.

Project SQL

The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process

First collect Subject ID --> then crawl the detail page by Subject ID to collect data --> finally collect comments by Subject ID

Usage

$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# enter douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# enter the scrapy project directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Crawl movie data
$ scrapy crawl movie_subject # collect movie Subject IDs
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comments
# Crawl book data
$ scrapy crawl book_subject # collect book Subject IDs
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comments
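
The crawl commands above follow the collection process order: Subject IDs first, then detail data, then comments. A minimal sketch of a wrapper that enforces that order, assuming it is run from /srv/ScrapyDouban/scrapy inside the douban_scrapyd container. The helper names crawl_commands and run_all are hypothetical and not part of the project.

```python
import subprocess

# Stage order taken from the collection process: IDs, then details, then comments.
STAGES = ("subject", "meta", "comment")


def crawl_commands(kind):
    """Return the `scrapy crawl` commands for "movie" or "book", in order."""
    if kind not in ("movie", "book"):
        raise ValueError(f"unknown kind: {kind!r}")
    return [["scrapy", "crawl", f"{kind}_{stage}"] for stage in STAGES]


def run_all(kind, runner=subprocess.run):
    """Run each stage in turn; check=True makes a failing stage raise and stop."""
    for command in crawl_commands(kind):
        runner(command, check=True)
```

Calling run_all("movie") would then run the three movie spiders one after another and stop if one of them exits with an error.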

To change code more easily while testing, you can mount the project's scrapy directory into the douban_scrapyd container. If you are used to working with scrapyd, you can deploy the project directly to the douban_scrapyd container via scrapyd-client.
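
Once a project is deployed to scrapyd, spiders can be started through scrapyd's schedule.json HTTP endpoint. The sketch below only builds such a request. It assumes the scrapyd project is named douban (the real name depends on your deploy configuration), that scrapyd is reachable on localhost:6800, and that it accepts HTTP basic authentication with the credentials listed in the Docker section. The function name build_schedule_request is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_schedule_request(spider, project="douban",
                           base_url="http://localhost:6800",
                           username="scrapyd", password="public"):
    """Build (but do not send) a POST request to scrapyd's schedule.json."""
    # schedule.json expects form-encoded `project` and `spider` parameters.
    body = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    # Basic auth header built from the scrapyd username and password.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/schedule.json",
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen, for example urllib.request.urlopen(build_schedule_request("movie_subject")), would send the request to a running scrapyd instance.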

Proxy IP

Because of Douban's anti-crawler mechanism, using a proxy IP is currently the only way to get around it. The ProxyMiddleware middleware is not enabled in the default settings.py. If you really need Douban's data for research, consider renting a paid proxy pool.
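
For orientation, a proxy downloader middleware in Scrapy can be as small as the sketch below. This is not the project's ProxyMiddleware, only an illustration of the general pattern: it sets request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware then honours. The PROXY_LIST setting name is an assumption introduced here for illustration.

```python
import random


class ProxyMiddleware:
    """Assign a randomly chosen proxy to each outgoing request (illustrative)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting name, not one the project defines.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

A middleware like this is enabled by listing it in DOWNLOADER_MIDDLEWARES in settings.py with an order below 750, so that it runs before the built-in HttpProxyMiddleware.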

Image Download

douban.pipelines.CoverPipeline handles the cover download logic and decides which items to process by filtering on spider.name. Downloaded image files are saved to the /srv/ScrapyDouban/storage directory of the douban_scrapyd container.
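
The idea of filtering on spider.name can be sketched as follows. This is not the project's CoverPipeline: it performs no download and only shows the filtering step plus a target path under the storage directory. The class name CoverFilterPipeline, the set of spider names, and the cover and cover_path item fields are all assumptions made for illustration.

```python
import posixpath

STORAGE_DIR = "/srv/ScrapyDouban/storage"  # save path named in the text


class CoverFilterPipeline:
    """Only handle covers for selected spiders; pass other items through."""

    # Assumption: covers come from the detail-page spiders.
    cover_spiders = {"movie_meta", "book_meta"}

    def process_item(self, item, spider):
        if spider.name not in self.cover_spiders:
            return item  # items from other spiders are left untouched
        cover_url = item.get("cover")  # hypothetical field name
        if cover_url:
            # Drop any query string, then keep the last path segment as filename.
            filename = posixpath.basename(cover_url.split("?", 1)[0])
            item["cover_path"] = posixpath.join(STORAGE_DIR, spider.name, filename)
        return item
```

In a real project the download itself would more likely be delegated to Scrapy's ImagesPipeline, with a check like the one above deciding whether an item is processed at all.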

Owner

Xingbo Jia
