Scrapy Tutorial - (2) Writing a Simple Crawler
2022-04-23 20:22:00 [Winding Liao]
Goal: crawl the title, price, URL, stock, rating, and cover image of every book on the page. This article uses the sample website books.toscrape.com as the example.
Check ROBOTSTXT_OBEY
After creating the Scrapy project, first open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells the crawler to ignore the site's robots.txt, so only do it with the site owner's consent. Note: the site used here is a sample practice website.)
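In settings.py, the change is a single line (shown here as a minimal fragment):

```python
# settings.py -- disable robots.txt compliance
# (only acceptable on this practice site, with the owner's consent)
ROBOTSTXT_OBEY = False
```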
Inspect where the elements are
Back on the example website, press F12 to open the developer tools.

Let's start with two small exercises to get familiar with XPath~
First, each book title sits in an a tag inside an h3 tag, so its XPath looks like this:
# parse the book titles
response.xpath('//h3/a/@title').extract()
# extract() returns all of the title values
# extract_first() would return only the first title
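To see what these two calls do without running Scrapy, here is a standalone sketch using only the standard library. The HTML fragment is a hypothetical miniature of the page structure, and `findall` plus `.get('title')` plays the role of `response.xpath('//h3/a/@title')`:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the h3 > a structure of books.toscrape.com
html = (
    '<div>'
    '<h3><a title="A Light in the Attic" href="catalogue/a-light-in-the-attic_1000/index.html">A Light...</a></h3>'
    '<h3><a title="Tipping the Velvet" href="catalogue/tipping-the-velvet_999/index.html">Tipping...</a></h3>'
    '</div>'
)

root = ET.fromstring(html)
# Roughly equivalent to response.xpath('//h3/a/@title').extract()
titles = [a.get('title') for a in root.findall('.//h3/a')]
# Roughly equivalent to ...extract_first()
first_title = titles[0]
print(titles)
```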
Next, inspect where the price is; its XPath is as follows:
# parse the book prices
response.xpath('//p[@class="price_color"]/text()').extract()
Looking up the URLs is especially important, because we first have to find the URL of every book, then request each of those URLs to get the data we want. The XPath is as follows:
response.xpath('//h3/a/@href').extract_first()
# output: 'catalogue/a-light-in-the-attic_1000/index.html'
Request the first book
Observing the URL, you can see that what we just extracted is only the suffix of each book's address, which means we have to prepend the site prefix to get a complete URL. So let's write the first function.
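Scrapy's `response.urljoin(...)` joins the page URL with a relative href the same way the standard library's `urljoin` does, as this small sketch shows:

```python
from urllib.parse import urljoin

# Joining the site prefix with the relative suffix extracted above
page_url = 'http://books.toscrape.com/'
suffix = 'catalogue/a-light-in-the-attic_1000/index.html'
full_url = urljoin(page_url, suffix)
print(full_url)
```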
def parse(self, response):
    # find the URLs of all the books on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # join the relative href with the page URL
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    pass
Parse the data
def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    image_url = response.xpath('//img/@src').extract_first()
    # the src is relative ('../../...'), so rebuild the absolute URL
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    # the class attribute looks like 'star-rating Three'
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
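The `rating` left over after stripping 'star-rating' is a word like 'Three'. If you prefer a number, a small helper (hypothetical, not part of the original tutorial) can map it:

```python
# Hypothetical helper: map the word left in the 'star-rating Three'
# class attribute to an integer rating.
WORD_TO_NUM = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def rating_to_int(class_attr):
    # class_attr looks like 'star-rating Three'
    word = class_attr.replace('star-rating', '').strip()
    return WORD_TO_NUM.get(word)

print(rating_to_int('star-rating Three'))
```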
View the parsed results
Here you can use yield to return the parsed results:
# inside the parse_book function
yield {
    'title': title,
    'price': price,
    'image_url': image_url,
    'rating': rating,
    'description': description,
}
Completing a simple crawler
def parse(self, response):
    # find the URLs of all the books on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # join the relative href with the page URL
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
    yield {
        'title': title,
        'price': price,
        'image_url': image_url,
        'rating': rating,
        'description': description,
    }
Run the crawler
scrapy crawl <your_spider_name>
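For example, assuming the spider's name attribute is books (a hypothetical name for illustration), you can also export the yielded items to a file with Scrapy's built-in -o flag:

```shell
# Run the spider and write the yielded items to books.json
scrapy crawl books -o books.json
```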
Copyright notice
This article was written by [Winding Liao]. Please include a link to the original when reposting. Thanks!
https://yzsam.com/2022/04/202204232021221091.html