Scrapy tutorial - (2) Write a simple crawler
2022-04-23 20:22:00 【Winding Liao】
Goal: scrape the title, price, URL, stock availability, rating, and cover image of every book on the page. This article uses a practice website (books.toscrape.com, the host that appears in the code) as its example.
Check ROBOTSTXT_OBEY
After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This setting tells Scrapy not to comply with the website's robots.txt. Only do this after obtaining the site owner's approval. Note: the website used here is a sample site for practice.)
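As a minimal sketch, the relevant line in settings.py would look like the following. The project template generated by Scrapy sets this value to True, so it is the only line that needs to change here.

```python
# settings.py (created by `scrapy startproject`)
# False tells Scrapy to skip the robots.txt check.
# Only use this on sites where you have permission, such as a practice site.
ROBOTSTXT_OBEY = False
```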
Locate the elements
Back on the example website, press F12 to open the developer tools.
Start with two small exercises to get familiar with XPath.
First, each book title is stored in the title attribute of the a tag inside an h3. Its XPath is:
# parse book titles
response.xpath('//h3/a/@title').extract()
# extract() returns a list of every matching title
# extract_first() returns only the first matching title
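The difference between "all matches" and "first match" can be tried without running a crawl. The sketch below uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath, so the attribute is read with .get() instead of /@title, and the sample markup is a hypothetical simplification of the listing page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the book listing page.
SAMPLE = """
<ol>
  <li><h3><a href="catalogue/book-one/index.html" title="Book One">Book ...</a></h3></li>
  <li><h3><a href="catalogue/book-two/index.html" title="Book Two">Book ...</a></h3></li>
</ol>
"""

root = ET.fromstring(SAMPLE.strip())

# Counterpart of .extract(): every match, returned as a list.
all_titles = [a.get("title") for a in root.findall(".//h3/a")]

# Counterpart of .extract_first(): the first match, or None if nothing matched.
first_title = all_titles[0] if all_titles else None

print(all_titles)
print(first_title)
```

In the spider itself the same distinction applies: use extract() when looping over every book on a page and extract_first() when a page holds a single value.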
Next, locate the price. Its XPath is:
# parse book prices
response.xpath('//p[@class="price_color"]/text()').extract()
Finding the URLs matters most, because we first collect the URL of every book, then request each one and pull the information we want from it. The XPath is:
response.xpath('//h3/a/@href').extract_first()
# Output: 'catalogue/a-light-in-the-attic_1000/index.html'
Request the book pages
Looking at the URL, what we just extracted is only the relative part of the book's address, so we have to prepend the site prefix to get a complete URL. With that in mind, let's write the first function.
def parse(self, response):
    # Find the URL of every book
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Combine the URL prefix with the relative suffix
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    # Placeholder, filled in by the next step
    pass
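To see what joining the prefix and the suffix produces, here is a small sketch using the standard library's urllib.parse.urljoin, which Scrapy's response.urljoin wraps. The page URL below is an assumption based on the host used in this article.

```python
from urllib.parse import urljoin

# Assumed URL of the listing page being parsed.
page_url = "http://books.toscrape.com/index.html"
# Relative href, as returned by the XPath query in the text.
href = "catalogue/a-light-in-the-attic_1000/index.html"

# Resolve the relative link against the page it was found on.
full_url = urljoin(page_url, href)
print(full_url)
```

Note that response.follow also accepts relative URLs, so the explicit join is mainly useful for seeing or logging the full address.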
Parse the data
def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    # The image src is relative ('../../media/...'), so turn it into an absolute URL
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    # The class attribute looks like 'star-rating Three'; keep only the rating word
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()
    # The description is the paragraph that follows the #product_description element
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
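The two clean-up steps in parse_book can be checked in isolation. The sketch below assumes a book page URL on the host used in this article, a hypothetical image file name, and a class value of the form 'star-rating Three'. It also compares the string replacement with urllib.parse.urljoin, which resolves the same relative path without hard-coding the host.

```python
from urllib.parse import urljoin

# Assumed URL of a book detail page.
book_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
# Hypothetical raw values, shaped like what the XPath queries return.
raw_image_src = "../../media/cache/example.jpg"
raw_rating_class = "star-rating Three"

# Approach used in the text: swap the relative prefix for the site root.
by_replace = raw_image_src.replace("../../", "http://books.toscrape.com/")
# Alternative: resolve the relative path against the page URL.
by_urljoin = urljoin(book_url, raw_image_src)

# replace() leaves a leading space (' Three'); strip() removes it.
rating = raw_rating_class.replace("star-rating", "").strip()

print(by_replace)
print(by_urljoin)
print(rating)
```

Both approaches give the same address here. The urljoin variant keeps working if the page depth or host changes, which is why it may be the safer choice in a real spider.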
View the parsed results
Use yield to emit the parsed values so you can inspect them:
# inside the parse_book function
yield {
    'title': title,
    'price': price,
    'image_url': image_url,
    'rating': rating,
    'description': description,
}
Complete a simple crawler
# Both methods belong inside your spider class
def parse(self, response):
    # Find the URL of every book
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Combine the URL prefix with the relative suffix
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    # The image src is relative ('../../media/...'), so turn it into an absolute URL
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    # The class attribute looks like 'star-rating Three'; keep only the rating word
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
    yield {
        'title': title,
        'price': price,
        'image_url': image_url,
        'rating': rating,
        'description': description,
    }
Run the crawler
scrapy crawl <your_spider_name>
Copyright notice
This article was written by [Winding Liao]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204232021221091.html