Scrapy tutorial - (2) Write a simple crawler
2022-04-23 20:22:00 【Winding Liao】
Goal: scrape the title, price, URL, stock availability, rating, and cover image of every book on the page. This article uses the practice site books.toscrape.com as the example.
Check ROBOTSTXT_OBEY
After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This setting means the crawler will not comply with the website's robots.txt, so only use it after obtaining the website's approval. Note: the site used here is a sample website meant for practice.)
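As a small sketch of the change described above, the relevant line in settings.py would look roughly like this. The rest of the generated settings file is omitted, and the comments are illustrative rather than part of Scrapy's template.

```python
# settings.py (excerpt)
# Skip the robots.txt check before crawling.
# Only do this when the site owner has agreed to it.
ROBOTSTXT_OBEY = False
```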
Locate the elements
Go back to the example website and press F12 to open the developer tools.
Start with two small exercises to get familiar with XPath.
First, the book title is stored in the a tag inside each h3. Its XPath is:
# Parse the book titles
response.xpath('//h3/a/@title').extract()
# extract() returns every matching title as a list
# extract_first() returns only the first matching title
Next, check where the price is located. Its XPath is:
# Parse the book prices
response.xpath('//p[@class="price_color"]/text()').extract()
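To make the difference between extract() and extract_first() concrete, here is a minimal sketch that uses only Python's standard library instead of Scrapy. The markup is a made-up, simplified stand-in for the listing page, and xml.etree.ElementTree supports only a subset of XPath (no @title or text() steps), so attributes and text are read with .get() and .text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on a book listing page.
SAMPLE = """
<section>
  <article>
    <h3><a href="catalogue/book-one/index.html" title="Book One">Book O...</a></h3>
    <p class="price_color">£10.00</p>
  </article>
  <article>
    <h3><a href="catalogue/book-two/index.html" title="Book Two">Book T...</a></h3>
    <p class="price_color">£20.50</p>
  </article>
</section>
"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of //h3/a/@title with extract(): every match, as a list.
titles = [a.get("title") for a in root.findall(".//h3/a")]

# Rough equivalent of extract_first(): the first match, or None if nothing matched.
first_title = titles[0] if titles else None

# Rough equivalent of //p[@class="price_color"]/text() with extract().
prices = [p.text for p in root.findall(".//p[@class='price_color']")]

print(titles)
print(first_title)
print(prices)
```

In the Scrapy shell the same queries are the one-line response.xpath(...) calls shown above; the sketch only illustrates the list-versus-single-value behaviour.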
Finding the URLs is the most important step, because we first collect the URL of every book, then request each of those URLs to get the information we want. The XPath is:
response.xpath('//h3/a/@href').extract_first()
# Output: 'catalogue/a-light-in-the-attic_1000/index.html'
Request the first book
Looking at the URL, you can see that what we just extracted is only the relative part (the suffix) of the book's address, so the site prefix has to be added to form a complete URL. With that in mind, let's write the first function.
def parse(self, response):
    # Find the URLs of all books on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Combine the URL prefix with the relative suffix
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    # Filled in in the next section
    pass
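The prefix-plus-suffix step can be tried outside Scrapy. response.urljoin resolves a relative link against the URL of the page being parsed, which is essentially what urllib.parse.urljoin does in the standard library. The page URL below is an assumption (the listing page of the example site); the relative link is the one extracted earlier.

```python
from urllib.parse import urljoin

# Assumed URL of the listing page the spider is parsing.
page_url = "http://books.toscrape.com/index.html"

# Relative link as returned by the //h3/a/@href XPath.
relative = "catalogue/a-light-in-the-attic_1000/index.html"

# Resolve the relative link against the page URL.
absolute = urljoin(page_url, relative)
print(absolute)
```

response.follow also accepts relative URLs directly, so the explicit urljoin call in parse is optional; it is kept here because it makes the resolution step visible.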
Parse the data
def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    # The cover image src is relative, so swap the '../../' prefix for the site root
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    # The rating is encoded in the class attribute next to 'star-rating'
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    # strip() removes the space left behind after dropping 'star-rating'
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
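The string replacement used for image_url works because the src starts with '../../'. As a hedged alternative sketch, the same result can be obtained by resolving the relative path against the book page's URL, which does not depend on how many '../' segments the path contains. Both the detail-page URL and the src value below are assumptions for illustration; the real image path on the site will differ.

```python
from urllib.parse import urljoin

# Assumed detail-page URL, built from the relative link extracted earlier.
book_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# Hypothetical src value in the same '../../' form as on the book page.
src = "../../media/cache/example/cover.jpg"

# Approach used in parse_book: replace the relative prefix with the site root.
via_replace = src.replace("../../", "http://books.toscrape.com/")

# Alternative: resolve the relative path against the page URL.
via_urljoin = urljoin(book_url, src)

print(via_replace)
print(via_urljoin)
```

Inside a spider, the equivalent of the second approach would be response.urljoin(image_url).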
View the parsed results
Use yield to output the parsed results:
# Inside the parse_book function
yield {
    'title': title,
    'price': price,
    'image_url': image_url,
    'rating': rating,
    'description': description,
}
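The rating that parse_book yields is a word taken from the class attribute. If a numeric rating is more convenient, a small helper along these lines could convert it. This is only a sketch: the helper name is hypothetical, and it assumes the class attribute has the form "star-rating Three" with the words One to Five.

```python
# Assumed mapping from the rating word in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(class_attr):
    """Return the star count encoded in a class string, or None if absent."""
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


print(rating_to_int("star-rating Three"))
```

Because the helper splits on whitespace, it works both on the raw class attribute and on the value left after removing 'star-rating'.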
The complete simple crawler
import scrapy


class BooksSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; use the values from your own project
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Find the URLs of all books on the page
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            # Combine the URL prefix with the relative suffix
            url = response.urljoin(book)
            yield response.follow(url=url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()
        image_url = response.xpath('//img/@src').extract_first()
        image_url = image_url.replace('../../', 'http://books.toscrape.com/')
        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
        # strip() removes the space left behind after dropping 'star-rating'
        rating = rating.replace('star-rating', '').strip()
        description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'image_url': image_url,
            'rating': rating,
            'description': description,
        }
Run the crawler
scrapy crawl <your_spider_name>
Copyright notice
This article was written by [Winding Liao]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204232021221091.html