Scrapy Tutorial (2): Writing a Simple Spider
2022-04-23 20:21:00 【彎彎廖】
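Goal: scrape the title, price, URL, stock, rating, and cover image of every book on the page. This article uses http://books.toscrape.com/ as the example site.
Check ROBOTSTXT_OBEY
After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells the spider not to obey the site's robots.txt, so only do it after getting the site's permission. Note: the site used here is a practice site built for examples.)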
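As a minimal sketch, the change in settings.py comes down to one line. The comments are added here for illustration, and the rest of the generated settings file is assumed to stay as it is.

```python
# settings.py (fragment)
# Do not consult the target site's robots.txt before crawling.
# Only use this where you have permission, such as the practice site here.
ROBOTSTXT_OBEY = False
```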
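Locate the elements
Go back to the example site and press F12 to open the developer tools.
Start with two small exercises to get familiar with XPath.
First, the book title sits in the a tag inside h3. Its XPath is: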
# Parse the book titles
response.xpath('//h3/a/@title').extract()
# extract() returns the titles of all matching elements as a list
# extract_first() returns only the first title
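To see the difference between the two calls without running a crawl, here is a rough stand-in that uses only Python's standard library. The HTML fragment and the book titles are made up for illustration, and xml.etree.ElementTree is not what Scrapy uses internally; it only mimics the "all matches" versus "first match" behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like the book list markup.
SAMPLE = """
<ol>
  <li><h3><a href="catalogue/book-a_1/index.html" title="Book A">Book A</a></h3></li>
  <li><h3><a href="catalogue/book-b_2/index.html" title="Book B">Book B</a></h3></li>
</ol>
""".strip()

root = ET.fromstring(SAMPLE)

# Like extract(): every matching title, as a list.
all_titles = [a.get("title") for a in root.findall(".//h3/a")]

# Like extract_first(): only the first match, or None if nothing matched.
first_title = all_titles[0] if all_titles else None

print(all_titles)
print(first_title)
```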
Next, look at where the price is. Its XPath is:
# Parse the book prices
response.xpath('//p[@class="price_color"]/text()').extract()
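The expression returns the prices as strings. If a number is needed later, a small helper can convert them. This sketch assumes the text looks like '£51.77' (a currency symbol followed by a decimal number); parse_price is a hypothetical helper name, not part of Scrapy.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the numeric part of a price string as a Decimal, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return Decimal(match.group()) if match else None


print(parse_price("£51.77"))
```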
Finding the URLs matters most: we first need the URL of every book, then send a request to each one to get the data we want. The XPath is:
response.xpath('//h3/a/@href').extract_first()
# Output: 'catalogue/a-light-in-the-attic_1000/index.html'
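Request the first book
Looking at the URL, what we just extracted is only the suffix of the book's address, so we must put the prefix in front of it to get a complete URL. At this point we start writing the first function.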
def parse(self, response):
    # Find the URLs of all books
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Join the URL prefix and suffix
        url = response.urljoin(book)
        yield response.follow(url=url,
                              callback=self.parse_book)

def parse_book(self, response):
    pass
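The joining step can be tried on its own with the standard library's urllib.parse.urljoin, which resolves a relative link against a base URL in a similar way. A minimal sketch, assuming the list page is http://books.toscrape.com/:

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"  # page the links were found on (assumed)
suffix = "catalogue/a-light-in-the-attic_1000/index.html"  # value taken from @href

# Resolve the relative suffix against the page URL to get an absolute URL.
full_url = urljoin(base_url, suffix)
print(full_url)
```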
Parse Data
def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    image_url = response.xpath('//img/@src').extract_first()
    # Turn the relative image path into an absolute URL
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    # Drop the 'star-rating' prefix and the whitespace it leaves behind
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
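The image URL and rating clean-up can also be written as small standalone helpers. This is a sketch under assumptions: the helper names, the image file name, and the word-to-number mapping are hypothetical, and the class value is assumed to look like 'star-rating Three'.

```python
from urllib.parse import urljoin

# Assumed mapping from the word in the class attribute to a number of stars.
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def absolute_image_url(page_url, src):
    """Resolve a relative src such as '../../media/x.jpg' against the page URL."""
    return urljoin(page_url, src)


def rating_from_class(class_value):
    """Turn a class string such as 'star-rating Three' into 3, or None."""
    for word in (class_value or "").split():
        if word in STAR_WORDS:
            return STAR_WORDS[word]
    return None


page = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
print(absolute_image_url(page, "../../media/cache/cover.jpg"))
print(rating_from_class("star-rating Three"))
```

Resolving against the page URL avoids depending on the exact number of '../' segments in the src value.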
View the parsed results
Here we can use yield to see the parsed results:
# inside the parse_book function
yield {
    'title': title,
    'price': price,
    'image_url': image_url,
    'rating': rating,
    'description': description,
}
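The goal at the top also lists stock, which the fields above do not cover. As a hedged sketch only: assuming the book page shows the stock as text such as 'In stock (22 available)' (this format is an assumption, not taken from the article), a hypothetical helper could pull out the number once the text has been extracted.

```python
import re


def parse_stock(text):
    """Return the number of available copies from the stock text, or None."""
    match = re.search(r"\((\d+)\s+available\)", text or "")
    return int(match.group(1)) if match else None


print(parse_stock("In stock (22 available)"))
```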
Complete a simple spider
import scrapy


class BooksSpider(scrapy.Spider):
    # The class name and spider name are placeholders; use your own.
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Find the URLs of all books
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            # Join the URL prefix and suffix
            url = response.urljoin(book)
            yield response.follow(url=url,
                                  callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()
        image_url = response.xpath('//img/@src').extract_first()
        # Turn the relative image path into an absolute URL
        image_url = image_url.replace('../../', 'http://books.toscrape.com/')
        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
        # Drop the 'star-rating' prefix and the whitespace it leaves behind
        rating = rating.replace('star-rating', '').strip()
        description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'image_url': image_url,
            'rating': rating,
            'description': description,
        }
Run the spider
scrapy crawl <your_spider_name>
Copyright notice
This article was written by [彎彎廖]. Please include a link to the original when reposting. Thank you.
https://whitneyliao.blog.csdn.net/article/details/123717221