当前位置:网站首页>Scrapy教程 - (2)寫一個簡單爬蟲
Scrapy教程 - (2)寫一個簡單爬蟲
2022-04-23 20:21:00 【彎彎廖】
Scrapy教程 - (2)寫一個簡單爬蟲
目的:爬取此網頁的所有書籍名稱,價格,url,庫存,評價及封面圖片。本文以此網站為例
檢查robotstxt_obey
創建好scrapy project後,先到settings.py找到ROBOTSTXT_OBEY,並把它設成False。
(此舉動意義為不遵守該網站的robots.txt,請在徵得該網同意後再施行。備註:此網站為範例練習網站。)
查看元素位置
回到範例網站,按F12打開開發者工具。

先以2個小練習來熟悉一下xpath ~
首先,書籍名稱在h3裡的a tag裡面,位置xpath如下:
// parse book titles
response.xpath('//h3/a/@title').extract()
// extract可以解析出所有title的名稱
// 若是使用extract_first()則會解析出第一個title的名稱
接著查看價格所在位置,xpath如下:
// parse book price
response.xpath('//p[@class="price_color"]/text()').extract()
查找url是相當重要的,因為我們必須先找到所有書籍的url,進一步在request所有url,並獲得我們想要取得的資料,其 xpath如下:
response.xpath('//h3/a/@href').extract_first()
// 輸出結果: 'catalogue/a-light-in-the-attic_1000/index.html'
Request第一本書籍
接著觀察url可以發現,剛剛所解析出的是該書籍網址的後綴,也就是說我們必須把前綴加上去,才是一個完整的url。因此到這裡,我們開始寫第一個function。
def parse(self, response):
// 找所有書籍的url
books = response.xpath('//h3/a/@href').extract()
for book in books:
// 將網址前綴與後綴結合
url = response.urljoin(book)
yield response.follow(url = url,
callback = self.parse_book)
def parse_book(self, response):
pass
Parse Data
def parse_book(self, response):
title = response.xpath('//h1/text()').extract_first()
price = response.xpath('//*[@class="price_color"]/text()').extract_first()
image_url = response.xpath('//img/@src').extract_first()
image_url = image_url.replace('../../', 'http://books.toscrape.com/')
rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
rating = rating.replace('star-rating', '')
description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
查看解析成果
這裡可以用yield來查看解析成果:
// inside parse_book function
yield {
'title': title,
'price': price,
'image_url': image_url,
'rating': rating,
'description': description}
完成一個簡單爬蟲
def parse(self, response):
// 找所有書籍的url
books = response.xpath('//h3/a/@href').extract()
for book in books:
// 將網址前綴與後綴結合
url = response.urljoin(book)
yield response.follow(url = url,
callback = self.parse_book)
def parse_book(self, response):
title = response.xpath('//h1/text()').extract_first()
price = response.xpath('//*[@class="price_color"]/text()').extract_first()
image_url = response.xpath('//img/@src').extract_first()
image_url = image_url.replace('../../', 'http://books.toscrape.com/')
rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
rating = rating.replace('star-rating', '')
description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
yield {
'title': title,
'price': price,
'image_url': image_url,
'rating': rating,
'description': description}
執行爬蟲
scrapy crawl <your_spider_name>
版权声明
本文为[彎彎廖]所创,转载请带上原文链接,感谢
https://whitneyliao.blog.csdn.net/article/details/123717221
边栏推荐
- STM32 Basics
- Redis cache penetration, cache breakdown, cache avalanche
- How to protect ECs from hacker attacks?
- PostgreSQL basic functions
- DNS cloud school | quickly locate DNS resolution exceptions and keep these four DNS status codes in mind
- ArcGIS JS version military landmark drawing (dovetail arrow, pincer arrow, assembly area) fan and other custom graphics
- Browser - learning notes
- ABAQUS script email auto notification
- Servlet learning notes
- [graph theory brush question-5] Li Kou 1971 Find out if there is a path in the graph
猜你喜欢

Error reported by Azkaban: Azkaban jobExecutor. utils. process. ProcessFailureException: Process exited with code 127

Mysql database backup scheme
![[text classification cases] (4) RNN and LSTM film evaluation Tendency Classification, with tensorflow complete code attached](/img/19/27631caff199fbf13f802decbd6ead.gif)
[text classification cases] (4) RNN and LSTM film evaluation Tendency Classification, with tensorflow complete code attached

SQL Server Connectors By Thread Pool | DTSQLServerTP plugin instructions

Browser - learning notes

JDBC tool class jdbcfiledateutil uploads files and date format conversion, including the latest, simplest and easiest way to upload single files and multiple files

DNS cloud school | analysis of hidden tunnel attacks in the hidden corner of DNS

CVPR 2022 | querydet: use cascaded sparse query to accelerate small target detection under high resolution

Modeling based on catiav6

SIGIR'22「微软」CTR估计:利用上下文信息促进特征表征学习
随机推荐
Operation of numpy array
An error is reported when sqoop imports data from Mysql to HDFS: sqlexception in nextkeyvalue
Customize timeline component styles
Mysql database backup scheme
Alicloud: could not connect to SMTP host: SMTP 163.com, port: 25
DTMF dual tone multi frequency signal simulation demonstration system
Historical track data reading of Holux m1200-e Bluetooth GPS track recorder
[graph theory brush question-4] force deduction 778 Swimming in a rising pool
PCL点云处理之基于PCA的几何形状特征计算(五十二)
. Ren -- the intimate artifact in the field of vertical Recruitment!
R language uses the preprocess function of caret package for data preprocessing: BoxCox transform all data columns (convert non normal distribution data columns to normal distribution data and can not
LeetCode动态规划训练营(1~5天)
【PTA】整除光棍
Building the tide, building the foundation and winning the future -- the successful holding of zdns Partner Conference
[target tracking] pedestrian attitude recognition based on frame difference method combined with Kalman filter, with matlab code
nc基础用法3
DNS cloud school rising posture! Three advanced uses of authoritative DNS
Investigate why close is required after sqlsession is used in mybatties
Numpy mathematical function & logical function
R语言survival包coxph函数构建cox回归模型、ggrisk包ggrisk函数和two_scatter函数可视化Cox回归的风险评分图、解读风险评分图、基于LIRI数据集(基因数据集)