Scrapy Tutorial - (2) Writing a Simple Spider
2022-04-23 20:21:00 【彎彎廖】
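Goal: scrape the title, price, URL, stock availability, rating, and cover image of every book on the page. This article uses the practice site books.toscrape.com (the address that appears in the code later on) as its example.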
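Check ROBOTSTXT_OBEY
After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells the crawler not to obey the site's robots.txt, so do it only after getting the site's permission. Note: the site used here is a practice site built for examples.)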
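For reference, the change is a single line. A minimal sketch of the relevant fragment of settings.py, with the rest of the generated file omitted:

```python
# settings.py (fragment)
# The project template generated by `scrapy startproject` sets this to True.
# False tells Scrapy not to fetch and honor robots.txt before crawling.
ROBOTSTXT_OBEY = False
```

Switching it back to True restores the default behavior of the project template.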
Locate the elements
Go back to the example site and press F12 to open the developer tools.

Let's start with two small exercises to get familiar with XPath.
First, the book title sits in the a tag inside each h3. The XPath is:
# Parse the book titles
response.xpath('//h3/a/@title').extract()
# extract() returns the title of every match as a list
# extract_first() returns only the first title
Next, look at where the price is. The XPath is:
# Parse the book prices
response.xpath('//p[@class="price_color"]/text()').extract()
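The difference between "all matches" and "first match" can be tried offline without Scrapy. The sketch below is a stand-in, not Scrapy's API: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on a hypothetical, simplified piece of markup modeled on the listing page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the listing page.
SAMPLE = """
<section>
  <article>
    <h3><a href="catalogue/book-one_1/index.html" title="Book One">Book ...</a></h3>
    <p class="price_color">£10.00</p>
  </article>
  <article>
    <h3><a href="catalogue/book-two_2/index.html" title="Book Two">Book ...</a></h3>
    <p class="price_color">£20.50</p>
  </article>
</section>
""".strip()

root = ET.fromstring(SAMPLE)

# All matches, comparable to .extract() on '//h3/a/@title'.
# ElementTree cannot select attributes in the path, so read them with .get().
all_titles = [a.get("title") for a in root.findall(".//h3/a")]

# First match only, comparable to .extract_first().
first_link = root.find(".//h3/a")
first_title = first_link.get("title") if first_link is not None else None

# Text of every price element, comparable to '//p[@class="price_color"]/text()'.
all_prices = [p.text for p in root.findall(".//p[@class='price_color']")]

print(all_titles)
print(first_title)
print(all_prices)
```

In the Scrapy shell, the same selections are written as the response.xpath(...) expressions from the two exercises.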
Finding the URLs matters most, because we first need the URL of every book, then request each of those URLs to get the data we want. The XPath is:
response.xpath('//h3/a/@href').extract_first()
# Output: 'catalogue/a-light-in-the-attic_1000/index.html'
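Request the first book
Looking at the URL, what we just extracted is only the suffix (the relative path) of the book's address, so the prefix has to be added to make a complete URL. With that in mind, we can write the first function.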
def parse(self, response):
    # Collect the URL of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Join the URL prefix and suffix
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    # Filled in by the next step
    pass
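The prefix-plus-suffix step can be checked without running a spider. response.urljoin resolves a relative link against the URL of the page, and the standard library's urllib.parse.urljoin applies the same resolution rules. A minimal sketch, assuming the listing page was fetched from http://books.toscrape.com/index.html:

```python
from urllib.parse import urljoin

# Assumption: the address the listing page was fetched from.
page_url = "http://books.toscrape.com/index.html"

# The relative href extracted earlier in the article.
href = "catalogue/a-light-in-the-attic_1000/index.html"

# The relative path is resolved against the directory of page_url.
full_url = urljoin(page_url, href)
print(full_url)
# http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Because the join is relative to the current page, the same code keeps working on pages whose links are written relative to a deeper directory.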
Parse Data
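def parse_book(self, response):
    # Book title from the page's h1
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    # The cover src is relative ('../../media/...'), so swap in the site root
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    # The class attribute looks like 'star-rating Three'; keep only the rating word
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()
    # The description is the paragraph that follows the #product_description element
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()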
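The rating arrives as a class string such as 'star-rating Three' and the image path as a relative src, so both need a little cleanup. The sketch below shows one possible way to do that with plain Python. The helper names are hypothetical, and the word-to-number mapping assumes the site spells its ratings "One" through "Five".

```python
from urllib.parse import urljoin

# Assumption: ratings are spelled "One" through "Five" in the class attribute.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Turn a class string such as 'star-rating Three' into 3, or None."""
    if not class_attr:
        return None
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def absolute_image_url(page_url, src):
    """Resolve a relative src such as '../../media/x.jpg' against the page URL."""
    if not src:
        return None
    return urljoin(page_url, src)


print(parse_rating("star-rating Three"))
print(absolute_image_url(
    "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "../../media/cache/example.jpg",  # hypothetical image path
))
```

Both helpers return None instead of raising when the value is missing, which avoids an AttributeError on pages where an element is absent.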
View the parsed result
Use yield to output the parsed fields so you can inspect them:
    # inside the parse_book function
    yield {
        'title': title,
        'price': price,
        'image_url': image_url,
        'rating': rating,
        'description': description,
    }
The complete simple spider
import scrapy


class BooksSpider(scrapy.Spider):
    # Assumption: the class name, spider name, and start URL are not shown in
    # the original article; use the ones from your own project.
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Collect the URL of every book on the page
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            # Join the URL prefix and suffix
            url = response.urljoin(book)
            yield response.follow(url=url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()
        image_url = response.xpath('//img/@src').extract_first()
        image_url = image_url.replace('../../', 'http://books.toscrape.com/')
        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
        # Keep only the rating word, e.g. 'Three'
        rating = rating.replace('star-rating', '').strip()
        description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'image_url': image_url,
            'rating': rating,
            'description': description,
        }
Run the spider
scrapy crawl <your_spider_name>
Copyright notice
This article was written by [彎彎廖]. Please include a link to the original when reposting. Thank you.
https://whitneyliao.blog.csdn.net/article/details/123717221