Scrapy Tutorial - (2) Writing a Simple Spider
2022-04-23 20:21:00 【彎彎廖】
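Goal: scrape the title, price, URL, stock availability, rating, and cover image of every book on the page. The example site is the practice site that appears in the code below, books.toscrape.com.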
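Check ROBOTSTXT_OBEY
After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells Scrapy to ignore the site's robots.txt, so do it only after getting the site's permission. Note: the site used here is a sandbox built for scraping practice.)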
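Before switching ROBOTSTXT_OBEY off for a site, it can help to see what its robots.txt actually permits. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt so it runs offline; the rules, the user agent name "my-spider", and the example.com URLs are all assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-spider" and both URLs are placeholders.
catalogue_ok = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-spider", "http://example.com/catalogue/page-1.html"
)
private_ok = is_allowed(
    SAMPLE_ROBOTS_TXT, "my-spider", "http://example.com/private/data.html"
)
print(catalogue_ok, private_ok)
```

If every path you plan to crawl is allowed, leaving ROBOTSTXT_OBEY at True costs nothing.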
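Locate the Elements
Go back to the example site and press F12 to open the developer tools.
Start with two short exercises to get familiar with XPath.
First, each book title is stored in the title attribute of the a tag inside an h3. The XPath is: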
# parse book titles
response.xpath('//h3/a/@title').extract()
# extract() returns a list with every matching title
# extract_first() returns only the first matching title
Next, find where the price is. The XPath is:
# parse book price
response.xpath('//p[@class="price_color"]/text()').extract()
Finding the URLs matters most: we first need the URL of every book, then request each one to get the data we want. The XPath is:
response.xpath('//h3/a/@href').extract_first()
# output: 'catalogue/a-light-in-the-attic_1000/index.html'
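To try the same selection logic without starting a crawl, here is a rough offline sketch. It swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no @attribute or text() steps), so attributes and text are read from the matched elements instead. The markup is a hypothetical, simplified stand-in for a book listing, not the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on a two-book listing.
SAMPLE = """
<section>
  <article>
    <h3><a href="catalogue/book-one_1/index.html" title="Book One">Book O...</a></h3>
    <p class="price_color">10.00</p>
  </article>
  <article>
    <h3><a href="catalogue/book-two_2/index.html" title="Book Two">Book T...</a></h3>
    <p class="price_color">20.50</p>
  </article>
</section>
"""

root = ET.fromstring(SAMPLE.strip())

# Counterpart of //h3/a/@title with extract(): every match, as a list.
titles = [a.get("title") for a in root.findall(".//h3/a")]

# Counterpart of //p[@class="price_color"]/text() with extract().
prices = [p.text for p in root.findall(".//p[@class='price_color']")]

# Counterpart of //h3/a/@href with extract_first(): first match or None.
first_link = root.find(".//h3/a")
first_href = first_link.get("href") if first_link is not None else None

print(titles)
print(prices)
print(first_href)
```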
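Request the First Book
Looking at the URL, what we just extracted is only the trailing part of the book's address, a relative path. The site prefix has to be joined in front of it to form a complete URL. This is where we write our first function.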
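# both methods go inside your spider class
def parse(self, response):
    # collect the relative URL of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # join the site prefix and the relative path into a full URL
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    # placeholder, filled in below
    pass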
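The prefix-plus-suffix join can be tried on its own with the standard library. This is a minimal sketch that assumes the listing page lives at http://books.toscrape.com/ (the prefix that appears later in this article); it uses urllib.parse.urljoin, which resolves a relative path against a base URL the way a browser would.

```python
from urllib.parse import urljoin

# Assumed URL of the listing page the relative links were found on.
base_url = "http://books.toscrape.com/"

# Relative path taken from the XPath output shown earlier.
relative = "catalogue/a-light-in-the-attic_1000/index.html"

full_url = urljoin(base_url, relative)
print(full_url)

# A URL that is already absolute passes through unchanged.
unchanged = urljoin(base_url, "http://books.toscrape.com/index.html")
print(unchanged)
```

Because the join is relative to the page the link was found on, the same code keeps working on deeper pages where a hard-coded prefix would not.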
Parse Data
def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()
    # default='' avoids calling .replace() on None when nothing matches
    image_url = response.xpath('//img/@src').extract_first(default='')
    # turn the relative image path into an absolute URL
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first(default='')
    # drop the class prefix and the space it leaves behind
    rating = rating.replace('star-rating', '').strip()
    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
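The two string clean-ups above can also be written as small helpers that are easy to test in isolation. This is one possible sketch, not part of the original spider: it assumes the class attribute looks like "star-rating Three" and that the image src is relative to the book page. The helper names, the rating-word table, and the image path in the usage lines are hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin

# Assumed mapping from the rating word in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr: Optional[str]) -> Optional[int]:
    """Turn a class value such as 'star-rating Three' into 3."""
    if class_attr is None:
        return None
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def absolute_image_url(page_url: str, src: Optional[str]) -> Optional[str]:
    """Resolve a relative image src against the URL of the page it is on."""
    if src is None:
        return None
    return urljoin(page_url, src)


book_page = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
rating_value = parse_rating("star-rating Three")
image = absolute_image_url(book_page, "../../media/cache/example.jpg")
print(rating_value, image)
```

Resolving with urljoin avoids hard-coding how many "../" segments the path contains, and returning None keeps a missing element from raising an error.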
View the Parsed Results
Use yield to output the parsed results:
    # inside the parse_book function
    yield {
        'title': title,
        'price': price,
        'image_url': image_url,
        'rating': rating,
        'description': description,
    }
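Using yield turns the callback into a generator, and Scrapy treats each yielded dict as one scraped item. The plain-Python sketch below imitates that flow without Scrapy; the function name and the sample records are hypothetical stand-ins for parsed responses.

```python
def parse_book_sketch(records):
    """Yield one dict per record, the way a Scrapy callback yields items."""
    for record in records:
        yield {
            "title": record["title"],
            "price": record["price"],
        }


# Hypothetical records standing in for parsed book pages.
sample_records = [
    {"title": "Book One", "price": "10.00", "extra": "not exported"},
    {"title": "Book Two", "price": "20.50", "extra": "not exported"},
]

# Scrapy iterates the generator for you; here it is done by hand.
items = list(parse_book_sketch(sample_records))
for item in items:
    print(item)
```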
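The Complete Simple Spider
import scrapy


class BooksSpider(scrapy.Spider):
    # class name and spider name are placeholders; use your own
    name = 'books'
    # assumed start page, taken from the URL prefix used below
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # collect the relative URL of every book on the page
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            # join the site prefix and the relative path into a full URL
            url = response.urljoin(book)
            yield response.follow(url=url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()
        # default='' avoids calling .replace() on None when nothing matches
        image_url = response.xpath('//img/@src').extract_first(default='')
        image_url = image_url.replace('../../', 'http://books.toscrape.com/')
        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first(default='')
        # drop the class prefix and the space it leaves behind
        rating = rating.replace('star-rating', '').strip()
        description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'image_url': image_url,
            'rating': rating,
            'description': description,
        }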
Run the Spider
scrapy crawl <your_spider_name>
Copyright notice
This article was written by [彎彎廖]. Please include a link to the original when reposting. Thank you.
https://whitneyliao.blog.csdn.net/article/details/123717221