
Scrapy tutorial - (2) Write a simple crawler

2022-04-23 20:22:00 Winding Liao


Purpose: crawl the name, price, url, stock, rating, and cover image of every book on the page. This article uses the sample site books.toscrape.com as the example.

Check ROBOTSTXT_OBEY

After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells the spider not to respect the site's robots.txt, so please only do this with the site's approval. Note: the site used here is a sample practice website.)
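
In settings.py the change is a single line:

# settings.py
ROBOTSTXT_OBEY = False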

Locate the elements

Back on the example website, press F12 to open the developer tools.
Start with two small exercises to get familiar with XPath.
First, each book title is in an a tag inside an h3 tag, and the XPath to locate it is as follows:

# parse book titles
response.xpath('//h3/a/@title').extract()

# extract() returns the names of all the titles
# extract_first() returns only the first title's name
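
To follow along interactively, you can open scrapy shell against the site; the value below is the sample site's first book at the time of writing, so your output may differ:

scrapy shell http://books.toscrape.com
>>> response.xpath('//h3/a/@title').extract_first()
'A Light in the Attic'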

Next, inspect the location of the price; the XPath is as follows:

# parse book prices
response.xpath('//p[@class="price_color"]/text()').extract()
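
Checked in the shell, the first price matches the first book (prices on the sample site are fictional, so treat this value as illustrative):

>>> response.xpath('//p[@class="price_color"]/text()').extract_first()
'£51.77'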

Looking up the url is quite important: we first have to find every book's url, then request each url to get the information we want. The XPath is as follows:

response.xpath('//h3/a/@href').extract_first()

# output: 'catalogue/a-light-in-the-attic_1000/index.html'

Request the first book

Observing this url, you can see that what we just parsed is only the suffix of the book's page, which means we have to prepend the site prefix to get a complete url. response.urljoin does exactly that, as the quick shell check below shows, and with it we can write the first function.
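
A minimal check in scrapy shell, using the suffix extracted above:

>>> response.urljoin('catalogue/a-light-in-the-attic_1000/index.html')
'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'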

def parse(self, response):
    # find the url of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # combine the url prefix with the suffix
        url = response.urljoin(book)
        yield response.follow(url=url,
                              callback=self.parse_book)

def parse_book(self, response):
    pass

Parse Data

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()

    # the cover image src is relative ('../../media/...'), so rebuild the full url
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')

    # the rating is encoded in the class attribute, e.g. 'star-rating Three'
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()

    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
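
To see why the replace works, look at the raw class value in the shell; the rating word ('Three' here, for the sample site's first book) varies per book, and .strip() removes the space left over after the replace:

>>> response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
'star-rating Three'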

View the parsed results

Here you can yield a dict to view the parsed results:

# inside the parse_book function
yield {
    'title': title,
    'price': price,
    'image_url': image_url,
    'rating': rating,
    'description': description}
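
Running the spider at this point should log one item per book. For the sample site's first book the item looks roughly like this (the image_url hash and the description are truncated here):

{'title': 'A Light in the Attic',
 'price': '£51.77',
 'image_url': 'http://books.toscrape.com/media/cache/.../....jpg',
 'rating': 'Three',
 'description': '...'}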

Complete a simple crawler

def parse(self, response):
    # find the url of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # combine the url prefix with the suffix
        url = response.urljoin(book)
        yield response.follow(url=url,
                              callback=self.parse_book)

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()

    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')

    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()

    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()

    yield {
        'title': title,
        'price': price,
        'image_url': image_url,
        'rating': rating,
        'description': description}
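
For completeness, here is a sketch of how these two methods sit inside a spider class; the class name, spider name, and start_urls below are assumptions to adapt to your own project:

import scrapy


class BooksSpider(scrapy.Spider):
    # hypothetical spider name; adjust to your project
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # find the url of every book on the page and follow it
        for book in response.xpath('//h3/a/@href').extract():
            yield response.follow(url=response.urljoin(book),
                                  callback=self.parse_book)

    def parse_book(self, response):
        # same parsing logic as shown above, yielding one item per book
        ...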

Run the crawler

scrapy crawl <your_spider_name>
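
For example, if the spider's name attribute is books (as in the sketch above), you can run it and write the yielded items to a file with the -o option:

scrapy crawl books -o books.json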
