
Reviewing a Common Error in a Web Crawler

2022-04-21 11:49:00 【Python advanced】

Hello everyone, I'm Pippi.

1. Preface

A few days ago, a fan named 【Rain is rain】 in the Python silver exchange group asked a question about a Python web crawler. I'm sharing it here so we can study it together.


The question was as follows:


2. Solution process

The natural first suspicion is that the structure of the original web page has changed. If it has, the xpath selectors no longer match, the extracted list comes back empty, and indexing into it raises a "list index out of range" error.
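The failure mode described above can be reproduced without any network access: `xpath()` returns a plain list, and indexing `[0]` on an empty list raises `IndexError`. Below is a minimal sketch of a guard. `first_text` is a hypothetical helper written for this illustration, not part of lxml.

```python
def first_text(results, default=None):
    """Return the first XPath result, stripped, or `default` if nothing matched.

    `results` is the kind of list that lxml's xpath() returns. `first_text`
    itself is a hypothetical helper, not an lxml API.
    """
    if not results:
        return default
    return str(results[0]).strip()


# An empty list stands in for an XPath expression that matched nothing.
matched = ["  Steps  "]
unmatched = []

print(first_text(matched))           # Steps
print(first_text(unmatched, "n/a"))  # n/a
```

With a guard like this, a page that does not match is reported as missing data instead of crashing the loop, which makes it easier to tell a selector problem from a request problem.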

【Python Initiate】 suggested wrapping the extraction in try exception handling to avoid the error. That stops the crash, but the data still can't be retrieved, which is a real headache.

Later that afternoon, 【Python Initiate】 found the cause while running the code, as shown in the figure below.

The URL here was built incorrectly: it contained one extra /, which caused the page request to fail.
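The screenshot is not reproduced here, so the exact faulty URL is unknown. The sketch below assumes the extra slash came from concatenating a base URL that ends in `/` with an href that starts with `/`. The helper `join_once` and the path `/recipe/123/` are hypothetical and used only for illustration.

```python
def join_once(base, path):
    """Join base and path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


base = "https://www.xiachufang.com/"
href = "/recipe/123/"  # hypothetical href, like one an XPath @href query returns

naive = base + href
print(naive)                  # https://www.xiachufang.com//recipe/123/
print(join_once(base, href))  # https://www.xiachufang.com/recipe/123/
```

Printing the final URL before requesting it is usually the quickest way to spot this kind of mistake.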


After that fix, the code runs. In addition, the detail pages trigger many requests, so remember to sleep briefly between them. The full code is below; interested readers can run it.

      
      
       
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class kitchen(object):
    u = 0  # running count of recipes written so far

    def __init__(self):
        self.url = "https://www.xiachufang.com/category/40076/"
        ua = UserAgent(verify_ssl=False)
        # Pick one random User-Agent and reuse it for every request.
        self.headers = {
            'User-Agent': ua.random,
        }

    def get_page(self, url):
        '''Send a request and return the response body as text.'''
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        time.sleep(2)  # pause after every request
        return html

    def parse_page(self, html):
        parse_html = etree.HTML(html)
        image_src_list = parse_html.xpath('//li/div/a/@href')
        for i in image_src_list:
            try:
                # The hrefs are joined directly onto the host name.
                url = "https://www.xiachufang.com" + i
                # print(url)
                html1 = self.get_page(url)  # second request: the detail page
                parse_html1 = etree.HTML(html1)
                # print(parse_html1)
                # Heading of the steps section, used below as the dish name.
                num = parse_html1.xpath('.//h2[@id="steps"]/text()')[0].strip()
                # Extracted but not used in the output below.
                name = parse_html1.xpath('.//li[@class="container"]/p/text()')
                ingredients = parse_html1.xpath('.//td//a/text()')
                self.u += 1
                # print(self.u)
                food_info = '''
    No. %s

    Dish name: %s
    Ingredients: %s
    Download link: %s,
    =================================================================
                        ''' % (str(self.u), num, ingredients, url)
                # print(food_info)

                # Append each recipe to the output file.
                with open('The kitchen menu.txt', 'a', encoding='utf-8') as f:
                    f.write(str(food_info))
                print(str(food_info))
            except Exception:
                print('xpath did not get any content!')

    def main(self):
        startPage = int(input("Start page: "))
        endPage = int(input("End page: "))
        for page in range(startPage, endPage + 1):
            # Note: self.url contains no "{}" placeholder, so format() returns
            # it unchanged and every iteration requests the same listing page.
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(2.4)
            print("==================================== Page %s crawled successfully ====================================" % page)


if __name__ == '__main__':
    imageSpider = kitchen()
    imageSpider.main()
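The pauses in the listing above are fixed `time.sleep` calls. As a sketch of the same idea, the delay can be factored into a small helper that only sleeps for whatever part of the interval has not already elapsed. `Throttle` and its parameters are hypothetical names introduced for this illustration; they do not come from requests or any other library.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # in a crawler, the HTTP request would follow this call
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The clock and sleep functions are injectable so the helper can be checked without real waiting. In a crawler, a single `throttle.wait()` before each request would replace the scattered sleep calls.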
      
      
      
      
       

The results are saved to a txt file, as shown in the figure below:

When you run into this kind of URL-joining problem, urljoin is the recommended approach. The sample code is as follows:

      
      
       
from urllib.parse import urljoin

source_url = 'https://www.baidu.com/'
child_url1 = '/robots.txt'
child_url2 = 'robots.txt'
final_url1 = urljoin(source_url, child_url1)
final_url2 = urljoin(source_url, child_url2)
print(final_url1)
print(final_url2)
      
      
      
      
       

Both calls print the same URL, https://www.baidu.com/robots.txt.

urljoin joins the two URLs passed as its arguments: parts missing from the second argument are filled in from the first, and if the second argument is already a complete URL, the second takes precedence.
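The rule above has a few cases worth seeing side by side. This is a small sketch using the crawler's category URL as the base; the path `/recipe/123/` is a made-up example, not a real page.

```python
from urllib.parse import urljoin

base = "https://www.xiachufang.com/category/40076/"

# Root-relative path: replaces the whole path of the base.
root_relative = urljoin(base, "/recipe/123/")
print(root_relative)  # https://www.xiachufang.com/recipe/123/

# Relative path: resolved against the base's directory.
relative = urljoin(base, "recipe/123/")
print(relative)  # https://www.xiachufang.com/category/40076/recipe/123/

# Base without a trailing slash: its last segment is replaced.
no_slash = urljoin("https://www.xiachufang.com/category/40076", "recipe/123/")
print(no_slash)  # https://www.xiachufang.com/category/recipe/123/

# Complete URL as the second argument: it wins outright.
absolute = urljoin(base, "https://www.baidu.com/robots.txt")
print(absolute)  # https://www.baidu.com/robots.txt
```

The third case is the one that most often surprises people: whether the base ends in a slash changes the result.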

3. Summary

Hello everyone, I'm Pippi. This article reviewed a common error in a web crawler and gave a specific analysis and code demonstration for it, which helped the fan solve the problem. It closed with a way of joining URLs that is used very often in web crawlers.

Finally, thanks to 【Rain is rain】 for asking the question, to 【Python Initiate】 for the analysis and code demonstration, and to 【꯭】, 【Ash · Gioro】, 【Luna】, 【dcpeng】, 【Mr. Yu Liang】 and others for taking part in the discussion.

Friends, go and try it out! If you run into any Python problems while learning, feel free to add me as a friend, and I'll bring you into the Python learning exchange group to discuss them together.

Copyright notice
This article was written by [Python advanced]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204211109376681.html
