Jupyter notebook crawling web pages
2022-04-23 05:08:00 【FOWng_ lp】
Sending a request with urllib
Take Baidu as an example:
from urllib import request

url = "https://www.baidu.com"
res = request.urlopen(url)  # get a response
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL the response came from
Decode the response body as utf-8:
html = res.read()
html = html.decode("utf-8")
print(html)
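A small variant (my addition, not in the original post): instead of hard-coding utf-8, the charset the server declares in the Content-Type header can be read from the response, with utf-8 as a fallback.

from urllib import request

res = request.urlopen("https://www.baidu.com")
charset = res.info().get_content_charset() or "utf-8"  # fall back to utf-8 when no charset is declared
html = res.read().decode(charset)
print(html[:200])  # only the first 200 characters, to keep the notebook readable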
A crawler that needs headers (take Dianping as an example)
Add header information:
url = "https://www.dianping.com" # Get a response
header={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
req = request.Request(url,headers=header)
res = request.urlopen(req)
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL the response came from
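A robustness note (my addition): sites like Dianping may still reject automated traffic even with a User-Agent set, and urlopen raises an exception on 4xx/5xx responses, so it helps to catch it rather than let the cell die with a traceback.

from urllib import request, error

header = {"User-Agent": "Mozilla/5.0"}  # shortened stand-in for the full UA string above
req = request.Request("https://www.dianping.com", headers=header)
try:
    res = request.urlopen(req, timeout=10)
    print(res.getcode())
except error.HTTPError as e:
    print("HTTP error:", e.code, e.reason)  # e.g. 403 Forbidden when the site blocks the crawler
except error.URLError as e:
    print("connection failed:", e.reason)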
Sending a request with requests
import requests

url = "https://www.baidu.com"
res = requests.get(url)
print(res.encoding)
print(res.headers)  # if there is no Content-Type header, encoding defaults to utf-8; if Content-Type sets a charset, that charset wins; if it sets none, ISO-8859-1
print(res.url)
Running results
ISO-8859-1
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 05 Apr 2020 05:50:23 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
https://www.baidu.com/
res.encoding = "utf-8"
print(res.text)
Running results: the page HTML, now decoded correctly as utf-8.
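An alternative sketch (my addition; relies on the charset-detection dependency that ships with requests): instead of hard-coding "utf-8", let requests guess the encoding from the body bytes.

import requests

res = requests.get("https://www.baidu.com")
print(res.encoding)                   # ISO-8859-1, derived from the headers alone
res.encoding = res.apparent_encoding  # guessed from the body bytes, typically utf-8 here
print(res.text[:200])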
A crawler that needs headers (take Dianping as an example)
Add header information:
import requests
url = "https://www.dianping.com"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
res = requests.get(url, headers=header)
print(res.encoding)
print(res.headers)  # if there is no Content-Type header, encoding defaults to utf-8; if Content-Type sets a charset, that charset wins; if it sets none, ISO-8859-1
print(res.url)
print(res.status_code)
print(res.text)
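A defensive variant (my addition): raise_for_status() turns 4xx/5xx responses into an exception, and a timeout keeps the notebook cell from hanging indefinitely.

import requests

header = {"User-Agent": "Mozilla/5.0"}  # shortened stand-in for the full UA string above
try:
    res = requests.get("https://www.dianping.com", headers=header, timeout=10)
    res.raise_for_status()              # raises requests.HTTPError on 4xx/5xx
    print(res.status_code, len(res.text))
except requests.RequestException as e:  # base class of all requests exceptions
    print("request failed:", e)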
Parsing content with BeautifulSoup4
Take the Sichuan Health Commission website as an example:
from bs4 import BeautifulSoup
import requests
url = 'http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml'
res = requests.get(url)
res.encoding = 'utf-8'
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # wrap the page in a BeautifulSoup object; naming the parser avoids a warning
soup.find('h2').text
a = soup.find('a')      # the first <a> tag on the page
print(a)
print(a.attrs)          # its attributes as a dict
print(a.attrs['href'])  # the link target
Running results
<a href="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml" target="_blank"><img alt="The latest situation of novel coronavirus pneumonia in Sichuan Province (April ..." src="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac/images/b1bc5f23725045d7940a854fbe2d70a9.jpg"/></a>
{'target': '_blank', 'href': '/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml'}
/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml
url_new = "http://wsjkw.sc.gov.cn" + a.attrs['href']  # join the site's fixed prefix with href to assemble the new url
# url_new now points at the latest bulletin page, and every tag inside it can be pulled out the same way
res = requests.get(url_new)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
content = soup.find('p')  # the first paragraph of the bulletin
print(content)
Running results: the first <p> tag of the bulletin page.
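One more BeautifulSoup detail worth knowing (my addition): find() returns only the first match, while find_all() returns every match, which is usually what a crawler needs. A toy example that runs without any network access:

from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p><a href='/x'>link</a></div>"  # made-up input
soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):          # every <p>, not just the first
    print(p.text)
for a in soup.find_all("a", href=True):
    print(a["href"])                  # tags also support dict-style attribute access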
Parsing content with re (regular expressions)
Python's built-in regular expression module.
re.search(regex, str)
1. searches str for a substring that matches regex, and returns None when there is no match
2. the result can be split into groups: add parentheses to the pattern to capture separate pieces of data (see the short demo after the crawl example below)
groups() returns all captured groups as a tuple
group(index) returns the content of one captured group
import re

text = content.text
# print(text)
pattern = r"新增(\d+)例确诊病例"  # Chinese pattern: "newly added (\d+) confirmed cases", matching the page text
res = re.search(pattern, text)
print(res)
Running results: a re.Match object (or None when nothing matches).
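To make the group()/groups() calls from the list above concrete, here is a sketch (my addition) on a made-up sentence, so it runs without fetching the page:

import re

text = "新增3例确诊病例"  # toy text: "3 newly added confirmed cases"
m = re.search(r"新增(\d+)例确诊病例", text)
if m:
    print(m.group(0))   # the whole match
    print(m.group(1))   # the first captured group: '3'
    print(m.groups())   # all captured groups as a tuple: ('3',)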
A supplement on regular expressions
Crawling Tencent data
On handling the interface for Tencent's historical data list: please use the latest interface address.
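Since the post does not give the interface address, the following is only a hypothetical sketch of the usual pattern; the URL below is a placeholder, not the real Tencent endpoint, and the JSON layout is assumed.

import requests

url = "https://example.com/api/history"  # placeholder, substitute the latest interface address
res = requests.get(url, timeout=10)
data = res.json()                        # such interfaces usually return JSON
print(type(data))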
Copyright notice
This article was created by [FOWng_ lp]; please include the original link when reposting. Thanks!
https://yzsam.com/2022/04/202204220549345999.html