Jupyter notebook crawling web pages
2022-04-23 05:08:00 【FOWng_ lp】
Sending a request with urllib
Take Baidu as an example:
from urllib import request

url = "https://www.baidu.com"
res = request.urlopen(url)  # get a response
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL the response came from
Decode the response body as utf-8:
html = res.read()
html = html.decode("utf-8")
print(html)
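A small variant (my addition, not in the original post): instead of hard-coding utf-8, the charset the server declares in the Content-Type header can be read from the response, with utf-8 as a fallback.

from urllib import request

res = request.urlopen("https://www.baidu.com")
charset = res.info().get_content_charset() or "utf-8"  # fall back to utf-8 when no charset is declared
html = res.read().decode(charset)
print(html[:200])  # only the first 200 characters, to keep the notebook readable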
A crawler that needs headers (take Dianping as an example)
Add header information:
url = "https://www.dianping.com" # Get a response
header={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
req = request.Request(url,headers=header)
res = request.urlopen(req)
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL the response came from
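A robustness note (my addition): sites like Dianping may still reject automated traffic even with a User-Agent set, and urlopen raises an exception on 4xx/5xx responses, so it helps to catch it rather than let the cell die with a traceback.

from urllib import request, error

header = {"User-Agent": "Mozilla/5.0"}  # shortened stand-in for the full UA string above
req = request.Request("https://www.dianping.com", headers=header)
try:
    res = request.urlopen(req, timeout=10)
    print(res.getcode())
except error.HTTPError as e:
    print("HTTP error:", e.code, e.reason)  # e.g. 403 Forbidden when the site blocks the crawler
except error.URLError as e:
    print("connection failed:", e.reason)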
Sending a request with requests
import requests

url = "https://www.baidu.com"
res = requests.get(url)
print(res.encoding)
print(res.headers)  # if there is no Content-Type header, encoding defaults to utf-8; if Content-Type sets a charset, that charset wins; if it sets none, ISO-8859-1
print(res.url)
Running results
ISO-8859-1
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 05 Apr 2020 05:50:23 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
https://www.baidu.com/
res.encoding = "utf-8"
print(res.text)
Running results: the page HTML, now decoded correctly as utf-8.
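An alternative sketch (my addition; relies on the charset-detection dependency that ships with requests): instead of hard-coding "utf-8", let requests guess the encoding from the body bytes.

import requests

res = requests.get("https://www.baidu.com")
print(res.encoding)                   # ISO-8859-1, derived from the headers alone
res.encoding = res.apparent_encoding  # guessed from the body bytes, typically utf-8 here
print(res.text[:200])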
A crawler that needs headers (take Dianping as an example)
Add header information:
import requests
url = "https://www.dianping.com"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
res = requests.get(url, headers=header)
print(res.encoding)
print(res.headers)  # if there is no Content-Type header, encoding defaults to utf-8; if Content-Type sets a charset, that charset wins; if it sets none, ISO-8859-1
print(res.url)
print(res.status_code)
print(res.text)
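A defensive variant (my addition): raise_for_status() turns 4xx/5xx responses into an exception, and a timeout keeps the notebook cell from hanging indefinitely.

import requests

header = {"User-Agent": "Mozilla/5.0"}  # shortened stand-in for the full UA string above
try:
    res = requests.get("https://www.dianping.com", headers=header, timeout=10)
    res.raise_for_status()              # raises requests.HTTPError on 4xx/5xx
    print(res.status_code, len(res.text))
except requests.RequestException as e:  # base class of all requests exceptions
    print("request failed:", e)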
Parsing content with BeautifulSoup4
Take the Sichuan Health Commission website as an example:
from bs4 import BeautifulSoup
import requests
url = 'http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml'
res = requests.get(url)
res.encoding = 'utf-8'
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # wrap the page in a BeautifulSoup object; naming the parser avoids a warning
soup.find('h2').text
a = soup.find('a')      # the first <a> tag on the page
print(a)
print(a.attrs)          # its attributes as a dict
print(a.attrs['href'])  # the link target
Running results
<a href="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml" target="_blank"><img alt="The latest situation of novel coronavirus pneumonia in Sichuan Province (April ..." src="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac/images/b1bc5f23725045d7940a854fbe2d70a9.jpg"/></a>
{'target': '_blank', 'href': '/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml'}
/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml
url_new = "http://wsjkw.sc.gov.cn" + a.attrs['href']  # join the site's fixed prefix with href to assemble the new url
# url_new now points at the latest bulletin page, and every tag inside it can be pulled out the same way
res = requests.get(url_new)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
content = soup.find('p')  # the first paragraph of the bulletin
print(content)
Running results: the first <p> tag of the bulletin page.
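One more BeautifulSoup detail worth knowing (my addition): find() returns only the first match, while find_all() returns every match, which is usually what a crawler needs. A toy example that runs without any network access:

from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p><a href='/x'>link</a></div>"  # made-up input
soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):          # every <p>, not just the first
    print(p.text)
for a in soup.find_all("a", href=True):
    print(a["href"])                  # tags also support dict-style attribute access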
Parsing content with re (regular expressions)
Python's built-in regular expression module.
re.search(regex, str)
1. searches str for a substring that matches regex, and returns None when there is no match
2. the result can be split into groups: add parentheses to the pattern to capture separate pieces of data (see the short demo after the crawl example below)
groups() returns all captured groups as a tuple
group(index) returns the content of one captured group
import re

text = content.text
# print(text)
pattern = r"新增(\d+)例确诊病例"  # Chinese pattern: "newly added (\d+) confirmed cases", matching the page text
res = re.search(pattern, text)
print(res)
Running results: a re.Match object (or None when nothing matches).
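To make the group()/groups() calls from the list above concrete, here is a sketch (my addition) on a made-up sentence, so it runs without fetching the page:

import re

text = "新增3例确诊病例"  # toy text: "3 newly added confirmed cases"
m = re.search(r"新增(\d+)例确诊病例", text)
if m:
    print(m.group(0))   # the whole match
    print(m.group(1))   # the first captured group: '3'
    print(m.groups())   # all captured groups as a tuple: ('3',)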
A supplement on regular expressions
Crawling Tencent data
On handling the interface for Tencent's historical data list: please use the latest interface address.
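Since the post does not give the interface address, the following is only a hypothetical sketch of the usual pattern; the URL below is a placeholder, not the real Tencent endpoint, and the JSON layout is assumed.

import requests

url = "https://example.com/api/history"  # placeholder, substitute the latest interface address
res = requests.get(url, timeout=10)
data = res.json()                        # such interfaces usually return JSON
print(type(data))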
Copyright notice
This article was created by [FOWng_ lp]; please include the original link when reposting. Thanks!
https://yzsam.com/2022/04/202204220549345999.html