Crawling Web Pages in a Jupyter Notebook
2022-04-23 05:08:00 【FOWng_ lp】
Sending a request with urllib
Using Baidu as an example:
from urllib import request

url = "https://www.baidu.com"
res = request.urlopen(url)  # get a response
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL that actually responded
Decode the response body as utf-8:
html = res.read()
html = html.decode("utf-8")
print(html)
Crawling a site that checks headers (using Dianping as an example)
Adding header information:
url = "https://www.dianping.com"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
req = request.Request(url, headers=header)
res = request.urlopen(req)  # get a response
print(res.info())     # response headers
print(res.getcode())  # status code: 2xx (OK), 3xx (redirect), 4xx (client error, e.g. 404), 5xx (server error)
print(res.geturl())   # the URL that actually responded
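A `Request` object can also be inspected before it is sent, which is handy for checking what headers `urlopen` would transmit. A minimal sketch (the shortened User-Agent string is illustrative, not the full browser string used above):

```python
from urllib import request

# Build the request without sending it, to see what urlopen would transmit
req = request.Request("https://www.dianping.com",
                      headers={"User-Agent": "Mozilla/5.0"})
print(req.get_method())              # → GET
print(req.get_header("User-agent"))  # → Mozilla/5.0
```

Note that urllib stores header keys with only the first letter capitalized ("User-agent"), which is why the lookup key differs from the one passed in.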
Using the requests library
Sending a request with requests:
import requests

url = "https://www.baidu.com"
res = requests.get(url)
print(res.encoding)  # if the Content-Type header sets a charset, that charset is used;
                     # if Content-Type is present without a charset, requests falls back to ISO-8859-1;
                     # with no Content-Type at all, utf-8
print(res.headers)
print(res.url)
Running results
ISO-8859-1
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 05 Apr 2020 05:50:23 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
https://www.baidu.com/
res.encoding = "utf-8"
print(res.text)
Running results (the decoded HTML of the Baidu homepage, omitted here)
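The ISO-8859-1 fallback is why `res.text` looks garbled until `res.encoding` is set. The same effect can be reproduced offline with the standard library, by decoding utf-8 bytes with the wrong codec:

```python
raw = "百度一下，你就知道".encode("utf-8")  # bytes as they arrive over the network
print(raw.decode("iso-8859-1"))  # mojibake: the ISO-8859-1 fallback guess
print(raw.decode("utf-8"))       # → 百度一下，你就知道
```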
Crawling a site that checks headers (using Dianping as an example)
Adding header information:
import requests

url = "https://www.dianping.com"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}
res = requests.get(url, headers=header)
print(res.encoding)  # encoding follows the same Content-Type/charset rules as above
print(res.headers)
print(res.url)
print(res.status_code)
print(res.text)
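The `status_code` printed above can be interpreted with the stdlib `http.HTTPStatus`. Anti-bot sites like Dianping may answer 403 when the browser User-Agent is missing — a sketch of reading such a code, not a guarantee of this site's behaviour:

```python
from http import HTTPStatus

code = 403  # what an anti-bot site may return without a browser User-Agent
status = HTTPStatus(code)
print(status.phrase)       # → Forbidden
print(200 <= code < 300)   # → False: not a successful response
```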
Parsing content with BeautifulSoup4
Using the Sichuan Health Commission site as an example:
from bs4 import BeautifulSoup
import requests

url = 'http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml'
res = requests.get(url)
res.encoding = 'utf-8'
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # wrap the page in a BeautifulSoup object
soup.find('h2').text
a = soup.find('a')
print(a)
print(a.attrs)          # the tag's attributes as a dict
print(a.attrs['href'])
Running results
<a href="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml" target="_blank"><img alt="The latest situation of New Coronavirus pneumonia in Hunan Province (April ..." src="/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac/images/b1bc5f23725045d7940a854fbe2d70a9.jpg"/></a>
{'target': '_blank', 'href': '/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml'}
/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml
url_new = "http://wsjkw.sc.gov.cn" + a.attrs['href']  # join the site's base URL with the href to build the full detail-page URL
res = requests.get(url_new)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
contest = soup.find('p')  # the first <p> tag of the detail page
print(contest)
Running results (the first <p> of the detail page, omitted here)
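String concatenation works above because the href starts with "/". The stdlib `urllib.parse.urljoin` handles relative hrefs of every shape, so it is a safer way to assemble `url_new`:

```python
from urllib.parse import urljoin

base = "http://wsjkw.sc.gov.cn/scwsjkw/gzbd/fyzt.shtml"
href = "/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml"
print(urljoin(base, href))
# → http://wsjkw.sc.gov.cn/scwsjkw/gzbd01/2020/4/6/2d06e73d4ee14597bb375ece4b6f02ac.shtml
```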
Parsing content with re (regular expressions)
Python's built-in regular expression module.
re.search(regex, str):
1. Searches str for text matching the pattern; returns None if there is no match.
2. The result can be grouped; parentheses in the pattern capture separate pieces of data.
groups() returns all captured groups; group(index) returns one group's content.
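A self-contained sketch of search/group/groups on a made-up English sentence (the text is illustrative, not taken from the bulletin):

```python
import re

text = "Yesterday the province reported 5 newly confirmed cases and 2 recoveries"
m = re.search(r"(\d+) newly confirmed cases and (\d+) recoveries", text)
print(m.group(0))   # → 5 newly confirmed cases and 2 recoveries
print(m.group(1))   # → 5
print(m.group(2))   # → 2
print(m.groups())   # → ('5', '2')
```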
import re

text = contest.text
# print(text)
pattern = " newly added (\d+) Of the confirmed cases "  # matches the case-count phrase; the original page text is Chinese
res = re.search(pattern, text)
print(res)
Running results (a re.Match object if the phrase is found, otherwise None)
Supplement: regular expressions
Crawling Tencent data
Handling Tencent's historical-data list API: please use the latest interface address.
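The article does not give the interface address, so the sketch below shows only the parsing step on a made-up payload. Some historical-data endpoints (Tencent's among them, at the time) wrapped the real list in a JSON-encoded string, so `json.loads` is applied twice; the payload shape and field names here are illustrative assumptions:

```python
import json

# Made-up response shaped like a double-encoded history API payload
payload = '{"ret": 0, "data": "[{\\"date\\": \\"04.05\\", \\"confirm\\": 5}]"}'
outer = json.loads(payload)        # first pass: the envelope
inner = json.loads(outer["data"])  # second pass: "data" is itself a JSON string
print(inner[0]["confirm"])         # → 5
```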
Copyright notice
This article was written by [FOWng_ lp]; please keep the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204220549345999.html