Crawler Exercises (1)
2022-04-23 15:04:00 【Zhang Lifan】
Don't just follow the beaten path. Make your own trail.
This post is of special significance to me: it was published on my birthday, it is the start of my habit of taking notes online, and it deepened my love for web crawlers. I hope you can support my very first post!!! Better things are coming.
10. (Elective question 1) Target site: https://www.sogou.com/
Requirements:
1. The user enters a search term, a start page, and an end page
2. Crawl the source code of the matching result pages based on the user's input
3. Save the fetched data locally
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    # print(url)
    response = requests.get(url, headers=headers)
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
I. Analyzing the page
1. First, open the site and run a search, e.g. python - Sogou search (sogou.com):
https://www.sogou.com/web?query=python&_ast=1650447467&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=7606&sst0=1650447682406&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650447682406
2. Search for "Python" and "China" separately and compare the resulting URLs.


3. Split the URL at each "&" separator.
Conclusions: (1) The parameter query is the search term; Chinese characters are percent-encoded (no need to memorize the encoded form).
(2) The parameter ie is the character encoding, e.g. "utf-8". It appears under Network - Payload as ie=utf-8; for some requests it can also be seen under Network - Response as charset='utf-8', as shown:
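The percent-encoding of Chinese characters mentioned in conclusion (1) can be reproduced with the standard library, so there is really no need to memorize the escaped form. A small sketch (the query values here are arbitrary examples):

```python
from urllib.parse import quote, urlencode

# Chinese characters are percent-encoded as their UTF-8 bytes in the query string
print(quote("中国"))  # %E4%B8%AD%E5%9B%BD

# urlencode() builds a whole query string from a dict, escaping values as needed
print(urlencode({"query": "中国", "ie": "utf8"}))  # query=%E4%B8%AD%E5%9B%BD&ie=utf8
```

This is exactly the escaping the browser applies before the URL reaches the server.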

4. These two parameters alone do not cover the paging requirement of the exercise, so turn the page manually and inspect the URL again:
https://www.sogou.com/web?query=python&page=2&ie=utf8
5. Now the key step! We have found the pattern and the meaning of the important parameters, so we can build a general URL, as shown:
url = f'https://www.sogou.com/web?query={word}&page={n}'
# Fill the variable parameters with variables; each loop iteration then yields a new URL
II. Finding the parameters
https://www.sogou.com/web?query=Python&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=12736&sst0=1650428312860&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428312860
https://www.sogou.com/web?query=java&_ast=1650428313&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=10734&sst0=1650428363389&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428363389
https://www.sogou.com/web?query=C%E8%AF%AD%E8%A8%80&_ast=1650428364&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=11662&sst0=1650428406805&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428406805
https://www.sogou.com/web?
https://www.sogou.com/web?query=Python&
https://www.sogou.com/web?query=Python&page=2&ie=utf8
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
'cookie': "IPLOC=CN3600; SUID=191166B6364A910A00000000625F8708; SUV=1650427656976942; browerV=3; osV=1; ABTEST=0|1650428297|v17; SNUID=636A1DCD7B7EA775332A80CB7B347D43; sst0=663; ld=llllllllll2AgcuYlllllpJK6jclllllHuYSpyllllDlllllVllll5@@@@@@@@@@; LSTMV=229,37; LCLKINT=1424"
url = "https://www.sogou.com/web?query=Python&_ast=1650429998&_asf=www.sogou.com&w=01029901&cid=&s_from=result_up&sut=5547&sst0=1650430005573&lkt=0,0,0&sugsuv=1650427656976942&sugtime=1650430005573"
url = "https://www.sogou.com/web?query={}&page={}"
# The UA must be passed to headers as a dictionary
1. headers errors:
" ": " ",
# Build the dictionary in this format; never forget the ','
# headers is a keyword argument and must not be misspelled; a typo produces the errors below
import requests

url = "https://www.bxwxorg.com/"
hearders = {
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, hearders=hearders)
print(response.content.decode("UTF-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 141, in <module>
response = requests.get(url, hearders=hearders)
File "D:\python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "D:\python37\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'hearders'
# Reason: all three occurrences are spelled 'hearders' consistently, but the keyword argument is named headers, so requests raises a TypeError
# If the spellings are inconsistent (e.g. 'heades'), a different error is raised:
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
heades = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    # print(url)
    response = requests.get(url, headers=headers)
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 117, in <module>
response = requests.get(url, headers=headers)
NameError: name 'headers' is not defined
# Reason: the three spellings are inconsistent, so the name headers is never defined and a NameError is raised
# The correct spelling is below; try not to get it wrong!
import requests

url = "https://www.bxwxorg.com/"
headers = {
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, headers=headers)
print(response.content.decode("UTF-8"))
III. The loop
for n in range(start, end + 1):
# Why 'end + 1'? Because range() is left-closed and right-open, end itself would be excluded; adding 1 makes the loop include the real end page
Because we crawl page by page and every page needs its own URL, the URL construction must go inside the loop.
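The left-closed, right-open behaviour of range() is easy to verify. A quick sketch with example page numbers:

```python
start, end = 2, 5

# Without the +1, the end page would be skipped:
print(list(range(start, end)))      # [2, 3, 4]
# With the +1, the end page is included:
print(list(range(start, end + 1)))  # [2, 3, 4, 5]
```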
IV. Basics of the requests module (built on urllib3):
1. Installation: Win + R --> cmd --> type pip install requests
# If the module cannot be imported after installation, it was installed into Python's own environment while your project uses a virtual environment that lacks the package; you must point the project at the right interpreter
# To select the interpreter: run where python in the terminal to find the installation path --> File --> Settings --> Project --> Project Interpreter --> gear icon --> Add --> System Interpreter --> ... --> (the path reported by where python) --> OK --> Apply --> OK
2. The crawler's request modules: the urllib.request module (and urllib3)
(1) Common methods:
urllib.request.urlopen("url")
Effect: sends a request to the URL and obtains the response
byte stream = response.read().decode('utf-8')
urllib.request.Request("url", headers=dict)
# needed because urlopen() does not support setting a custom User-Agent
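A minimal sketch of the urllib.request pattern described above. The URL is just an example from this article, and the fetch is wrapped in a function because it needs network access:

```python
from urllib.request import Request, urlopen

def fetch(url: str) -> str:
    """Fetch a page as text; Request() lets us set a User-Agent, which urlopen() alone cannot."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as response:
        return response.read().decode("utf-8")

# The Request object carries the header (urllib capitalizes header names internally):
req = Request("https://www.sogou.com/web?query=python",
              headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))  # Mozilla/5.0
```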
(2) The requests module
(2-1) Common usage of requests:
requests.get(url)
(2-2) Methods and attributes of the response object:
response.text                     returns Unicode-formatted data (str)
response.content                  returns the byte stream (binary)
response.content.decode('utf-8')  decodes manually
response.url                      returns the URL
response.encoding = "encoding"    sets the encoding used by response.text
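The str/bytes distinction behind response.text and response.content can be illustrated without a network call. A sketch (the sample string is arbitrary):

```python
# response.content holds raw bytes; .decode("utf-8") turns them into a str,
# which is also what response.text returns (using response.encoding)
raw = "搜狗搜索".encode("utf-8")  # stands in for response.content
text = raw.decode("utf-8")       # stands in for response.content.decode("utf-8")
print(type(raw).__name__, type(text).__name__)  # bytes str
```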
(3) Sending a POST request with the requests module:
Cookie:
A cookie identifies the user through information stored on the client.
HTTP is a connectionless protocol: the interaction between client and server is limited to one request/response cycle, after which the connection is closed, and on the next request the server treats the client as new. To keep the two related, i.e. to let the server know a request comes from the same user as before, client information must be stored somewhere.
Session:
A session identifies the user through information recorded on the server; "session" here means a conversation.
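A sketch of how requests keeps cookies alive across requests with a Session object; the login URL and form fields below are hypothetical placeholders, not a real endpoint:

```python
import requests

# A Session stores the cookies from each response and sends them back
# automatically, so the server can recognise the same client across requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# A POST request carries the payload in the request body (hypothetical endpoint):
# response = session.post("https://example.com/login",
#                         data={"username": "...", "password": "..."})
# print(session.cookies)  # cookies set by the server accumulate here
```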


With this, you can fetch the page source code!!!
Don't worry: the best things come when you least expect them. So what we should do is try with hope and wait for the good to arrive. May your world always be full of sunshine, with no unnecessary sadness. May you reach the places you want to go on the road of life, and become your favorite self.
Today is a happy day, April 21st. It is not only my birthday but also my first CSDN post. As a newcomer, please give me your support; I will post a link in the afternoon. Please like, favorite and share!!!
2002/4/21
Happy birthday to Xiao Fanfan

Copyright notice
This article was created by [Zhang Lifan]. Please include the original link when reposting. Thanks!
https://yzsam.com/2022/04/202204231410193450.html