Web Crawler Exercises (1)
2022-04-23 15:04:00 【Zhang Lifan】
Don't just follow the beaten path; blaze your own trail.
This post has special significance for me: it was published on my birthday, marks the start of my online note-taking, and deepened my love of web crawlers. It is my first post, so please give it your support. Great things are ahead!
10. (Optional Question 1) Target site: https://www.sogou.com/
Requirements:
1. The user enters a search term, a start page, and an end page.
2. Crawl the source code of the matching result pages based on the user's input.
3. Save the retrieved data to local files.
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    # Build the URL for page n from the pattern found below
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    response = requests.get(url, headers=headers)
    # Save each result page's HTML source to a local file
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
I. Analyzing the Web Page
1. First, open the site:
https://www.sogou.com/web?query=python&_ast=1650447467&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=7606&sst0=1650447682406&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650447682406
2. Search for "Python" and for "China" separately, and compare the resulting URLs.
3. Split the URL at each "&" to isolate the parameters.
Conclusions: (1) The query parameter is the search term; Chinese characters are percent-escaped (no need to memorize the escaped form).
(2) The ie parameter is the character encoding, "utf-8" for example. You can confirm it under Network → Payload (ie=utf-8); for some responses it also appears under Network → Response as charset='utf-8', as shown in the figure:
4. These two parameters alone do not satisfy the page-turning requirement of the exercise, so turn the page manually and inspect the URL again:
https://www.sogou.com/web?query=python&page=2&ie=utf8
5. Finally, the critical moment! We have found the pattern and the meaning of the important parameters, so we can build a general URL, as shown:
url = f'https://www.sogou.com/web?query={word}&page={n}'
# Fill the variable parts with variables; each loop iteration then builds a new URL
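As a side note, the percent-escaping of Chinese characters mentioned in conclusion (1) can be reproduced with the standard library; a minimal sketch (the word "中国" is just an example):

```python
from urllib.parse import quote, unquote

# quote() performs the same percent-escaping the browser applies to
# Chinese characters in the query parameter (no need to memorize it)
word = "中国"
escaped = quote(word)
print(escaped)                        # %E4%B8%AD%E5%9B%BD
print(unquote("%E4%B8%AD%E5%9B%BD"))  # 中国

# Building the paginated URL from the pattern found in step 5
n = 2
url = f'https://www.sogou.com/web?query={escaped}&page={n}'
print(url)
```

(requests will also escape non-ASCII characters in a URL automatically, which is why the solution above works without explicit quoting.)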
II. Finding the Parameters
https://www.sogou.com/web?query=Python&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=12736&sst0=1650428312860&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428312860
https://www.sogou.com/web?query=java&_ast=1650428313&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=10734&sst0=1650428363389&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428363389
https://www.sogou.com/web?query=C%E8%AF%AD%E8%A8%80&_ast=1650428364&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=11662&sst0=1650428406805&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428406805
https://www.sogou.com/web?
https://www.sogou.com/web?query=Python&
https://www.sogou.com/web?query=Python&page=2&ie=utf8
Values captured from the browser's DevTools for reference:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
'Cookie': "IPLOC=CN3600; SUID=191166B6364A910A00000000625F8708; SUV=1650427656976942; browerV=3; osV=1; ABTEST=0|1650428297|v17; SNUID=636A1DCD7B7EA775332A80CB7B347D43; sst0=663; ld=llllllllll2AgcuYlllllpJK6jclllllHuYSpyllllDlllllVllll5@@@@@@@@@@; LSTMV=229,37; LCLKINT=1424"
Request URL: "https://www.sogou.com/web?query=Python&_ast=1650429998&_asf=www.sogou.com&w=01029901&cid=&s_from=result_up&sut=5547&sst0=1650430005573&lkt=0,0,0&sugsuv=1650427656976942&sugtime=1650430005573"
The general URL template then reduces to:
url = "https://www.sogou.com/web?query={}&page={}"
# The User-Agent must be passed to requests as a dictionary via the headers keyword
1. Errors involving headers:
    " ": " ",
# When building the dictionary, never forget the ',' between entries
# headers is a keyword argument and must be spelled exactly; misspelling it produces the errors shown below
import requests

url = "https://www.bxwxorg.com/"
hearders = {  # deliberately misspelled "hearders"
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, hearders=hearders)  # wrong keyword -> TypeError
print(response.content.decode("UTF-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 141, in <module>
response = requests.get(url, hearders=hearders)
File "D:\python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "D:\python37\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'hearders'
# Reason: all three occurrences are consistently spelled 'hearders', but requests expects the keyword 'headers', so a TypeError is raised
# Writing it as 'heades' instead produces a different kind of error:
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
heades = {  # misspelled here ...
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    response = requests.get(url, headers=headers)  # ... but spelled correctly here -> NameError
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 117, in <module>
response = requests.get(url, headers=headers)
NameError: name 'headers' is not defined
# Reason: the name is defined as 'heades' but used as 'headers'; the inconsistent spelling raises a NameError
# The correct version is below; try not to get it wrong!
import requests

url = "https://www.bxwxorg.com/"
headers = {
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, headers=headers)
print(response.content.decode("UTF-8"))
III. The Loop
for n in range(start, end + 1):
# Why end + 1? The range() function is left-closed and right-open, so end itself is excluded; adding 1 makes end the last page actually crawled
Because we crawl page by page and each page needs its own URL, the URL construction must be placed inside the loop.
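The half-open behavior of range() is easy to check directly; a quick sketch with example page numbers:

```python
# range() is left-closed, right-open: the stop value is excluded
start, end = 2, 4
print(list(range(start, end)))       # [2, 3]    -- end is missing
print(list(range(start, end + 1)))   # [2, 3, 4] -- end is included
```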
IV. A Basic Introduction to requests (built on top of the urllib3 library)
1. Installation: Win + R → cmd → enter pip install requests
# If the module cannot be imported after installation, it was likely installed into a different Python environment than the one your project uses; you need to point your project at the right interpreter
# Selecting the interpreter (the menu path below matches PyCharm): run `where python` in the terminal to find the installation path → File → Settings → Project → Project Interpreter → gear icon → Add → System Interpreter → browse to that python path → OK → Apply → OK
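A quick way to diagnose the wrong-environment problem described above is to print which interpreter is actually running your script; a minimal stdlib sketch:

```python
import sys

# The path of the interpreter currently running this script; pip must
# install requests into this same environment for `import requests` to work
print(sys.executable)
```

Running `python -m pip install requests` with that same interpreter guarantees the package lands in the right environment.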
2. The crawler's request module: urllib.request (from the standard-library urllib package)
(1) Common methods:
urllib.request.urlopen("url")
Purpose: send a request to the URL and obtain the response.
byte stream: html = response.read().decode('utf-8')
urllib.request.Request("url", headers=headers_dict)
# Needed because urlopen() itself does not support setting a custom User-Agent
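A minimal sketch of the two urllib.request calls above (the URL and UA string are just examples; no request is actually sent until urlopen is called):

```python
import urllib.request

url = "https://www.sogou.com/"
headers = {'User-Agent': 'Mozilla/5.0 (example UA)'}

# urlopen() alone cannot set a custom User-Agent, so wrap the URL
# and headers in a Request object first
req = urllib.request.Request(url, headers=headers)
print(req.get_header('User-agent'))   # urllib stores the key as 'User-agent'

# Actually sending the request would then be:
# response = urllib.request.urlopen(req)
# html = response.read().decode('utf-8')
```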
(2) The requests module
(2-1) Common usage of requests:
requests.get(url)
(2-2) Methods and attributes of the response object:
response.text                      # returns Unicode text (str)
response.content                   # returns the raw byte stream (binary)
response.content.decode('utf-8')   # decode manually
response.url                       # returns the URL
response.encoding = "encoding"     # sets the encoding used by response.text
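The difference between response.content and response.text can be sketched offline: content is raw bytes, and decoding them manually is exactly what the decode step above does (the sample string below is a hypothetical stand-in for server data):

```python
# Stand-in for response.content: raw UTF-8 bytes as a server would send them
raw = "搜狗搜索".encode("utf-8")

# Manual decoding, equivalent to response.content.decode('utf-8');
# response.text performs this step for you using response.encoding
text = raw.decode("utf-8")
print(text)   # 搜狗搜索
```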
(3) Sending a POST request through the requests module:
Cookie:
A cookie identifies the user through information recorded on the client.
HTTP is a stateless protocol: each client-server interaction is a single "request/response" exchange, after which the connection is dropped, and on the next request the server would treat the client as brand new. To maintain continuity and let the server recognize that a request comes from a previous user, the client's information must be saved somewhere.
Session:
A session identifies the user through information recorded on the server; "session" here means one conversation.
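The persistence that cookies and sessions provide is what requests.Session wraps up on the client side: it keeps headers and cookies across requests so the server can recognize a returning client. A minimal offline sketch (no request is sent; the cookie name/value and UA string are made up):

```python
import requests

session = requests.Session()
# Headers set on the session apply to every request it sends
session.headers.update({'User-Agent': 'Mozilla/5.0 (example UA)'})

# Cookies set on the session persist across requests, which is how
# the server keeps recognizing the same client
session.cookies.set('demo_cookie', 'hello')   # hypothetical cookie

print(session.headers['User-Agent'])        # Mozilla/5.0 (example UA)
print(session.cookies.get('demo_cookie'))   # hello

# A real crawl would then call, for example:
# response = session.get('https://www.sogou.com/')
# response = session.post('https://www.sogou.com/web', data={'query': 'python'})
```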
With that, you can retrieve the page source code!
Don't worry: the best things arrive when you least expect them. So what we should do is try with hope and wait for the good to come. May your world always be full of sunshine, with no unnecessary sadness. May you reach the places you want on the road of life and become the self you love best.
Today, April 21, is a happy day: it is not only my birthday but also my first CSDN post. As a newcomer, please give me your support; I will post a link this afternoon, and likes, favorites, and shares are all appreciated!
2002/4/21
Happy birthday to Xiao Fanfan
Copyright notice
This article was written by [Zhang Lifan]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231410193450.html