当前位置:网站首页>Reptile exercises (1)
Reptile exercises (1)
2022-04-23 15:04:00 【Zhang Lifan】
Don't just follow the path .Make your own trail .
Don't just walk along the road , Go your own way .
This press release is of great commemorative significance , Published on birthday , Started my career of online note taking , And deepened the infinite love for reptiles , I hope you can give me support !!! Please support me for the first time !!! It will be wonderful in the future .
10.( Choose a question 1) The target site https://www.sogou.com/
requirement :
1. The user enters what to search , Start page and end page
2. Crawl the source code of relevant pages according to the content entered by the user
3. Save the acquired data to the local
import requests
word = input(" Please enter the search content ")
start = int(input(" Please enter the start page "))
end = int(input(" Please enter the end page "))
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
url = f'https://www.sogou.com/web?query={word}&page={n}'
# print(url)
response = requests.get(url, headers=headers)
with open(f'{word} Of the {n} page .html', "w", encoding="utf-8")as file:
file.write(response.content.decode("utf-8"))
One 、 Analyze the web
1. Enter the website first
python - Sogou search (sogou.com)
https://www.sogou.com/web?query=python&_ast=1650447467&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=7606&sst0=1650447682406&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650447682406 2. Search separately “Python”,“ China ” And compare the websites .


3. Put each parameter in "&" Separate the symbols
Conclusion :(1) Parameters query Is the search object , Chinese characters will be escaped ( Code after escape without memory )
(2) Parameters ie Is the escape encoding type , With "utf-8" For example , This parameter can be used in Network- Payload-ie=utf-8, Some bags can also be in Network-Response-charset='utf-8', Pictured :

4. However, these two parameters can not meet the page turning and crawling requirements of the test questions , So we have to turn the page manually and then check 
https://www.sogou.com/web?query=python&page=2&ie=utf8
5. Finally, it's the critical moment ! We found the law , And the meaning of important parameters , You can build a common URL 了 , Pictured :
url = f'https://www.sogou.com/web?query={word}&page={n}'
# Assign variable parameters with variables , In this way, a new URL
Two 、 Look for parameters
https://www.sogou.com/web?query=Python&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=12736&sst0=1650428312860&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428312860
https://www.sogou.com/web?query=java&_ast=1650428313&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=10734&sst0=1650428363389&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428363389
https://www.sogou.com/web?query=C%E8%AF%AD%E8%A8%80&_ast=1650428364&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=11662&sst0=1650428406805&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428406805
https://www.sogou.com/web?
https://www.sogou.com/web?query=Python&
https://www.sogou.com/web?query=Python&page=2&ie=utf8
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
'cookie' = "IPLOC=CN3600; SUID=191166B6364A910A00000000625F8708; SUV=1650427656976942; browerV=3; osV=1; ABTEST=0|1650428297|v17; SNUID=636A1DCD7B7EA775332A80CB7B347D43; sst0=663; ld=llllllllll2AgcuYlllllpJK6jclllllHuYSpyllllDlllllVllll5@@@@@@@@@@; LSTMV=229,37; LCLKINT=1424"
'URl' = "https://www.sogou.com/web?query=Python&_ast=1650429998&_asf=www.sogou.com&w=01029901&cid=&s_from=result_up&sut=5547&sst0=1650430005573&lkt=0,0,0&sugsuv=1650427656976942&sugtime=1650430005573"
url="https://www.sogou.com/web?query={}&page={}:
# UA To be in dictionary form headers receive
1.headers Error of :
" ":" ",
# Build the format of the dictionary ,',' Never forget
# headers It's a keyword. You can't write it wrong , If you make a mistake, you will have the following error reports
import requests
url = "https://www.bxwxorg.com/"
hearders = {
'cookie':'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, hearders=hearders)
print(response.content.decode("UTF-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 141, in <module>
response = requests.get(url, hearders=hearders)
File "D:\python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "D:\python37\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'hearders'
# reason : Three hearders Write consistently , however headers Is the key word , So the report type is wrong
# But it's written heades There will be another form of error reporting
import requests
word = input(" Please enter the search content ")
start = int(input(" Please enter the start page "))
end = int(input(" Please enter the end page "))
heades = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
url = f'https://www.sogou.com/web?query={word}&page={n}'
# print(url)
response = requests.get(url, headers=headers)
with open(f'{word} Of the {n} page .html', "w", encoding="utf-8")as file:
file.write(response.content.decode("utf-8"))
Traceback (most recent call last):
File "D:/pythonproject/ The second assignment .py", line 117, in <module>
response = requests.get(url, headers=headers)
NameError: name 'headers' is not defined
# reason : Three hearders Inconsistent writing , So the registration is wrong
# The correct way of writing is , You'd better not make a mistake !
import requests
url = "https://www.bxwxorg.com/"
headers = {
'cookie':'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, headers=headers)
print(response.content.decode("UTF-8"))
3、 ... and 、 loop
for n in range(start, end + 1):
# Why 'end+1' Well : because range The range of a function is left closed and right open, so it's true end It's not worth it , Add at the back 1 You can get the previous real value
Because it's crawling through the pages , Need to use every page URL, So we should build URL Call into the loop
Four 、requests Basic introduction of :( Jianghu people "urllib3" )
1. install :Win + R --> cmd --> Input pip install requests
# If it cannot be called after downloading, it is because : The module is installed in Python In its own environment , The virtual environment you use doesn't have this Databases , To specify the environment
# Specify the environment : The input terminal where python Find the installation path --> File --> setting --> project --> project interpreter --> Find the settings Icon --> Add --> system interpreter --> ... --> (where python In the path ) --> OK --> Apply application --> OK
2. The request module of the crawler urllib.request modular (urllib3)
(1) Common methods :
urllib.request.urlopen(" website ")
effect : Make a request to the website , And get the corresponding
Byte stream = response.read().decode('utf-8')
urllib.request request(" website ", headers=" Dictionaries ")
# because urlopen() Refactoring is not supported User-Agent
(2)requests modular
(2-1)request Common usage of :
requsts.get( website )
(2-2) The object of the response response Methods :
response.text return Unicode Formatted data (str)
response.content Return byte stream data ( Binary system )
response.conten.decode('utf-8') Decode manually
response.url return url
response.encode() = " code "
(3). adopt requsts Module to send POST request :
cookie:
cookie The user identity is determined by the information recorded on the client
HTTP Is a kind of connection protocol. The interaction between client and server is limited to “ request / Respond to ” Disconnect when finished , Next time, please Time finding , The server will think of it as a new client , In order to maintain the connection between them , Let the server know that this is Request initiated by the previous user , Client information must be saved in one place
session;
session It is to determine the user identity through the information recorded on the server , there session It means a conversation


In this way, you can get the page source code !!!
Don't worry , The best things come when you least expect them to . So what we're going to do is : Try with hope , Wait for the good to come . May your world always be full of sunshine , There will be no unnecessary sadness . May you be on the road of life , Reach the place you want , With you , It's your favorite look .
Today is a happy day ,4 month 21 Number , It's not just my birthday , It's also my first time CSDN Draft , As a cute new , Please give me your support , In the afternoon, I will post a link , Please join us for three times !!!
2002/4/21
Happy birthday to Xiao Fanfan

版权声明
本文为[Zhang Lifan]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204231410193450.html
边栏推荐
- The difference between having and where in SQL
- Leetcode151 - invert words in string - String - simulation
- OC 转 Swift 条件编译、标记、宏、 Log、 版本检测、过期提示
- thinkphp5+数据大屏展示效果
- Leetcode162 - find peak - dichotomy - array
- Swift - literal, literal protocol, conversion between basic data types and dictionary / array
- OC to swift conditional compilation, marking, macro, log, version detection, expiration prompt
- Is asemi ultrafast recovery diode interchangeable with Schottky diode
- Mds55-16-asemi rectifier module mds55-16
- LeetCode165-比较版本号-双指针-字符串
猜你喜欢

Leetcode167 - sum of two numbers II - double pointer - bisection - array - Search

Five data types of redis

1 - first knowledge of go language

do(Local scope)、初始化器、内存冲突、Swift指针、inout、unsafepointer、unsafeBitCast、successor、

分享 20 个不容错过的 ES6 的技巧

The art of automation

我的 Raspberry Pi Zero 2W 折腾笔记,记录一些遇到的问题和解决办法

Programming philosophy - automatic loading, dependency injection and control inversion

大文件如何快速上传?

UML project example -- UML diagram description of tiktok
随机推荐
[thymeleaf] handle null values and use safe operators
I/O复用的高级应用:同时处理 TCP 和 UDP 服务
解决computed属性与input的blur事件冲突问题
Don't you know the usage scenario of the responsibility chain model?
Advanced version of array simulation queue - ring queue (real queuing)
1-初识Go语言
大文件如何快速上传?
Mds55-16-asemi rectifier module mds55-16
Epolloneshot event of epoll -- instance program
Set up an AI team in the game world and start the super parametric multi-agent "chaos fight"
JUC学习记录(2022.4.22)
js——實現點擊複制功能
LeetCode165-比较版本号-双指针-字符串
January 1, 1990 is Monday. Define the function date_ to_ Week (year, month, day), which realizes the function of returning the day of the week after inputting the year, month and day, such as date_ to
Introduction to distributed transaction Seata
Practice of unified storage technology of oppo data Lake
Detailed comparison between asemi three-phase rectifier bridge and single-phase rectifier bridge
like和regexp差别
Share 20 tips for ES6 that should not be missed
What is the main purpose of PCIe X1 slot?