当前位置:网站首页>Basic usage of crawler requests
Basic usage of crawler requests
2022-04-23 17:58:00 【Feng Ye 520】
Preface
We used urllib library , This is a good tool for getting started , To understand some basic concepts of reptiles , It is helpful to master the process of crawling . After introduction , We need to learn more advanced content and tools to facilitate our crawling . In this section, let's briefly introduce requests Basic usage of the library .
notes :Python The version is still based on 2.7
Official documents
Most of the following is from official documents , This paper makes some modifications and summaries . To learn more, you can refer to
install
utilize pip install
$ pip install requests
1 |
$ pip install requests |
Or make use of easy_install
$ easy_install requests
1 |
$ easy_install requests |
The installation can be completed by the above two methods .
introduce
First of all, let's introduce a small example to experience
Python
import requests r = requests.get('http://cuiqingcai.com') print type(r) print r.status_code print r.encoding #print r.text print r.cookies
1 2 3 4 5 6 7 8 |
import requests
r = requests.get('http://cuiqingcai.com') print type(r) print r.status_code print r.encoding #print r.text print r.cookies |
The above code we requested the website address , Then print out the type of return result , Status code , Encoding mode ,Cookies The content such as .
The operation results are as follows
<class 'requests.models.Response'> 200 UTF-8 <RequestsCookieJar[]>
1 2 3 4 |
<class 'requests.models.Response'> 200 UTF-8 <RequestsCookieJar[]> |
How to , Is it convenient . Don't worry. , It's more convenient in the back .
Basic request
requests The library provides http All the basic request methods . for example
r = requests.post("http://httpbin.org/post") r = requests.put("http://httpbin.org/put") r = requests.delete("http://httpbin.org/delete") r = requests.head("http://httpbin.org/get") r = requests.options("http://httpbin.org/get")
1 2 3 4 5 |
r = requests.post("http://httpbin.org/post") r = requests.put("http://httpbin.org/put") r = requests.delete("http://httpbin.org/delete") r = requests.head("http://httpbin.org/get") r = requests.options("http://httpbin.org/get") |
Um. , In a word .
basic GET request
The most basic GET The request can be used directly get Method
r = requests.get("http://httpbin.org/get")
1 |
r = requests.get("http://httpbin.org/get") |
If you want to add parameters , You can use params Parameters
import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.get("http://httpbin.org/get", params=payload) print r.url
1 2 3 4 5 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} r = requests.get("http://httpbin.org/get", params=payload) print r.url |
Running results
http://httpbin.org/get?key2=value2&key1=value1
1 |
http://httpbin.org/get?key2=value2&key1=value1 |
If you want to ask JSON file , You can use json() Method resolution
For example, write a JSON The file is named a.json, The contents are as follows
["foo", "bar", { "foo": "bar" }]
1 2 3 |
["foo", "bar", { "foo": "bar" }] |
Use the following program to request and parse
Python
import requests r = requests.get("a.json") print r.text print r.json()
1 2 3 4 5 |
import requests
r = requests.get("a.json") print r.text print r.json() |
The operation results are as follows , One is to output content directly , Another way is to use json() Method resolution , Feel the difference between them
["foo", "bar", { "foo": "bar" }] [u'foo', u'bar', {u'foo': u'bar'}]
1 2 3 4 |
["foo", "bar", { "foo": "bar" }] [u'foo', u'bar', {u'foo': u'bar'}] |
If you want to get the raw socket response from the server , You can get r.raw . However, it needs to be set in the initial request stream=True .
r = requests.get('https://github.com/timeline.json', stream=True) r.raw <requests.packages.urllib3.response.HTTPResponse object at 0x101194810> r.raw.read(10) '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
1 2 3 4 5 |
r = requests.get('https://github.com/timeline.json', stream=True) r.raw <requests.packages.urllib3.response.HTTPResponse object at 0x101194810> r.raw.read(10) '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03' |
In this way, the original socket content of the web page is obtained .
If you want to add headers, It can be transmitted headers Parameters
Python
import requests payload = {'key1': 'value1', 'key2': 'value2'} headers = {'content-type': 'application/json'} r = requests.get("http://httpbin.org/get", params=payload, headers=headers) print r.url
1 2 3 4 5 6 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} headers = {'content-type': 'application/json'} r = requests.get("http://httpbin.org/get", params=payload, headers=headers) print r.url |
adopt headers Parameter can add... In the request header headers Information
basic POST request
about POST The request for , We usually need to add some parameters to it . Then the most basic method of parameter transfer can be used data This parameter .
import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.post("http://httpbin.org/post", data=payload) print r.text
1 2 3 4 5 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} r = requests.post("http://httpbin.org/post", data=payload) print r.text |
Running results
{ "args": {}, "data": "", "files": {}, "form": { "key1": "value1", "key2": "value2" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
{ "args": {}, "data": "", "files": {}, "form": { "key1": "value1", "key2": "value2" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" } |
You can see that the parameters are passed successfully , Then the server returns the data we sent .
Sometimes the information we need to send is not in form , We need to pass JSON Format data in the past , So we can use json.dumps() Method to serialize form data .
import json import requests url = 'http://httpbin.org/post' payload = {'some': 'data'} r = requests.post(url, data=json.dumps(payload)) print r.text
1 2 3 4 5 6 7 |
import json import requests
url = 'http://httpbin.org/post' payload = {'some': 'data'} r = requests.post(url, data=json.dumps(payload)) print r.text |
Running results
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" } |
By the above methods , We can POST JSON Formatted data
If you want to upload a file , Then use it directly file Parameters can be
Create a new one a.txt The file of , It says Hello World!
import requests url = 'http://httpbin.org/post' files = {'file': open('test.txt', 'rb')} r = requests.post(url, files=files) print r.text
1 2 3 4 5 6 |
import requests
url = 'http://httpbin.org/post' files = {'file': open('test.txt', 'rb')} r = requests.post(url, files=files) print r.text |
You can see that the running results are as follows
{ "args": {}, "data": "", "files": { "file": "Hello World!" }, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "156", "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
{ "args": {}, "data": "", "files": { "file": "Hello World!" }, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "156", "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" } |
In this way, we successfully completed the upload of a file .
requests It supports streaming upload , This allows you to send large streams or files without first reading them into memory . To use streaming upload , Just provide a class file object for your request body
with open('massive-body') as f: requests.post('http://some.url/streamed', data=f)
1 2 |
with open('massive-body') as f: requests.post('http://some.url/streamed', data=f) |
This is a very practical and convenient function .
Cookies
If a response contains cookie, Then we can use cookies Variables to get
import requests url = 'http://example.com' r = requests.get(url) print r.cookies print r.cookies['example_cookie_name']
1 2 3 4 5 6 |
import requests
url = 'http://example.com' r = requests.get(url) print r.cookies print r.cookies['example_cookie_name'] |
The above procedure is just an example , It can be used cookies Variable to get the cookies
In addition, it can be used cookies Variable to send... To the server cookies Information
import requests url = 'http://httpbin.org/cookies' cookies = dict(cookies_are='working') r = requests.get(url, cookies=cookies) print r.text
1 2 3 4 5 6 |
import requests
url = 'http://httpbin.org/cookies' cookies = dict(cookies_are='working') r = requests.get(url, cookies=cookies) print r.text |
Running results
'{"cookies": {"cookies_are": "working"}}'
1 |
'{"cookies": {"cookies_are": "working"}}' |
Yes, it has been successfully sent to the server cookies
The timeout configuration
You can use timeout Variable to configure the maximum request time
requests.get('http://github.com', timeout=0.001)
1 |
requests.get('http://github.com', timeout=0.001) |
notes :timeout Only valid for connection process , Nothing to do with the download of the response body .
in other words , This time is limited to the requested time . Even if it comes back response It contains a lot of content , It takes time to download , But it doesn't work .
Conversation object
In the above request , Every request is actually a new request . That is to say, each of our requests is opened by a different browser . That is, it is not a conversation , Even if the same URL is requested . such as
import requests requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = requests.get("http://httpbin.org/cookies") print(r.text)
1 2 3 4 5 |
import requests
requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = requests.get("http://httpbin.org/cookies") print(r.text) |
The result is
{ "cookies": {} }
1 2 3 |
{ "cookies": {} } |
Obviously , It's not in a conversation , Can't get cookies, So in some sites , What should we do to keep a lasting conversation ? It's like browsing Taobao with a browser , Jump between different tabs , In fact, this is to establish a long-term conversation .
The solution is as follows
import requests s = requests.Session() s.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = s.get("http://httpbin.org/cookies") print(r.text)
1 2 3 4 5 6 |
import requests
s = requests.Session() s.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = s.get("http://httpbin.org/cookies") print(r.text) |
Here we have asked twice , One is setting up cookies, One is to get cookies
Running results
{ "cookies": { "sessioncookie": "123456789" } }
1 2 3 4 5 |
{ "cookies": { "sessioncookie": "123456789" } } |
Discovery can succeed in obtaining cookies 了 , That's how establishing a conversation works . Give it a try .
So since the session is a global variable , So we can definitely use it for global configuration .
import requests s = requests.Session() s.headers.update({'x-test': 'true'}) r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) print r.text
1 2 3 4 5 6 |
import requests
s = requests.Session() s.headers.update({'x-test': 'true'}) r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) print r.text |
adopt s.headers.update Method set headers The variable of . Then we set up a... In the request headers, So what happens ?
It's simple , Both variables are passed .
Running results
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true", "X-Test2": "true" } }
1 2 3 4 5 6 7 8 9 10 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true", "X-Test2": "true" } } |
If get Method of transmission headers also x-test Well ?
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})
1 |
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'}) |
Um. , It will override the global configuration
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true" } }
1 2 3 4 5 6 7 8 9 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true" } } |
So if you don't want a variable in the global configuration ? It's simple , Set to None that will do
r = s.get('http://httpbin.org/headers', headers={'x-test': None})
1 |
r = s.get('http://httpbin.org/headers', headers={'x-test': None}) |
Running results
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" } }
1 2 3 4 5 6 7 8 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" } } |
Um. , That's all session Basic usage of conversation
SSL Certificate validation
Now it's everywhere https The first website ,Requests It can be for HTTPS Request validation SSL certificate , It's like web Browser is the same . To check the... Of a host SSL certificate , You can use verify Parameters
Now? 12306 The certificate is invalid , Let's test it
import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=True) print r.text
1 2 3 4 |
import requests
r = requests.get('https://kyfw.12306.cn/otn/', verify=True) print r.text |
result
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
1 |
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590) |
That's the case
Let's try github Of
import requests r = requests.get('https://github.com', verify=True) print r.text
1 2 3 4 |
import requests
r = requests.get('https://github.com', verify=True) print r.text |
Um. , Normal request , I will not output the content .
If we want to skip just now 12306 Certificate validation of , hold verify Set to False that will do
import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=False) print r.text
1 2 3 4 |
import requests
r = requests.get('https://kyfw.12306.cn/otn/', verify=False) print r.text |
Find out and ask for it . By default verify yes True, So if you need to , You need to manually set this variable .
agent
If you need to use a proxy , You can do this by providing proxies Parameter to configure a single request
import requests proxies = { "https": "http://41.118.132.69:4433" } r = requests.post("http://httpbin.org/post", proxies=proxies) print r.text
1 2 3 4 5 6 7 |
import requests
proxies = { "https": "http://41.118.132.69:4433" } r = requests.post("http://httpbin.org/post", proxies=proxies) print r.text |
You can also use environment variables HTTP_PROXY and HTTPS_PROXY To configure the agent
export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080"
1 2 |
export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080" |
Through the above methods , You can easily set up agents .
API
The above explains requests The most commonly used parameter in , If you need to use more , Please refer to the official documentation API
Conclusion
The above is a summary of requests The basic usage of , If you have a foundation for reptiles , So it's sure to get started soon , I won't repeat it here .
Practice is the king , Put your money into practice as soon as possible .
original text :https://cuiqingcai.com/2556.html
版权声明
本文为[Feng Ye 520]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204230546152033.html
边栏推荐
- 2022 judgment questions and answers for operation of refrigeration and air conditioning equipment
- 587. Install fence / Sword finger offer II 014 Anagrams in strings
- 102. Sequence traversal of binary tree
- 2022年流动式起重机司机国家题库模拟考试平台操作
- On the problem of V-IF display and hiding
- 油猴网站地址
- Land cover / use data product download
- QTableWidget使用讲解
- Random number generation of C #
- Vite configure proxy proxy to solve cross domain
猜你喜欢
2022 Jiangxi Photovoltaic Exhibition, China distributed Photovoltaic Exhibition, Nanchang solar energy utilization Exhibition
JS get link? The following parameter name or value, according to the URL? Judge the parameters after
C1 notes [task training chapter I]
k8s之实现redis一主多从动态扩缩容
云原生虚拟化:基于 Kubevirt 构建边缘计算实例
Nat commun | current progress and open challenges of applied deep learning in Bioscience
Land cover / use data product download
Examination question bank and online simulation examination of the third batch (main person in charge) of special operation certificate of safety officer a certificate in Guangdong Province in 2022
Uniapp custom search box adaptation applet alignment capsule
Hcip fifth experiment
随机推荐
Yolov4 pruning [with code]
The ultimate experience, the audio and video technology behind the tiktok
SystemVerilog (VI) - variable
Write a regular
Add drag and drop function to El dialog
I/O多路复用及其相关详解
Using files to save data (C language)
Cross domain settings of Chrome browser -- including new and old versions
2022 judgment questions and answers for operation of refrigeration and air conditioning equipment
Go语言JSON包使用
ES6 new method
2022 Jiangxi Photovoltaic Exhibition, China Distributed Photovoltaic Exhibition, Nanchang Solar Energy Utilization Exhibition
2022年上海市安全员C证操作证考试题库及模拟考试
I / O multiplexing and its related details
Commonly used functions -- spineros:: and spineros::)
2022年流动式起重机司机国家题库模拟考试平台操作
MySQL进阶之索引【分类,性能分析,使用,设计原则】
C1 notes [task training chapter I]
An example of linear regression based on tensorflow
C network related operations