当前位置:网站首页>Basic usage of crawler requests
Basic usage of crawler requests
2022-04-23 17:58:00 【Feng Ye 520】
Preface
We used urllib library , This is a good tool for getting started , To understand some basic concepts of reptiles , It is helpful to master the process of crawling . After introduction , We need to learn more advanced content and tools to facilitate our crawling . In this section, let's briefly introduce requests Basic usage of the library .
notes :Python The version is still based on 2.7
Official documents
Most of the following is from official documents , This paper makes some modifications and summaries . To learn more, you can refer to
install
utilize pip install
$ pip install requests
1 |
$ pip install requests |
Or make use of easy_install
$ easy_install requests
1 |
$ easy_install requests |
The installation can be completed by the above two methods .
introduce
First of all, let's introduce a small example to experience
Python
import requests r = requests.get('http://cuiqingcai.com') print type(r) print r.status_code print r.encoding #print r.text print r.cookies
1 2 3 4 5 6 7 8 |
import requests
r = requests.get('http://cuiqingcai.com') print type(r) print r.status_code print r.encoding #print r.text print r.cookies |
The above code we requested the website address , Then print out the type of return result , Status code , Encoding mode ,Cookies The content such as .
The operation results are as follows
<class 'requests.models.Response'> 200 UTF-8 <RequestsCookieJar[]>
1 2 3 4 |
<class 'requests.models.Response'> 200 UTF-8 <RequestsCookieJar[]> |
How to , Is it convenient . Don't worry. , It's more convenient in the back .
Basic request
requests The library provides http All the basic request methods . for example
r = requests.post("http://httpbin.org/post") r = requests.put("http://httpbin.org/put") r = requests.delete("http://httpbin.org/delete") r = requests.head("http://httpbin.org/get") r = requests.options("http://httpbin.org/get")
1 2 3 4 5 |
r = requests.post("http://httpbin.org/post") r = requests.put("http://httpbin.org/put") r = requests.delete("http://httpbin.org/delete") r = requests.head("http://httpbin.org/get") r = requests.options("http://httpbin.org/get") |
Um. , In a word .
basic GET request
The most basic GET The request can be used directly get Method
r = requests.get("http://httpbin.org/get")
1 |
r = requests.get("http://httpbin.org/get") |
If you want to add parameters , You can use params Parameters
import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.get("http://httpbin.org/get", params=payload) print r.url
1 2 3 4 5 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} r = requests.get("http://httpbin.org/get", params=payload) print r.url |
Running results
http://httpbin.org/get?key2=value2&key1=value1
1 |
http://httpbin.org/get?key2=value2&key1=value1 |
If you want to ask JSON file , You can use json() Method resolution
For example, write a JSON The file is named a.json, The contents are as follows
["foo", "bar", { "foo": "bar" }]
1 2 3 |
["foo", "bar", { "foo": "bar" }] |
Use the following program to request and parse
Python
import requests r = requests.get("a.json") print r.text print r.json()
1 2 3 4 5 |
import requests
r = requests.get("a.json") print r.text print r.json() |
The operation results are as follows , One is to output content directly , Another way is to use json() Method resolution , Feel the difference between them
["foo", "bar", { "foo": "bar" }] [u'foo', u'bar', {u'foo': u'bar'}]
1 2 3 4 |
["foo", "bar", { "foo": "bar" }] [u'foo', u'bar', {u'foo': u'bar'}] |
If you want to get the raw socket response from the server , You can get r.raw . However, it needs to be set in the initial request stream=True .
r = requests.get('https://github.com/timeline.json', stream=True) r.raw <requests.packages.urllib3.response.HTTPResponse object at 0x101194810> r.raw.read(10) '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
1 2 3 4 5 |
r = requests.get('https://github.com/timeline.json', stream=True) r.raw <requests.packages.urllib3.response.HTTPResponse object at 0x101194810> r.raw.read(10) '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03' |
In this way, the original socket content of the web page is obtained .
If you want to add headers, It can be transmitted headers Parameters
Python
import requests payload = {'key1': 'value1', 'key2': 'value2'} headers = {'content-type': 'application/json'} r = requests.get("http://httpbin.org/get", params=payload, headers=headers) print r.url
1 2 3 4 5 6 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} headers = {'content-type': 'application/json'} r = requests.get("http://httpbin.org/get", params=payload, headers=headers) print r.url |
adopt headers Parameter can add... In the request header headers Information
basic POST request
about POST The request for , We usually need to add some parameters to it . Then the most basic method of parameter transfer can be used data This parameter .
import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.post("http://httpbin.org/post", data=payload) print r.text
1 2 3 4 5 |
import requests
payload = {'key1': 'value1', 'key2': 'value2'} r = requests.post("http://httpbin.org/post", data=payload) print r.text |
Running results
{ "args": {}, "data": "", "files": {}, "form": { "key1": "value1", "key2": "value2" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
{ "args": {}, "data": "", "files": {}, "form": { "key1": "value1", "key2": "value2" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" } |
You can see that the parameters are passed successfully , Then the server returns the data we sent .
Sometimes the information we need to send is not in form , We need to pass JSON Format data in the past , So we can use json.dumps() Method to serialize form data .
import json import requests url = 'http://httpbin.org/post' payload = {'some': 'data'} r = requests.post(url, data=json.dumps(payload)) print r.text
1 2 3 4 5 6 7 |
import json import requests
url = 'http://httpbin.org/post' payload = {'some': 'data'} r = requests.post(url, data=json.dumps(payload)) print r.text |
Running results
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" } |
By the above methods , We can POST JSON Formatted data
If you want to upload a file , Then use it directly file Parameters can be
Create a new one a.txt The file of , It says Hello World!
import requests url = 'http://httpbin.org/post' files = {'file': open('test.txt', 'rb')} r = requests.post(url, files=files) print r.text
1 2 3 4 5 6 |
import requests
url = 'http://httpbin.org/post' files = {'file': open('test.txt', 'rb')} r = requests.post(url, files=files) print r.text |
You can see that the running results are as follows
{ "args": {}, "data": "", "files": { "file": "Hello World!" }, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "156", "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
{ "args": {}, "data": "", "files": { "file": "Hello World!" }, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "156", "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" } |
In this way, we successfully completed the upload of a file .
requests It supports streaming upload , This allows you to send large streams or files without first reading them into memory . To use streaming upload , Just provide a class file object for your request body
with open('massive-body') as f: requests.post('http://some.url/streamed', data=f)
1 2 |
with open('massive-body') as f: requests.post('http://some.url/streamed', data=f) |
This is a very practical and convenient function .
Cookies
If a response contains cookie, Then we can use cookies Variables to get
import requests url = 'http://example.com' r = requests.get(url) print r.cookies print r.cookies['example_cookie_name']
1 2 3 4 5 6 |
import requests
url = 'http://example.com' r = requests.get(url) print r.cookies print r.cookies['example_cookie_name'] |
The above procedure is just an example , It can be used cookies Variable to get the cookies
In addition, it can be used cookies Variable to send... To the server cookies Information
import requests url = 'http://httpbin.org/cookies' cookies = dict(cookies_are='working') r = requests.get(url, cookies=cookies) print r.text
1 2 3 4 5 6 |
import requests
url = 'http://httpbin.org/cookies' cookies = dict(cookies_are='working') r = requests.get(url, cookies=cookies) print r.text |
Running results
'{"cookies": {"cookies_are": "working"}}'
1 |
'{"cookies": {"cookies_are": "working"}}' |
Yes, it has been successfully sent to the server cookies
The timeout configuration
You can use timeout Variable to configure the maximum request time
requests.get('http://github.com', timeout=0.001)
1 |
requests.get('http://github.com', timeout=0.001) |
notes :timeout Only valid for connection process , Nothing to do with the download of the response body .
in other words , This time is limited to the requested time . Even if it comes back response It contains a lot of content , It takes time to download , But it doesn't work .
Conversation object
In the above request , Every request is actually a new request . That is to say, each of our requests is opened by a different browser . That is, it is not a conversation , Even if the same URL is requested . such as
import requests requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = requests.get("http://httpbin.org/cookies") print(r.text)
1 2 3 4 5 |
import requests
requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = requests.get("http://httpbin.org/cookies") print(r.text) |
The result is
{ "cookies": {} }
1 2 3 |
{ "cookies": {} } |
Obviously , It's not in a conversation , Can't get cookies, So in some sites , What should we do to keep a lasting conversation ? It's like browsing Taobao with a browser , Jump between different tabs , In fact, this is to establish a long-term conversation .
The solution is as follows
import requests s = requests.Session() s.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = s.get("http://httpbin.org/cookies") print(r.text)
1 2 3 4 5 6 |
import requests
s = requests.Session() s.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = s.get("http://httpbin.org/cookies") print(r.text) |
Here we have asked twice , One is setting up cookies, One is to get cookies
Running results
{ "cookies": { "sessioncookie": "123456789" } }
1 2 3 4 5 |
{ "cookies": { "sessioncookie": "123456789" } } |
Discovery can succeed in obtaining cookies 了 , That's how establishing a conversation works . Give it a try .
So since the session is a global variable , So we can definitely use it for global configuration .
import requests s = requests.Session() s.headers.update({'x-test': 'true'}) r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) print r.text
1 2 3 4 5 6 |
import requests
s = requests.Session() s.headers.update({'x-test': 'true'}) r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) print r.text |
adopt s.headers.update Method set headers The variable of . Then we set up a... In the request headers, So what happens ?
It's simple , Both variables are passed .
Running results
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true", "X-Test2": "true" } }
1 2 3 4 5 6 7 8 9 10 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true", "X-Test2": "true" } } |
If get Method of transmission headers also x-test Well ?
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})
1 |
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'}) |
Um. , It will override the global configuration
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true" } }
1 2 3 4 5 6 7 8 9 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true" } } |
So if you don't want a variable in the global configuration ? It's simple , Set to None that will do
r = s.get('http://httpbin.org/headers', headers={'x-test': None})
1 |
r = s.get('http://httpbin.org/headers', headers={'x-test': None}) |
Running results
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" } }
1 2 3 4 5 6 7 8 |
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" } } |
Um. , That's all session Basic usage of conversation
SSL Certificate validation
Now it's everywhere https The first website ,Requests It can be for HTTPS Request validation SSL certificate , It's like web Browser is the same . To check the... Of a host SSL certificate , You can use verify Parameters
Now? 12306 The certificate is invalid , Let's test it
import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=True) print r.text
1 2 3 4 |
import requests
r = requests.get('https://kyfw.12306.cn/otn/', verify=True) print r.text |
result
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
1 |
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590) |
That's the case
Let's try github Of
import requests r = requests.get('https://github.com', verify=True) print r.text
1 2 3 4 |
import requests
r = requests.get('https://github.com', verify=True) print r.text |
Um. , Normal request , I will not output the content .
If we want to skip just now 12306 Certificate validation of , hold verify Set to False that will do
import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=False) print r.text
1 2 3 4 |
import requests
r = requests.get('https://kyfw.12306.cn/otn/', verify=False) print r.text |
Find out and ask for it . By default verify yes True, So if you need to , You need to manually set this variable .
agent
If you need to use a proxy , You can do this by providing proxies Parameter to configure a single request
import requests proxies = { "https": "http://41.118.132.69:4433" } r = requests.post("http://httpbin.org/post", proxies=proxies) print r.text
1 2 3 4 5 6 7 |
import requests
proxies = { "https": "http://41.118.132.69:4433" } r = requests.post("http://httpbin.org/post", proxies=proxies) print r.text |
You can also use environment variables HTTP_PROXY and HTTPS_PROXY To configure the agent
export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080"
1 2 |
export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080" |
Through the above methods , You can easily set up agents .
API
The above explains requests The most commonly used parameter in , If you need to use more , Please refer to the official documentation API
Conclusion
The above is a summary of requests The basic usage of , If you have a foundation for reptiles , So it's sure to get started soon , I won't repeat it here .
Practice is the king , Put your money into practice as soon as possible .
original text :https://cuiqingcai.com/2556.html
版权声明
本文为[Feng Ye 520]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204230546152033.html
边栏推荐
- An example of linear regression based on tensorflow
- Where is the configuration file of tidb server?
- SQL optimization for advanced learning of MySQL [insert, primary key, sort, group, page, count]
- .104History
- Applet learning notes (I)
- 油猴网站地址
- 2022江西光伏展,中國分布式光伏展會,南昌太陽能利用展
- undefined reference to `Nabo::NearestNeighbourSearch
- Summary of common server error codes
- Anchor location - how to set the distance between the anchor and the top of the page. The anchor is located and offset from the top
猜你喜欢
MySQL_ 01_ Simple data retrieval
C# 网络相关操作
Auto. JS custom dialog box
Future usage details
Logic regression principle and code implementation
SystemVerilog (VI) - variable
Comparison between xtask and kotlin coroutine
JS get link? The following parameter name or value, according to the URL? Judge the parameters after
2021 Great Wall Cup WP
102. Sequence traversal of binary tree
随机推荐
I / O multiplexing and its related details
2022 judgment questions and answers for operation of refrigeration and air conditioning equipment
Gaode map search, drag and drop query address
Auto.js 自定义对话框
Submit local warehouse and synchronize code cloud warehouse
[appium] write scripts by designing Keyword Driven files
Operation of 2022 mobile crane driver national question bank simulation examination platform
ES6 face test questions (reference documents)
Detailed deployment of flask project
C1小笔记【任务训练篇二】
[binary number] maximum depth of binary tree + maximum depth of n-ary tree
2022 Shanghai safety officer C certificate operation certificate examination question bank and simulation examination
SystemVerilog (VI) - variable
JS interview question: FN call. call. call. Call (FN2) parsing
2022年茶艺师(初级)考试模拟100题及模拟考试
Uniapp custom search box adaptation applet alignment capsule
云原生虚拟化:基于 Kubevirt 构建边缘计算实例
Timestamp to formatted date
String function in MySQL
2022江西储能技术展会,中国电池展,动力电池展,燃料电池展