当前位置：网站首页>Basic usage of crawler requests

Basic usage of crawler requests

2022-04-23 17:58:00 【Feng Ye 520】

Preface

We used urllib library , This is a good tool for getting started , To understand some basic concepts of reptiles , It is helpful to master the process of crawling . After introduction , We need to learn more advanced content and tools to facilitate our crawling . In this section, let's briefly introduce requests Basic usage of the library .

notes ：Python The version is still based on 2.7

Official documents

Most of the following is from official documents , This paper makes some modifications and summaries . To learn more, you can refer to

Official documents

install

utilize pip install

$ pip install requests

1	$ pip install requests

Or make use of easy_install

$ easy_install requests

1	$ easy_install requests

The installation can be completed by the above two methods .

introduce

First of all, let's introduce a small example to experience

Python

import requests r = requests.get('http://cuiqingcai.com') print type(r) print r.status_code print r.encoding #print r.text print r.cookies

import requests

r = requests.get('http://cuiqingcai.com')

print type(r)

print r.status_code

print r.encoding

#print r.text

print r.cookies

The above code we requested the website address , Then print out the type of return result , Status code , Encoding mode ,Cookies The content such as .

The operation results are as follows

200

UTF-8

<RequestsCookieJar[]>

How to , Is it convenient . Don't worry. , It's more convenient in the back .

Basic request

requests The library provides http All the basic request methods . for example

r = requests.post("http://httpbin.org/post") r = requests.put("http://httpbin.org/put") r = requests.delete("http://httpbin.org/delete") r = requests.head("http://httpbin.org/get") r = requests.options("http://httpbin.org/get")

r = requests.post("http://httpbin.org/post")

r = requests.put("http://httpbin.org/put")

r = requests.delete("http://httpbin.org/delete")

r = requests.head("http://httpbin.org/get")

r = requests.options("http://httpbin.org/get")

Um. , In a word .

basic GET request

The most basic GET The request can be used directly get Method

r = requests.get("http://httpbin.org/get")

1	r = requests.get("http://httpbin.org/get")

If you want to add parameters , You can use params Parameters

import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.get("http://httpbin.org/get", params=payload) print r.url

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

r = requests.get("http://httpbin.org/get", params=payload)

print r.url

Running results

http://httpbin.org/get?key2=value2&key1=value1

1	http://httpbin.org/get?key2=value2&key1=value1

If you want to ask JSON file , You can use json() Method resolution

For example, write a JSON The file is named a.json, The contents are as follows

["foo", "bar", { "foo": "bar" }]

["foo", "bar", {

"foo": "bar"

}]

Use the following program to request and parse

Python

import requests r = requests.get("a.json") print r.text print r.json()

import requests

r = requests.get("a.json")

print r.text

print r.json()

The operation results are as follows , One is to output content directly , Another way is to use json() Method resolution , Feel the difference between them

["foo", "bar", { "foo": "bar" }] [u'foo', u'bar', {u'foo': u'bar'}]

["foo", "bar", {

"foo": "bar"

}]

[u'foo', u'bar', {u'foo': u'bar'}]

If you want to get the raw socket response from the server , You can get r.raw . However, it needs to be set in the initial request stream=True .

r = requests.get('https://github.com/timeline.json', stream=True) r.raw <requests.packages.urllib3.response.HTTPResponse object at 0x101194810> r.raw.read(10) '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

r = requests.get('https://github.com/timeline.json', stream=True)

r.raw

<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>

r.raw.read(10)

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In this way, the original socket content of the web page is obtained .

If you want to add headers, It can be transmitted headers Parameters

Python

import requests payload = {'key1': 'value1', 'key2': 'value2'} headers = {'content-type': 'application/json'} r = requests.get("http://httpbin.org/get", params=payload, headers=headers) print r.url

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

headers = {'content-type': 'application/json'}

r = requests.get("http://httpbin.org/get", params=payload, headers=headers)

print r.url

adopt headers Parameter can add... In the request header headers Information

basic POST request

about POST The request for , We usually need to add some parameters to it . Then the most basic method of parameter transfer can be used data This parameter .

import requests payload = {'key1': 'value1', 'key2': 'value2'} r = requests.post("http://httpbin.org/post", data=payload) print r.text

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

r = requests.post("http://httpbin.org/post", data=payload)

print r.text

Running results

{ "args": {}, "data": "", "files": {}, "form": { "key1": "value1", "key2": "value2" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }

{

"args": {},

"data": "",

"files": {},

"form": {

"key1": "value1",

"key2": "value2"

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Content-Length": "23",

"Content-Type": "application/x-www-form-urlencoded",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1"

"json": null,

"url": "http://httpbin.org/post"

}

You can see that the parameters are passed successfully , Then the server returns the data we sent .

Sometimes the information we need to send is not in form , We need to pass JSON Format data in the past , So we can use json.dumps() Method to serialize form data .

import json import requests url = 'http://httpbin.org/post' payload = {'some': 'data'} r = requests.post(url, data=json.dumps(payload)) print r.text

import json

import requests

url = 'http://httpbin.org/post'

payload = {'some': 'data'}

r = requests.post(url, data=json.dumps(payload))

print r.text

Running results

{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" }

{

"args": {},

"data": "{\"some\": \"data\"}",

"files": {},

"form": {},

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Content-Length": "16",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1"

"json": {

"some": "data"

"url": "http://httpbin.org/post"

}

By the above methods , We can POST JSON Formatted data

If you want to upload a file , Then use it directly file Parameters can be

Create a new one a.txt The file of , It says Hello World!

import requests url = 'http://httpbin.org/post' files = {'file': open('test.txt', 'rb')} r = requests.post(url, files=files) print r.text

import requests

url = 'http://httpbin.org/post'

files = {'file': open('test.txt', 'rb')}

r = requests.post(url, files=files)

print r.text

You can see that the running results are as follows

{ "args": {}, "data": "", "files": { "file": "Hello World!" }, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "156", "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": null, "url": "http://httpbin.org/post" }

{

"args": {},

"data": "",

"files": {

"file": "Hello World!"

"form": {},

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Content-Length": "156",

"Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1"

"json": null,

"url": "http://httpbin.org/post"

}

In this way, we successfully completed the upload of a file .

requests It supports streaming upload , This allows you to send large streams or files without first reading them into memory . To use streaming upload , Just provide a class file object for your request body

with open('massive-body') as f: requests.post('http://some.url/streamed', data=f)

1 2	with open('massive-body') as f: requests.post('http://some.url/streamed', data=f)

This is a very practical and convenient function .

Cookies

If a response contains cookie, Then we can use cookies Variables to get

import requests url = 'http://example.com' r = requests.get(url) print r.cookies print r.cookies['example_cookie_name']

import requests

url = 'http://example.com'

r = requests.get(url)

print r.cookies

print r.cookies['example_cookie_name']

The above procedure is just an example , It can be used cookies Variable to get the cookies

In addition, it can be used cookies Variable to send... To the server cookies Information

import requests url = 'http://httpbin.org/cookies' cookies = dict(cookies_are='working') r = requests.get(url, cookies=cookies) print r.text

import requests

url = 'http://httpbin.org/cookies'

cookies = dict(cookies_are='working')

r = requests.get(url, cookies=cookies)

print r.text

Running results

'{"cookies": {"cookies_are": "working"}}'

1	'{"cookies": {"cookies_are": "working"}}'

Yes, it has been successfully sent to the server cookies

The timeout configuration

You can use timeout Variable to configure the maximum request time

requests.get('http://github.com', timeout=0.001)

1	requests.get('http://github.com', timeout=0.001)

notes ：timeout Only valid for connection process , Nothing to do with the download of the response body .

in other words , This time is limited to the requested time . Even if it comes back response It contains a lot of content , It takes time to download , But it doesn't work .

Conversation object

In the above request , Every request is actually a new request . That is to say, each of our requests is opened by a different browser . That is, it is not a conversation , Even if the same URL is requested . such as

import requests requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = requests.get("http://httpbin.org/cookies") print(r.text)

import requests

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

r = requests.get("http://httpbin.org/cookies")

print(r.text)

The result is

{ "cookies": {} }

{

"cookies": {}

}

Obviously , It's not in a conversation , Can't get cookies, So in some sites , What should we do to keep a lasting conversation ？ It's like browsing Taobao with a browser , Jump between different tabs , In fact, this is to establish a long-term conversation .

The solution is as follows

import requests s = requests.Session() s.get('http://httpbin.org/cookies/set/sessioncookie/123456789') r = s.get("http://httpbin.org/cookies") print(r.text)

import requests

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

r = s.get("http://httpbin.org/cookies")

print(r.text)

Here we have asked twice , One is setting up cookies, One is to get cookies

Running results

{ "cookies": { "sessioncookie": "123456789" } }

{

"cookies": {

"sessioncookie": "123456789"

}

Discovery can succeed in obtaining cookies 了 , That's how establishing a conversation works . Give it a try .

So since the session is a global variable , So we can definitely use it for global configuration .

import requests s = requests.Session() s.headers.update({'x-test': 'true'}) r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'}) print r.text

import requests

s = requests.Session()

s.headers.update({'x-test': 'true'})

r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})

print r.text

adopt s.headers.update Method set headers The variable of . Then we set up a... In the request headers, So what happens ？

It's simple , Both variables are passed .

Running results

{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true", "X-Test2": "true" } }

{

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1",

"X-Test": "true",

"X-Test2": "true"

}

If get Method of transmission headers also x-test Well ？

r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})

1	r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})

Um. , It will override the global configuration

{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1", "X-Test": "true" } }

{

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1",

"X-Test": "true"

}

So if you don't want a variable in the global configuration ？ It's simple , Set to None that will do

r = s.get('http://httpbin.org/headers', headers={'x-test': None})

1	r = s.get('http://httpbin.org/headers', headers={'x-test': None})

Running results

{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" } }

{

"headers": {

"Accept": "*/*",

"Accept-Encoding": "gzip, deflate",

"Host": "httpbin.org",

"User-Agent": "python-requests/2.9.1"

}

Um. , That's all session Basic usage of conversation

SSL Certificate validation

Now it's everywhere https The first website ,Requests It can be for HTTPS Request validation SSL certificate , It's like web Browser is the same . To check the... Of a host SSL certificate , You can use verify Parameters

Now? 12306 The certificate is invalid , Let's test it

import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=True) print r.text

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=True)

print r.text

result

requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

1	requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

That's the case

Let's try github Of

import requests r = requests.get('https://github.com', verify=True) print r.text

import requests

r = requests.get('https://github.com', verify=True)

print r.text

Um. , Normal request , I will not output the content .

If we want to skip just now 12306 Certificate validation of , hold verify Set to False that will do

import requests r = requests.get('https://kyfw.12306.cn/otn/', verify=False) print r.text

import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=False)

print r.text

Find out and ask for it . By default verify yes True, So if you need to , You need to manually set this variable .

agent

If you need to use a proxy , You can do this by providing proxies Parameter to configure a single request

import requests proxies = { "https": "http://41.118.132.69:4433" } r = requests.post("http://httpbin.org/post", proxies=proxies) print r.text

import requests

proxies = {

"https": "http://41.118.132.69:4433"

}

r = requests.post("http://httpbin.org/post", proxies=proxies)

print r.text

You can also use environment variables HTTP_PROXY and HTTPS_PROXY To configure the agent

export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080"

1 2	export HTTP_PROXY="http://10.10.1.10:3128" export HTTPS_PROXY="http://10.10.1.10:1080"

Through the above methods , You can easily set up agents .

API

The above explains requests The most commonly used parameter in , If you need to use more , Please refer to the official documentation API

API

Conclusion

The above is a summary of requests The basic usage of , If you have a foundation for reptiles , So it's sure to get started soon , I won't repeat it here .

Practice is the king , Put your money into practice as soon as possible .

original text :https://cuiqingcai.com/2556.html

版权声明
本文为[Feng Ye 520]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/04/202204230546152033.html