Web Crawlers: Basic Principles, Implementation, and Troubleshooting
2022-08-10 19:16:00 【InfoQ】
I. Why Crawlers Matter
1. Preface
2. What crawlers can do
Large-scale automation
Publicly available data
3. Why crawlers are useful

II. Implementing a Crawler
1. How a crawler works
2. Finding the API

GraphQL
# Request content: the GraphQL query sent to the backend
payload = {
    "operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"<user slug to query>"}'
}
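Hand-writing the `variables` JSON string is easy to get wrong once the user slug contains quotes or non-ASCII characters. A minimal sketch of a helper that builds the same payload for any user might look like this; `build_payload` and `PROFILE_QUERY` are hypothetical names introduced here for illustration, not part of the original code.

```python
import json

# Same GraphQL query as above, kept in one place so it can be reused.
PROFILE_QUERY = """query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
"""


def build_payload(user_slug):
    """Return the request payload for one user (hypothetical helper)."""
    return {
        "operation_name": "userPublicProfile",
        "query": PROFILE_QUERY,
        # json.dumps escapes the slug correctly instead of hand-written JSON.
        "variables": json.dumps({"userSlug": user_slug}),
    }
```

Calling `build_payload("kingley")` yields a dictionary with the same three keys as the literal payload, with `variables` set to a JSON string for that slug.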
3. Crawler implementation
import requests as rq
from urllib.parse import urlencode

# Request headers
headers = {
    "Referer": "https://leetcode.cn",
}
# Request content: the GraphQL query sent to the backend
payload = {
    "operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"romantic-haibty42"}'
}
# The payload is URL-encoded into the query string of the POST request
res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(payload), headers=headers)
print(res.text)

acTotal
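The printed response is raw JSON text, so the `acTotal` value still has to be pulled out of it. The sketch below shows one way to do that. It assumes the standard GraphQL response envelope (`{"data": {...}}`) wrapping the fields named in the query; `extract_ac_total` is a hypothetical helper, and the sample body is hand-written for illustration, not real LeetCode output.

```python
import json


def extract_ac_total(response_text):
    """Return (username, acTotal) from a GraphQL response body, or None.

    Assumes the usual {"data": {...}} envelope around the requested
    fields; the real response may contain additional keys.
    """
    body = json.loads(response_text)
    profile = (body.get("data") or {}).get("userProfilePublicProfile")
    if not profile:
        return None  # unknown user, or the server returned only "errors"
    progress = profile.get("submissionProgress") or {}
    return profile.get("username"), progress.get("acTotal")


# Hand-written sample body for illustration only.
sample = (
    '{"data": {"userProfilePublicProfile": '
    '{"username": "demo", "submissionProgress": {"acTotal": 123}}}}'
)
print(extract_ac_total(sample))  # ('demo', 123)
```

In the crawler above, the same function could be applied to `res.text`.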
III. Dealing with Anti-Crawling Measures
1. How anti-crawling measures work
2. How to get around anti-crawling measures

3. Code for getting around anti-crawling measures

tiqu
#!/usr/bin/env python
# coding=utf-8
import json
import threading
import time
import requests as rq  # SOCKS5 proxies also need PySocks: pip install "requests[socks]"
from urllib.parse import urlencode

headers = {
    "Referer": "https://leetcode.cn",
}
payload = {
    "operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"kingley"}'
}
username = "romantic-haibty42"


def int_csrf(proxies, header):
    # Fetch a CSRF token through the proxy and store it in the request headers.
    # Not called anywhere in this script.
    sess = rq.session()
    sess.proxies = proxies
    sess.head("https://leetcode.cn/graphql/")
    header['x-csrftoken'] = sess.cookies["csrftoken"]


# Not used in this script
testUrl = 'https://api.myip.la/en?json'
# Core logic
def testPost(host, port):
    proxies = {
        'http': 'socks5://{}:{}'.format(host, port),
        'https': 'socks5://{}:{}'.format(host, port),
    }
    res = ""
    while True:
        try:
            # Copy the shared dicts so threads do not modify the same objects
            header = dict(headers)
            # print(res.status_code)
            chaxun = dict(payload)
            chaxun['variables'] = json.dumps({"userSlug": username})
            res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(chaxun),
                          headers=header, proxies=proxies, timeout=5)
            print(host, res.text)
        except Exception as e:
            print(e)
            break  # stop using this proxy once a request fails


class ThreadFactory(threading.Thread):
    def __init__(self, host, port):
        threading.Thread.__init__(self)
        self.host = host
        self.port = port

    def run(self):
        testPost(self.host, self.port)
# Proxy extraction link; it must return JSON and provide SOCKS5 proxies
tiqu = ''
while True:
    # Extract 10 proxies at a time and hand each one to a thread
    resp = rq.get(url=tiqu, timeout=5)
    try:
        if resp.status_code == 200:
            dataBean = json.loads(resp.text)
        else:
            print("Fetch failed")
            time.sleep(1)
            continue
    except ValueError:
        print("Fetch failed")
        time.sleep(1)
        continue
    else:
        # Parse the JSON array
        print("code=", dataBean)
        code = dataBean["code"]
        if code == 0:
            threads = []
            for proxy in dataBean["data"]:
                threads.append(ThreadFactory(proxy["ip"], proxy["port"]))
            for t in threads:  # start the threads
                t.start()
                time.sleep(0.01)
            for t in threads:  # block until the threads finish
                t.join()
            # break
    break
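The main loop mixes fetching, JSON parsing, and thread management, which makes the parsing step hard to check on its own. As a rough sketch, that step can be pulled out into a pure function and tried on sample data without any network access. It assumes the response shape the script above relies on (`{"code": 0, "data": [{"ip": ..., "port": ...}]}`); `proxies_from_response` is a hypothetical helper, and the sample addresses are documentation-only values.

```python
def proxies_from_response(data_bean):
    """Convert the extraction API's JSON into requests-style proxy dicts.

    Assumed shape: {"code": 0, "data": [{"ip": ..., "port": ...}, ...]}.
    Returns an empty list when the API reports a non-zero code.
    """
    if data_bean.get("code") != 0:
        return []
    result = []
    for proxy in data_bean.get("data", []):
        url = "socks5://{}:{}".format(proxy["ip"], proxy["port"])
        result.append({"http": url, "https": url})
    return result


# Illustrative sample only; 192.0.2.x is a reserved documentation range.
sample_response = {
    "code": 0,
    "data": [
        {"ip": "192.0.2.10", "port": 1080},
        {"ip": "192.0.2.11", "port": 1081},
    ],
}
for proxies in proxies_from_response(sample_response):
    print(proxies["https"])
```

Each returned dictionary can be passed directly as the `proxies=` argument of `requests.post`.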

4. What else IPIDEA can do


IV. Summary