Web Crawlers: Basic Principles, Implementation, and Troubleshooting
2022-08-10 19:16:00 【InfoQ】
I. Why Crawlers Matter
1. Preface
2. What crawlers can do
Large-scale automation
Publicly available data
3. Why crawlers are useful

II. Implementing a Crawler
1. How a crawler works
2. Finding the API

GraphQL
payload = {
    "operation_name": "userPublicProfile",  # name of the query operation
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"<user slug to look up>"}'  # replace with the target user
}
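Writing the `variables` value by hand as a JSON string is easy to get wrong once the user slug comes from input. A minimal sketch, assuming a hypothetical helper named `build_payload` (not part of the original article), that serializes the variables with `json.dumps` instead; `"some-user"` is a placeholder slug:

```python
import json

# GraphQL query text, requesting the same fields as the payload above.
QUERY = '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
'''


def build_payload(user_slug):
    """Return the request payload for one user (hypothetical helper)."""
    return {
        "operation_name": "userPublicProfile",
        "query": QUERY,
        # json.dumps takes care of quoting and escaping the slug.
        "variables": json.dumps({"userSlug": user_slug}),
    }


payload = build_payload("some-user")  # placeholder slug, not a real account
print(payload["variables"])
```

Keeping the query text in one constant also means the crawler only has to change the slug between requests.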
3. Crawler implementation
import requests as rq
from urllib.parse import urlencode

headers = {  # request headers
    "Referer": "https://leetcode.cn",
}
payload = {
    "operation_name": "userPublicProfile",  # name of the query operation
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"romantic-haibty42"}'
}
# The payload is URL-encoded and sent in the query string of the POST request.
res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(payload), headers=headers)
print(res.text)

acTotal
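The `acTotal` value can be pulled out of the response text rather than read off the raw printout. The sketch below assumes the usual GraphQL response layout, where the `data` object mirrors the shape of the query; `extract_ac_total` is a hypothetical helper name and the sample body is made up for illustration, not real output from the site.

```python
import json


def extract_ac_total(response_text):
    """Return (username, acTotal) from a GraphQL response body, or None if absent."""
    body = json.loads(response_text)
    # "data" may be null when the server reports errors instead of a result.
    profile = (body.get("data") or {}).get("userProfilePublicProfile")
    if not profile:
        return None
    progress = profile.get("submissionProgress") or {}
    return profile.get("username"), progress.get("acTotal")


# Made-up sample body for illustration only.
sample = (
    '{"data": {"userProfilePublicProfile": '
    '{"username": "demo", "submissionProgress": {"acTotal": 123}}}}'
)
print(extract_ac_total(sample))
```

In the script above, the same function could be applied to `res.text` in place of `sample`.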
III. Dealing with Anti-Crawling Measures
1. How anti-crawling is implemented
2. How to work around anti-crawling

3. Code for working around anti-crawling

tiqu
#!/usr/bin/env python
# coding=utf-8
import json
import threading
import time
from urllib.parse import urlencode

# socks5:// proxies need the SOCKS extra: pip install "requests[socks]"
import requests as rq

headers = {
    "Referer": "https://leetcode.cn",
}
payload = {
    "operation_name": "userPublicProfile",
    "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
    "variables": '{"userSlug":"kingley"}'
}
username = "romantic-haibty42"


def int_csrf(proxies, header):
    # Fetch a CSRF token through the proxy and store it in the request headers.
    # Defined here but never called by the flow below.
    sess = rq.session()
    sess.proxies = proxies
    sess.head("https://leetcode.cn/graphql/")
    header['x-csrftoken'] = sess.cookies["csrftoken"]


testUrl = 'https://api.myip.la/en?json'  # not used below


# Core logic
def testPost(host, port):
    proxies = {
        'http': 'socks5://{}:{}'.format(host, port),
        'https': 'socks5://{}:{}'.format(host, port),
    }
    while True:
        try:
            # Copy the shared dicts so threads do not mutate the same objects.
            header = dict(headers)
            chaxun = dict(payload)
            chaxun['variables'] = json.dumps({"userSlug": username})
            res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(chaxun),
                          headers=header, proxies=proxies, timeout=10)
            print(host, res.text)
        except Exception as e:
            # Stop using this proxy as soon as a request fails.
            print(e)
            break


class ThreadFactory(threading.Thread):
    def __init__(self, host, port):
        threading.Thread.__init__(self)
        self.host = host
        self.port = port

    def run(self):
        testPost(self.host, self.port)


# Proxy extraction link: returns JSON, SOCKS5 proxies
tiqu = ''
while True:
    # Fetch 10 proxies per request and give each one its own thread
    resp = rq.get(url=tiqu, timeout=5)
    try:
        if resp.status_code == 200:
            dataBean = json.loads(resp.text)
        else:
            print("Fetch failed")
            time.sleep(1)
            continue
    except ValueError:
        print("Fetch failed")
        time.sleep(1)
        continue
    else:
        # Parse the JSON array
        print("code=", dataBean)
        code = dataBean["code"]
        if code == 0:
            threads = []
            for proxy in dataBean["data"]:
                threads.append(ThreadFactory(proxy["ip"], proxy["port"]))
            for t in threads:  # start the threads
                t.start()
                time.sleep(0.01)
            for t in threads:  # wait for the threads to finish
                t.join()
        break
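The parsing of the extraction link's response can be separated from the networking so it can be checked without a live proxy service. This is a minimal sketch under the same assumptions the script makes about the response: a JSON object with `code` equal to 0 on success and a `data` list of entries carrying `ip` and `port`. The function name `proxies_from_response` is hypothetical, and the sample uses a documentation-only IP address.

```python
import json


def proxies_from_response(text):
    """Turn the extraction link's JSON body into requests-style proxy dicts."""
    try:
        body = json.loads(text)
    except ValueError:
        return []  # body was not valid JSON
    if not isinstance(body, dict) or body.get("code") != 0:
        return []  # the script treats code == 0 as success
    result = []
    for item in body.get("data") or []:
        address = "socks5://{}:{}".format(item["ip"], item["port"])
        result.append({"http": address, "https": address})
    return result


# Made-up sample body; 192.0.2.10 is reserved for documentation.
sample = '{"code": 0, "data": [{"ip": "192.0.2.10", "port": 1080}]}'
print(proxies_from_response(sample))
```

Each returned dict can be passed as the `proxies` argument of a `requests` call, which is what `testPost` builds by hand for every host and port.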

4. What else IPIDEA can do


IV. Summary