Web Crawlers: Basic Principles, Implementation, and Troubleshooting
2022-08-10 19:16:00 【InfoQ】
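I. The significance of crawlers
1. Preface
2. What crawlers can do
Large-scale automation
Public data
3. Why crawlers matter
II. Implementing a crawler
1. How a crawler works
2. Obtaining the API
GraphQL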
payload = {"operation_name": "userPublicProfile",  # content of the query request
           "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
           "variables": '{"userSlug":"<user to look up>"}'
           }
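Hand-writing the `variables` JSON string gets fragile once the user slug changes between requests. The sketch below shows one way to build the same payload for any user with `json.dumps`. `build_payload`, `QUERY`, and the slug `example-user` are illustrative names, not part of the original article.

```python
import json
from urllib.parse import urlencode

# Same GraphQL query as in the payload above
QUERY = '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
'''


def build_payload(user_slug):
    """Return the request payload for one user (hypothetical helper)."""
    return {
        "operation_name": "userPublicProfile",
        "query": QUERY,
        # json.dumps takes care of quoting and escaping the slug
        "variables": json.dumps({"userSlug": user_slug}),
    }


payload = build_payload("example-user")  # placeholder slug
query_string = urlencode(payload)        # ready to append after "?"
```

Keeping the query text in one constant and generating only `variables` means each lookup differs in a single field.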
3. Implementing the crawler
import requests as rq
from urllib.parse import urlencode

headers = {  # request header information
    "Referer": "https://leetcode.cn",
}
payload = {"operation_name": "userPublicProfile",  # content of the query request
           "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
           "variables": '{"userSlug":"romantic-haibty42"}'
           }
# The payload is sent as URL query parameters; the timeout keeps the call from hanging forever
res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(payload), headers=headers, timeout=10)
print(res.text)
acTotal
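The script above prints the raw response body. A GraphQL response normally mirrors the shape of the query under a top-level `data` key, so `acTotal` could be pulled out roughly as follows. This is a sketch under that assumption: `extract_ac_total` is a hypothetical helper, and `sample` is made-up data, not a real server reply.

```python
import json


def extract_ac_total(response_text):
    """Return acTotal from a GraphQL response body, or None if it is missing."""
    body = json.loads(response_text)
    # Each level may be absent or null (for example when the user does not exist)
    profile = (body.get("data") or {}).get("userProfilePublicProfile") or {}
    progress = profile.get("submissionProgress") or {}
    return progress.get("acTotal")


# Made-up sample shaped like the query, for illustration only
sample = json.dumps({
    "data": {
        "userProfilePublicProfile": {
            "username": "example-user",
            "submissionProgress": {"acTotal": 123},
        }
    }
})
print(extract_ac_total(sample))
```

In the crawler itself the call would be `extract_ac_total(res.text)`; returning `None` instead of raising keeps a missing profile from stopping a batch run.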
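III. Anti-crawling solutions
1. How anti-crawling measures work
2. Ways around anti-crawling measures
3. Code for working around anti-crawling measures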
tiqu
#!/usr/bin/env python
# coding=utf-8
import json
import threading
import time
import requests as rq  # SOCKS5 proxies also need PySocks: pip install "requests[socks]"
from urllib.parse import urlencode

headers = {
    "Referer": "https://leetcode.cn",
}
payload = {"operation_name": "userPublicProfile",
           "query": '''query userPublicProfile($userSlug: String!) {
  userProfilePublicProfile(userSlug: $userSlug) {
    username
    submissionProgress {
      acTotal
    }
  }
}
''',
           "variables": '{"userSlug":"kingley"}'
           }
username = "romantic-haibty42"


def int_csrf(proxies, header):
    # Fetch a CSRF token through the proxy and store it in the given header dict
    sess = rq.session()
    sess.proxies = proxies
    sess.head("https://leetcode.cn/graphql/")
    header['x-csrftoken'] = sess.cookies["csrftoken"]


testUrl = 'https://api.myip.la/en?json'  # not used below
# Core logic
def testPost(host, port):
    proxies = {
        'http': 'socks5://{}:{}'.format(host, port),
        'https': 'socks5://{}:{}'.format(host, port),
    }
    res = ""
    while True:  # keep querying through this proxy until a request fails
        try:
            header = dict(headers)  # copy, so threads do not share one dict
            # print(res.status_code)
            chaxun = dict(payload)  # copy before overwriting 'variables'
            chaxun['variables'] = json.dumps({"userSlug": username})
            res = rq.post("https://leetcode.cn/graphql/" + "?" + urlencode(chaxun),
                          headers=header, proxies=proxies, timeout=10)
            print(host, res.text)
        except Exception as e:
            print(e)
            break
class ThreadFactory(threading.Thread):
    def __init__(self, host, port):
        threading.Thread.__init__(self)
        self.host = host
        self.port = port

    def run(self):
        testPost(self.host, self.port)
# Proxy extraction link: must return JSON and provide SOCKS5 proxies (fill in your own)
tiqu = ''
while True:
    # Extract 10 proxies at a time and give each one its own thread
    resp = rq.get(url=tiqu, timeout=5)
    try:
        if resp.status_code == 200:
            dataBean = json.loads(resp.text)
        else:
            print("Fetch failed")
            time.sleep(1)
            continue
    except ValueError:
        print("Fetch failed")
        time.sleep(1)
        continue
    else:
        # Parse the JSON array
        print("code=", dataBean)
        code = dataBean["code"]
        if code == 0:
            threads = []
            for proxy in dataBean["data"]:
                threads.append(ThreadFactory(proxy["ip"], proxy["port"]))
            for t in threads:  # start the threads
                t.start()
                time.sleep(0.01)
            for t in threads:  # block until every thread finishes
                t.join()
            # break
    break
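The loop above mixes parsing the proxy list with starting threads. As a rough sketch, the parsing step could be separated into a small function that is easy to check without any network access. It assumes the reply shape the code above relies on (a `code` field equal to 0 on success and a `data` array of objects with `ip` and `port`). `proxies_from_response` and the sample address are illustrative, not part of the original article or of any provider's documented API.

```python
def proxies_from_response(data_bean):
    """Turn the extraction API's parsed JSON into requests-style proxy dicts."""
    if data_bean.get("code") != 0:
        return []  # the API reported an error, so there is nothing to use
    result = []
    for proxy in data_bean.get("data", []):
        address = "socks5://{}:{}".format(proxy["ip"], proxy["port"])
        result.append({"http": address, "https": address})
    return result


# Made-up reply using a documentation-only IP address
sample = {"code": 0, "data": [{"ip": "203.0.113.5", "port": 1080}]}
print(proxies_from_response(sample))
```

Each returned dict can be passed directly as the `proxies` argument of `rq.post`, which is the shape `testPost` builds by hand.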
4. What else IPIDEA can do
IV. Summary