magical_spider: a remote data collection scheme
2022-08-11 07:47:00 【考古学家lx(李玺)】
magical_spider
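A "magical spider" project. The source architecture is very simple, and it is suited to data collection tasks.
Example of the index page: (screenshot omitted)
Project repository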
https://github.com/lixi5338619/magical_spider
Usage
1. Configure settings.py and start the Flask service.
2. For test code, refer to the files in the demo folder. The run flow relies mainly on runflow.py:
import requests

host = 'http://127.0.0.1:5000'

def magical_start(project_name, base_url='http://www.lxspider.com'):
    # 1. create browser and select session_id
    result = requests.post(f'{host}/create', data={'name': project_name, 'url': base_url}).json()
    session_id, process_url = result['session_id'], result['process_url']
    return session_id, process_url

def magical_request(session_id, process_url, request_url):
    # 2. request browser_xhr
    data = {
        'session_id': session_id, 'process_url': process_url,
        'request_url': request_url, 'request_type': 'get'}
    result = requests.post(f'{host}/xhr', data=data).json()
    return result['result']

def magical_close(session_id, process_url, process_name):
    # 4. close browser
    close_data = {
        'session_id': session_id, 'process_url': process_url, 'process_name': process_name}
    requests.post(f'{host}/close', data=close_data).json()
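The three functions above form a start / request / close cycle, and a remote browser that is never closed keeps running on the server. As a minimal sketch (not part of the project), the cycle can be wrapped in a context manager so that the close call runs even when a request fails. The name `magical_session` is hypothetical, and `start` and `close` are assumed to behave like `magical_start` and `magical_close`; they are passed in as arguments so the wrapper does not depend on the HTTP layer.

```python
from contextlib import contextmanager


@contextmanager
def magical_session(project_name, base_url, start, close):
    """Open a remote browser session and always close it on exit.

    `start` and `close` are assumed to have the same signatures as
    magical_start and magical_close from runflow.py (hypothetical wrapper).
    """
    session_id, process_url = start(project_name, base_url)
    try:
        # Hand the session to the body of the `with` block.
        yield session_id, process_url
    finally:
        # Runs whether the body finished normally or raised.
        close(session_id, process_url, project_name)
```

With the runflow.py functions imported, usage would look like `with magical_session('cnipa', 'https://www.cnipa.gov.cn', magical_start, magical_close) as (session_id, process_url): ...`, with the `magical_request` calls placed inside the block.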
3. Test code
GET request
from demo.runflow import magical_start,magical_request,magical_close
project_name = 'cnipa'
base_url = 'https://www.cnipa.gov.cn'
session_id,process_url = magical_start(project_name,base_url)
print(len(magical_request(session_id, process_url,'https://www.cnipa.gov.cn/col/col57/index.html')))
magical_close(session_id,process_url,project_name)
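POST request (this call passes request_type and formdata, so it needs a version of magical_request that accepts those two keyword arguments; the three-argument version shown in step 2 would raise a TypeError)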
from demo.runflow import magical_start,magical_request,magical_close
import json
project_name = 'chinadrugtrials'
base_url = 'http://www.chinadrugtrials.org.cn'
session_id,process_url = magical_start(project_name,base_url)
data = {
    "id": "", "ckm_index": "", "sort": "desc", "sort2": "", "rule": "CTR",
    "secondLevel": "0", "currentpage": "2", "keywords": "", "reg_no": "",
    "indication": "", "case_no": "", "drugs_name": "", "drugs_type": "",
    "appliers": "", "communities": "", "researchers": "", "agencies": "", "state": ""}
formdata = json.dumps(data)
print(magical_request(
    session_id=session_id, process_url=process_url,
    request_url='http://www.chinadrugtrials.org.cn/clinicaltrials.searchlist.dhtml',
    request_type='post', formdata=formdata,
))
magical_close(session_id,process_url,project_name)
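The POST example sends one fixed page ("currentpage": "2"). To walk through several result pages, the form can be rebuilt per page. The following is a minimal sketch: the field names are copied from the example above, while `BASE_FORM` and `build_formdata` are hypothetical names introduced here, and it is an assumption that the site pages purely through the `currentpage` field.

```python
import json

# Field names copied from the POST example; every value is a string.
BASE_FORM = {
    "id": "", "ckm_index": "", "sort": "desc", "sort2": "", "rule": "CTR",
    "secondLevel": "0", "currentpage": "1", "keywords": "", "reg_no": "",
    "indication": "", "case_no": "", "drugs_name": "", "drugs_type": "",
    "appliers": "", "communities": "", "researchers": "", "agencies": "",
    "state": "",
}


def build_formdata(page, **overrides):
    """Return the JSON-encoded form for one result page.

    `overrides` replaces individual fields, e.g. keywords="...".
    """
    form = dict(BASE_FORM)            # copy, so BASE_FORM is never mutated
    form["currentpage"] = str(page)   # the example sends the page as a string
    form.update(overrides)
    return json.dumps(form)
```

Each returned string can then be passed as the `formdata` argument in a call like the one in the POST example.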
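4. The index page lets you view and manage the tasks that are currently running, and also shows system memory and disk usage.
5. The demo folder contains runflow.py, which summarizes the task flow, as well as Douyin and drug administration (药监局) case studies and single-task and multi-task examples.
Linux deployment
1. Install Chrome (choose the install location yourself)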
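The multi-task case mentioned above can be approximated on the client side with a thread pool. This is only a sketch under stated assumptions and not the project's own multi-task demo: `run_tasks` is a hypothetical helper, `run_one` stands for any function that performs one complete start / request / close cycle, and whether the server copes well with several concurrent sessions is not stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, run_one, max_workers=4):
    """Call run_one(*task) for every task in parallel.

    `tasks` is a list of tuples whose first element is the project name.
    Returns a dict mapping project name to the task's result, or to the
    exception it raised, so one failing task does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which project name each future belongs to.
        futures = {pool.submit(run_one, *task): task[0] for task in tasks}
        for future, name in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results
```

Each task should use its own project name, since the name is the dictionary key here and is also what the close call in runflow.py identifies the process by.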
yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
2. Check the Chrome version
google-chrome --version
3. Install the matching version of chromedriver_linux64
For example, my Chrome version is 104.0.5112.79
wget https://npm.taobao.org/mirrors/chromedriver/104.0.5112.79/chromedriver_linux64.zip
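4. Unzip
unzip chromedriver_linux64.zip
5. Grant permissions
chmod 777 chromedriver
6. Update the chromedriver path in the project's settings.py
7. Install the Python dependencies, then start the Flask project
- Python dependencies: flask, selenium, websockets, opencv-python, numpy (sqlite3 is also used, but it ships with the Python standard library and needs no separate install)
- Start Flask with: python3 server.py
8. Open the server port for external access
9. Run the project and test it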
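Steps 2 and 3 amount to reading the Chrome version and substituting it into the download URL. A minimal sketch of that substitution follows. The function names are hypothetical, the output format "Google Chrome 104.0.5112.79" is an assumption about what `google-chrome --version` prints, and the mirror may not carry a chromedriver build for every exact Chrome version, so the resulting URL should be checked before use.

```python
import re

# Mirror base URL taken from the wget command in step 3.
MIRROR = "https://npm.taobao.org/mirrors/chromedriver"


def parse_chrome_version(version_output):
    """Extract the four-part version from `google-chrome --version` output."""
    match = re.search(r"(\d+\.\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return match.group(1)


def chromedriver_url(version, mirror=MIRROR):
    """Build the chromedriver_linux64.zip URL for a given version string."""
    return f"{mirror}/{version}/chromedriver_linux64.zip"
```

The version string can be obtained by running `google-chrome --version` and passing its output to `parse_chrome_version`.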