Scrapy_Redis distributed processing
2022-08-08 06:32:00 【feifeiyechuan】
Redis is a non-relational database that supports distributed processing.
- Redis is widely regarded as one of the fastest in-memory key-value databases.
- Used as a buffer for temporary data, Redis exploits the high read/write speed of memory and can greatly improve a crawler's efficiency.
- scrapy-redis is a set of Redis-based components that make distributed crawling with Scrapy easier to implement.
- scrapy-redis replaces Scrapy's in-process deque with a Redis database, so multiple spiders can read from the same Redis database. That solves the main problem of distribution.
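To make the last point concrete, here is a minimal sketch of the idea in plain Python. It is not scrapy-redis code: SharedQueue is a hypothetical in-memory stand-in for the Redis queue, and the spider names and example.com URLs are placeholders. It only shows that when every spider pops from one shared queue, no URL is handled twice.

```python
from collections import deque


class SharedQueue:
    """In-memory stand-in for the Redis queue shared by every spider."""

    def __init__(self):
        self._items = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue a URL unless it has already been scheduled."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._items.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self._items.popleft() if self._items else None


def crawl(shared, spider_names):
    """Let the spiders take turns popping from the one shared queue."""
    handled = {name: [] for name in spider_names}
    turn = 0
    while True:
        url = shared.pop()
        if url is None:
            break
        handled[spider_names[turn % len(spider_names)]].append(url)
        turn += 1
    return handled


shared = SharedQueue()
for url in ["http://example.com/1", "http://example.com/2",
            "http://example.com/1", "http://example.com/3"]:
    shared.push(url)  # the repeated /1 is rejected by the seen-set

result = crawl(shared, ["spider_a", "spider_b"])
```

With a private deque per spider, each process would only see its own URLs; moving the queue and the seen-set into one shared store is what lets the work be split.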
The rest of this document shows how to set up distributed processing with Scrapy_redis directly:
1. Install the Scrapy_redis module
pip install scrapy_redis
2. Scrapy configuration
In the spider file, define the spider class.
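The original screenshot of the spider class is missing, so the following is only a sketch of what such a class might look like, not the author's code. It assumes the RedisSpider base class from scrapy_redis.spiders; the class name, the spider name and the parse body are hypothetical, and redis_key = 'dangdang' matches the key used later in this document.

```python
try:
    from scrapy_redis.spiders import RedisSpider
except ImportError:
    # Fallback so the sketch can be read and imported without scrapy-redis;
    # a real crawl needs the actual RedisSpider base class.
    RedisSpider = object


class DangdangSpider(RedisSpider):
    name = "dangdang"        # hypothetical spider name
    redis_key = "dangdang"   # Redis key the spider waits on for start URLs

    def parse(self, response):
        # Placeholder callback: a real spider would extract book fields here.
        yield {"url": response.url}
```

Note that the class sets no start_urls: the start URLs come from the Redis key instead.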

New settings in the settings.py file (you can copy them in as they are):
# Added configuration
# Duplicate filter (deduplication)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist the scheduler state (optional)
SCHEDULER_PERSIST = True
# Schedule requests with a priority queue (optional)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Redis host and port
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Alternatively: REDIS_URL = 'redis://127.0.0.1:6379'
# Note: add the Redis pipeline to ITEM_PIPELINES. If the project already defines
# ITEM_PIPELINES, add this entry to the existing dict rather than assigning a
# second dict, because a later assignment overrides the earlier one.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 100,
}
Of course, the settings on the other distributed machines differ slightly.
Only one thing changes: set REDIS_HOST = '<IP address of the server host>' (run ipconfig in cmd on the server to find it). Everything else stays the same.
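The two warnings above (do not overwrite an existing ITEM_PIPELINES, and change only REDIS_HOST on the other machines) can be sketched like this. The pipeline path myproject.pipelines.BookPipeline and the helper worker_settings are hypothetical names used only for illustration.

```python
# Pipelines the project might already define (hypothetical example entry).
ITEM_PIPELINES = {
    "myproject.pipelines.BookPipeline": 300,
}

# Add the Redis pipeline to the same dict instead of assigning a new one,
# so the existing entry is not overwritten.
ITEM_PIPELINES["scrapy_redis.pipelines.RedisPipeline"] = 100


def worker_settings(server_ip, port=6379):
    """Return the only settings that differ on a worker machine."""
    return {"REDIS_HOST": server_ip, "REDIS_PORT": port}
```

For example, worker_settings("192.168.1.10") would describe a worker pointing at a server with that (made-up) address.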
3. Installing and configuring Redis
(1) Download the Redis archive
File: Redis-x64-3.2.100.zip
Link: http://note.youdao.com/noteshare?id=49b64d39412e68ef43d3db4727b1e04c&sub=85B94D1B540446DF93E5B1C70426AB48
(2) Extract Redis-x64-3.2.100.zip to a folder of your choice

(3) Configure the PATH
Add the extracted folder to the system PATH environment variable.

(4) Press Win+R and run cmd to open a command prompt
Type redis-server. If the Redis startup screen appears, the configuration succeeded.

(5) Install Redis the same way on the other computer
(6) On the computer that acts as the server, edit the redis.windows-service.conf configuration file as follows (this allows remote access):

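Open the configuration file and make these changes:
1) Comment out the following line
# bind 127.0.0.1
2) By default Redis does not run as a daemon. This option controls that (yes enables daemon mode); set it to no
daemonize no
3) Turn off protected mode
protected-mode no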
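If you want to double-check the edited file, a small script can do it. This is only a sketch under assumptions: remote_access_problems is a hypothetical helper, it looks at the bind and protected-mode directives only, and it treats a missing protected-mode line as enabled.

```python
def remote_access_problems(conf_text):
    """List the settings in a Redis .conf text that would block remote clients."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out directives
        parts = line.split(None, 1)
        settings[parts[0].lower()] = parts[1].strip() if len(parts) > 1 else ""

    problems = []
    if settings.get("bind", "").split() == ["127.0.0.1"]:
        problems.append("bind 127.0.0.1 is still active")
    if settings.get("protected-mode", "yes") != "no":
        problems.append("protected-mode is not set to no")
    return problems
```

To use it on the real file, read the text first, for example with open(r"E:\redis\redis.windows-service.conf").read(), and pass the result in. An empty list means both changes are in place.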
(7) Now start the redis-server service on the server machine
In cmd, enter the start command (redis-server followed by the absolute path of the configuration file):
redis-server E:\redis\redis.windows-service.conf
After you press Enter there may be no output at all. That is fine: if no error is reported, the service has started, so don't panic.
(8) Start the crawler on the server and on the distributed machines
Once started, the crawler waits for a value under the key dangdang in the Redis database. As soon as that key has a value, the crawler begins.
redis_key = 'dangdang'
(9) Since I don't know the Redis commands and don't want to learn them, I simply install a graphical client, similar to the "little dolphin" GUI for MySQL, to work with the data
A. Download: redis-desktop-manager-0.8.8.384.exe
Link: https://pan.baidu.com/s/1-3e86v5nBgnEmrxni63iJQ
Extraction code: kun4
B. Install: accept the defaults and click Next all the way through
C. Launch the Redis GUI tool

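D. Connect to the Redis service (the local Redis service must be running first, otherwise the connection fails)
Follow the screenshot:
E. Double-click root in the upper-left corner. If an error appears, check whether the service is running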

F. Add the key and value


After you click save, watch the crawler terminals right away: the crawlers on the server and on the distributed machines start crawling in a distributed fashion. Nothing else needs to be done.
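The same key can also be filled from a script instead of the GUI. This is a minimal sketch, assuming the third-party redis-py package (pip install redis) and a spider that reads its start URLs from a Redis list; seed_start_urls and make_client are hypothetical helper names, and the URL in the usage note is a placeholder.

```python
def seed_start_urls(client, key, urls):
    """Push start URLs onto the Redis list that the waiting spiders read."""
    return client.lpush(key, *urls)


def make_client(host="127.0.0.1", port=6379):
    """Connect with redis-py; imported lazily so the helper above has no hard dependency."""
    import redis  # third-party package: pip install redis
    return redis.Redis(host=host, port=port)
```

Usage, with the Redis server running: seed_start_urls(make_client(), "dangdang", ["http://example.com/books"]). The return value is the length of the list after the push.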

(10) With a distributed architecture like scrapy_redis, we can also resume from a breakpoint. Even if you shut down the server, the distributed crawler continues from where it left off the next time it starts
After the crawler has run for a while, three keys appear in Redis:
dangdang:dupefilter holds the deduplication data
dangdang:items holds the saved items
dangdang:requests holds the Request objects that have not been requested yet; when a request is about to be made, it is popped from here
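To see how far a crawl has progressed, the sizes of these three keys can be read from a script. The sketch below is hedged: queue_stats is a hypothetical helper, and it assumes the dupefilter is stored as a Redis set, the items as a list, and the requests as a sorted set (which is what the priority-queue setting above would imply). The client is expected to be a redis-py style object.

```python
def queue_stats(client, spider_name):
    """Return the size of each scrapy_redis key for one spider."""
    return {
        "dupefilter": client.scard(f"{spider_name}:dupefilter"),  # set of seen requests
        "items": client.llen(f"{spider_name}:items"),              # list of saved items
        "requests": client.zcard(f"{spider_name}:requests"),       # pending requests
    }
```

If requests drops to zero while dupefilter keeps its size, the crawl has drained its queue; with SCHEDULER_PERSIST = True the dupefilter survives a restart, which is what makes resuming possible.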
4. This is the key part
References:
(1) Analysis of the principles and underlying code of scrapy_redis: https://cuiqingcai.com/6058.html
(2) Redis database basics: https://www.cnblogs.com/jinxiao-pu/p/6838011.html
5. Source code:
xxBook crawler: https://github.com/steamfeifei/Scrapy_redis_spiderBook