How to tell whether an IP belongs to a crawler
2022-08-08 21:32:00 【oHuangBing】
Determine the crawler by IP
When you look at a server log packed with IP addresses, can you tell at a glance which IPs are crawlers and which are normal users?
In such a dense log, it is hard to pick out the real crawler IPs, and just as hard to spot the fake ones.
As a first step, we can use the User-agent in each log entry to judge whether a request comes from a crawler or a normal user, for example:
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
This is the SemrushBot crawler.
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
This is the Bing search engine crawler.
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.97 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This is the Google search engine crawler.
......
These are all crawler User-agents. However, anyone who has written a crawler knows that the User-agent can be faked, so judging crawlers by User-agent alone is unreliable. We also need to check the IP address to decide whether a visitor really is a crawler.
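The User-agent check described above can be sketched roughly as follows. This is only a minimal illustration: the token table is built from the three User-agent strings quoted in this article, and the names `KNOWN_BOT_TOKENS` and `claimed_crawler` are hypothetical, not part of any library. It reports only what a request claims to be, which is exactly why it is not sufficient on its own.

```python
# Assumption: tokens are taken from the User-agent strings quoted in this
# article; a real deployment would need a longer, maintained list.
KNOWN_BOT_TOKENS = {
    "semrushbot": "SemrushBot",
    "bingbot": "Bing",
    "googlebot": "Google",
}


def claimed_crawler(user_agent):
    """Return the crawler a User-agent claims to be, or None.

    This only inspects the header text, which the client controls,
    so the result is a claim to be verified, not a fact.
    """
    ua = user_agent.lower()
    for token, name in KNOWN_BOT_TOKENS.items():
        if token in ua:
            return name
    return None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
    print(claimed_crawler(ua))
```

A request whose User-agent matches one of these tokens still has to pass an IP check before it can be trusted.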
66.249.71.19 - - [19/May/2021:06:25:52 +0800] "GET /history/16521060410/2019 HTTP/1.1" 302 257 "-" "Mozilla/5.0 (Linux;Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.97 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
In the log entry above, the first field is the client IP. Can we be sure it is really a Google search engine crawler (spider) IP?
A reverse lookup of the IP shows that its hostname is: crawl-66-249-71-19.googlebot.com
Pinging this hostname resolves it back to the IP address: 66.249.71.19
Because the hostname belongs to googlebot.com and resolves back to the same IP, this is definitely a Google search engine crawler (spider) IP.
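The reverse lookup followed by a forward lookup can be automated with Python's standard `socket` module. The sketch below is one possible way to do it, not a definitive implementation: the function name `verify_crawler_ip`, the `reverse`/`forward` parameters (there only so the resolvers can be swapped out for testing), and the suffix list are assumptions made for illustration. The suffixes cover Google only and would need to be extended for other search engines.

```python
import socket

# Assumption: hostname suffixes for Google's crawlers. Other crawlers
# need their own suffixes; check each operator's documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip, suffixes=TRUSTED_SUFFIXES, reverse=None, forward=None):
    """Check an IP with a reverse DNS lookup plus forward confirmation.

    reverse: callable ip -> hostname (defaults to a real DNS query)
    forward: callable hostname -> list of IPs (defaults to a real DNS query)
    """
    if reverse is None:
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        forward = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse(ip)
    except OSError:  # no PTR record or lookup failure
        return False

    # Step 1: the hostname must belong to a trusted crawler domain.
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False

    # Step 2: the hostname must resolve back to the original IP,
    # otherwise the PTR record itself could be forged.
    try:
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Performs real DNS queries; the result depends on current DNS data.
    print(verify_crawler_ip("66.249.71.19"))
```

The forward confirmation matters because whoever controls an IP range also controls its reverse DNS record, so a matching hostname alone does not prove anything.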
For cases that are still uncertain, we can also use the IP query - crawler identification website to look up detailed information about the crawler. The operation is not described in detail here: enter the IP directly to see the crawler's details. You can also refer to the article "Crawler performs IP identification", which covers the specific usage.
With the steps above, it should be easy to judge by IP whether a visitor is a crawler.
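To apply these steps to raw log lines, the IP and User-agent first have to be extracted. The sketch below assumes the log uses the common "combined" access log layout, as the sample entry in this article appears to; the pattern and the name `parse_log_line` are illustrative and may need adjusting for a custom log format.

```python
import re

# Assumption: "combined" access log layout:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the named fields of a log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Sample entry copied from the log line shown in this article.
SAMPLE = (
    '66.249.71.19 - - [19/May/2021:06:25:52 +0800] '
    '"GET /history/16521060410/2019 HTTP/1.1" 302 257 "-" '
    '"Mozilla/5.0 (Linux;Android 6.0.1; Nexus 5X Build/MMB29P) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.97 '
    'Mobile Safari/537.36 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)

if __name__ == "__main__":
    fields = parse_log_line(SAMPLE)
    print(fields["ip"], fields["user_agent"])
```

The extracted `ip` is the value to verify by DNS, and `user_agent` is the claim being checked.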