How to tell whether an IP belongs to a crawler
2022-08-08 21:32:00 【oHuangBing】
Identifying crawlers by IP
When you open a server log and see a dense wall of IP addresses, can you tell at a glance which IPs are crawlers and which are normal users?
In a log that dense, it is hard to pick out the crawler IPs at all, let alone to separate genuine crawler IPs from fake ones.
A first step is to look at the User-Agent in each log entry to judge whether the request comes from a crawler or a normal user, for example:
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
This is the SemrushBot crawler.
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
This is the Bing search engine crawler.
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.97 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This is the Google search engine crawler.
......
These are all crawler User-Agents. But as anyone who has written a crawler knows, the User-Agent can be faked, so judging crawlers by User-Agent alone is unreliable. We also need to check the IP address to decide whether a request really comes from a crawler.
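As a first pass, the User-Agent check can be sketched in a few lines of Python. This is only an illustration: the function name `claimed_crawler` and the token table are assumptions made for this example, built from the three sample User-Agent strings above, and the result only says which crawler a request claims to be.

```python
# Tokens taken from the sample User-Agent strings in this article.
# The table is illustrative, not a complete list of crawlers.
CRAWLER_TOKENS = {
    "semrushbot": "SemrushBot",
    "bingbot": "bingbot",
    "googlebot": "Googlebot",
}


def claimed_crawler(user_agent):
    """Return the crawler name a User-Agent claims to be, or None.

    This only inspects the header text, which a client can set freely,
    so a match is a claim to verify by IP, not proof.
    """
    ua = user_agent.lower()
    for token, name in CRAWLER_TOKENS.items():
        if token in ua:
            return name
    return None
```

For instance, passing the bingbot User-Agent string above returns `"bingbot"`, while an ordinary browser User-Agent returns `None`.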
66.249.71.19 - - [19/May/2021:06:25:52 +0800] "GET /history/16521060410/2019 HTTP/1.1" 302 257 "-" "Mozilla/5.0 (Linux;Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.97 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
In the log entry above, the first field is the IP of the client, which claims to be Googlebot. But can we be sure it really is a Google search engine crawler (spider) IP?
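To pull the IP and User-Agent out of entries like the one above, a small parser can help. The sketch below assumes the log uses the common "combined" access-log layout that the sample entry appears to follow; the pattern and the name `parse_log_line` are made up for this example and may need adjusting for other log formats.

```python
import re

# Assumed layout: ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Applied to the sample entry, the `ip` field is `66.249.71.19` and the `user_agent` field contains `Googlebot/2.1`.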
We can look up the hostname for this IP (a reverse DNS lookup), which gives: crawl-66-249-71-19.googlebot.com
Pinging this hostname resolves it back to the IP address: 66.249.71.19
The hostname belongs to googlebot.com and resolves back to the same IP, so this is definitely a Google search engine crawler (spider) IP.
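The two lookups above (IP to hostname, then hostname back to IP) can be automated. The following is a minimal sketch using Python's standard `socket` module, not a definitive implementation. The function name `verify_crawler_ip` and the injectable `reverse_lookup` / `forward_lookup` parameters are assumptions made for this example, and the list of trusted hostname suffixes should be taken from the search engine's own documentation. Note that `socket.gethostbyname_ex` handles IPv4 only.

```python
import socket


def verify_crawler_ip(ip, trusted_suffixes, reverse_lookup=None, forward_lookup=None):
    """Check an IP with a reverse lookup followed by a confirming forward lookup.

    trusted_suffixes: hostname endings to accept, e.g. (".googlebot.com",).
    reverse_lookup / forward_lookup: optional replacements for the DNS calls
    (hypothetical hooks, useful for testing without network access).
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, IPv4 addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False  # no reverse DNS record, or the lookup failed

    # The hostname must end in a domain the search engine actually uses.
    if not hostname.lower().rstrip(".").endswith(tuple(trusted_suffixes)):
        return False

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # The hostname must resolve back to the original IP.
    return ip in addresses
```

With network access, `verify_crawler_ip("66.249.71.19", (".googlebot.com",))` performs the same check that was done by hand above. The forward lookup matters because whoever controls an IP range can set its reverse DNS record to any name, but cannot make the real googlebot.com hostname resolve to their IP.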
For cases that are still uncertain, we can also use the "IP query - crawler identification" website to look up specific information about a crawler. The steps are not repeated here: enter the IP directly to see the crawler's details. You can also refer to the article "Crawler performs IP identification", which describes the usage in detail.
With the steps above, it should be easy to judge from an IP whether the visitor is a crawler.