Anti-crawler (0): Are you still crawling naked with Selenium? You're being watched! Cracking WebDriver anti-crawler detection
2022-04-23 05:17:00 【zzzzls~】
A brief introduction to Selenium
When we fetch a page with requests, the result may differ from what we see in the browser: data that displays normally on the page is simply missing from the requests response. This is because requests only gets the raw HTML document, whereas the page in the browser is the result of JavaScript processing that data. The data can come from several sources: it may be loaded through Ajax, or it may be generated by JavaScript using a specific algorithm.
There are usually two solutions:
- Dig into the Ajax logic, work out the interface address and how its encrypted parameters are built, then reproduce it in Python and construct the Ajax request yourself.
- Simulate a browser and bypass this process entirely.
Here we mainly introduce the second approach: crawling with a simulated browser.
Selenium is an automated testing tool that can drive a browser to perform specific operations, such as clicking and scrolling down. It can also return the source of the page as currently rendered by the browser, so what you see is what you get. For pages rendered dynamically with JavaScript, this way of scraping is very effective!
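One way to decide whether a page needs a simulated browser is to compare the raw HTML with what the browser ends up rendering. The following is only a minimal sketch: it uses the standard library's urllib in place of requests, assumes `driver` is an already-created Selenium WebDriver, and all three helper names are made up for illustration.

```python
import urllib.request


def fetch_raw_html(url, timeout=10):
    """Download the HTML exactly as the server sends it (no JavaScript runs)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def fetch_rendered_html(driver, url):
    """Let the browser load the page, then return the DOM it rendered."""
    driver.get(url)
    return driver.page_source


def needs_browser(raw_html, rendered_html, needle):
    """True if `needle` shows up only after JavaScript has run."""
    return needle in rendered_html and needle not in raw_html
```

If `needs_browser` returns True for a piece of text you can see on the page, that text is produced by JavaScript and plain HTTP fetching will not be enough.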

Anti-crawler measures
However, opening a web page through Selenium and ChromeDriver is still somewhat different from opening it normally. Many websites now detect Selenium in order to stop malicious crawlers.
In most cases, the basic principle is to check whether the window.navigator object in the current browser window contains the webdriver property. In a normally used browser this property is undefined; once we use Selenium, it is initialized to true. Many websites implement a simple anti-Selenium check by testing this property in JavaScript.
At this point we might think of simply clearing the webdriver property with JavaScript, for example by calling the execute_script method to run the following code:
Object.defineProperty(navigator, "webdriver", {
get: () => undefined})
This JavaScript really does clear the webdriver property, but execute_script only runs it after the page has loaded, which is too late: the website has already checked the webdriver property before the page was rendered. So the approach above does not achieve the desired effect.
Countering the anti-crawler
For the anti-crawling measure in the example above, we can mainly use the following methods:
Configure Selenium options
option.add_experimental_option("excludeSwitches", ['enable-automation'])
However, ChromeDriver 79.0.3945.36 fixed the behavior whereby window.navigator.webdriver was undefined when "enable-automation" was excluded in non-headless mode. For this option to keep working, you need to roll Chrome back to a version before 79 and find the matching ChromeDriver version.
Of course, you can also consult the CDP (Chrome DevTools Protocol) documentation and call CDP commands from Selenium with driver.execute_cdp_cmd. The following code only needs to run once. As long as you don't close the window opened by this driver, the statement is executed before any of a site's own JS, no matter how many URLs you open, which hides webdriver.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# Hide the "Chrome is being controlled by automated test software" banner
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 4 takes the driver path through Service
# (executable_path= is deprecated in Selenium 4 and removed in later 4.x releases)
driver = webdriver.Chrome(service=Service(r"E:\chromedriver\chromedriver.exe"), options=options)
# Override navigator.webdriver before any page script runs
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
driver.get('https://www.baidu.com')
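To confirm that the override actually took effect, you can read the property back from the page after it loads. The sketch below wraps the same CDP call in a helper and adds a check. The function names are made up for illustration, and `driver` is assumed to be a Chrome WebDriver like the one created above.

```python
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def install_webdriver_patch(driver):
    """Register the script so it runs before page scripts on every new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )


def webdriver_flag(driver):
    """Return navigator.webdriver as the current page sees it.

    Selenium maps a JavaScript undefined to Python None.
    """
    return driver.execute_script("return navigator.webdriver")


# Typical use (needs a real browser, so it is left commented out):
# install_webdriver_patch(driver)
# driver.get("https://www.baidu.com")
# print(webdriver_flag(driver))
```

Call `install_webdriver_patch` before the first `driver.get`. Pages that are already loaded are not affected.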
In addition, the following options also remove the webdriver feature:
options = Options()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
Taking control of an already-open browser
Since a browser opened by Selenium carries some specific parameters, we can take a different route: open a real browser manually, and then simply take control of it with Selenium!
- Use the Chrome DevTools Protocol to open a browser. It allows clients to inspect and debug the Chrome browser.
(1) Close all open Chrome windows.
(2) Open CMD and enter the command below. Change the Chrome path to your local Chrome installation location; --remote-debugging-port can be any free port.
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
If the path is correct, a new Chrome window opens.
- Use Selenium to connect to this open Chrome window:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# The port here must match the port used in the previous step
# Most other blogs use 127.0.0.1:9222 here; in testing it could not connect, so localhost:9222 is suggested
# For the specific reason, see: https://www.codenong.com/6827310/
options.add_experimental_option("debuggerAddress", "localhost:9222")
# Selenium 4 takes the driver path through Service (executable_path= is deprecated)
driver = webdriver.Chrome(service=Service(r"E:\chromedriver\chromedriver.exe"), options=options)
driver.get('https://www.baidu.com')
However, this method has a drawback:
Once the browser has started, browser options configured in Selenium no longer take effect, for example --proxy-server. Of course, you can add them when you first launch Chrome.
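The manual launch step can also be scripted, which makes it easy to pass options such as --proxy-server at startup. This is a rough sketch under a few assumptions: the function names are invented for illustration, --user-data-dir is added as an optional extra so the debug instance uses its own profile, and the /json/version check relies on the HTTP endpoint that Chrome's remote debugging port exposes.

```python
import json
import subprocess
import urllib.request


def build_chrome_command(chrome_path, port=9222, user_data_dir=None, proxy=None):
    """Assemble the command line for a Chrome that accepts debugger connections."""
    cmd = [chrome_path, f"--remote-debugging-port={port}"]
    if user_data_dir:
        # A separate profile directory keeps this instance apart from your daily one.
        cmd.append(f"--user-data-dir={user_data_dir}")
    if proxy:
        # Proxy settings must be given at launch; Selenium cannot add them later.
        cmd.append(f"--proxy-server={proxy}")
    return cmd


def launch_chrome(cmd):
    """Start Chrome in the background and return the process handle."""
    return subprocess.Popen(cmd)


def debugger_info(port=9222, timeout=2):
    """Return the DevTools version info as a dict, or None if nothing answers."""
    url = f"http://localhost:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        return None
```

After `launch_chrome(build_chrome_command(...))`, poll `debugger_info()` until it returns a dict, then attach Selenium with the same `debuggerAddress` as in the snippet above.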
mitmproxy as a man-in-the-middle
mitmproxy works on a principle similar to packet capture tools such as Fiddler and Charles. Acting as a third party, it poses as your browser when sending requests to the server, and the response returned by the server is passed on to your browser through it. You can write a script to change the data as it passes through, and in this way "deceive" both the server and the client.
Some websites use a separate JS file to detect webdriver. We can use mitmproxy to intercept the JS file that identifies webdriver and forge the correct result.
Reference: Using mitmproxy + Python as an intercepting proxy
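As an illustration of the interception idea, here is a small mitmproxy script sketch. The file name `/detect.js` and the assumption that the detection script reads `navigator.webdriver` directly are both hypothetical. A real site will need its own matching rule and replacement.

```python
# Save as e.g. patch_detect.py and run:  mitmdump -s patch_detect.py
TARGET_SUFFIX = "/detect.js"    # hypothetical name of the detection script
PROBE = "navigator.webdriver"   # expression the script is assumed to read
FAKE_RESULT = "false"           # constant the page should see instead


def patch_js(source):
    """Replace every read of navigator.webdriver with a constant."""
    return source.replace(PROBE, FAKE_RESULT)


def response(flow):
    """mitmproxy event hook: called when a full server response has arrived."""
    path = flow.request.path.split("?", 1)[0]  # ignore the query string
    if path.endswith(TARGET_SUFFIX) and flow.response is not None:
        flow.response.text = patch_js(flow.response.text)
```

The browser must be configured to use mitmproxy as its proxy (for example with --proxy-server at launch) and to trust the mitmproxy certificate for HTTPS sites.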
To be continued …
In fact, it's not just webdriver. After Selenium opens the browser, the following signatures can also be present:
webdriver
__driver_evaluate
__webdriver_evaluate
__selenium_evaluate
__fxdriver_evaluate
__driver_unwrapped
__webdriver_unwrapped
__selenium_unwrapped
__fxdriver_unwrapped
_Selenium_IDE_Recorder
_selenium
calledSelenium
_WEBDRIVER_ELEM_CACHE
ChromeDriverw
driver-evaluate
webdriver-evaluate
selenium-evaluate
webdriverCommand
webdriver-evaluate-response
__webdriverFunc
__webdriver_script_fn
__$webdriverAsyncExecutor
__lastWatirAlert
__lastWatirConfirm
__lastWatirPrompt
...
If you don't believe it, we can do an experiment: open this website with a normal browser, with Selenium + Chrome, and with Selenium + headless Chrome, and compare the results: https://bot.sannysoft.com/
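You can also probe a page yourself for some of the markers listed above. The sketch below is only a rough check: it tests whether a name exists on `window` or `document`, while real detection scripts may look in other places such as element attributes or event names. The helper names are made up, and `driver` is assumed to be a Selenium WebDriver with a page already loaded.

```python
import json

# A subset of the markers from the list above.
MARKERS = (
    "__driver_evaluate", "__webdriver_evaluate", "__selenium_evaluate",
    "__fxdriver_evaluate", "__driver_unwrapped", "__webdriver_unwrapped",
    "__selenium_unwrapped", "__fxdriver_unwrapped", "_Selenium_IDE_Recorder",
    "_selenium", "calledSelenium", "_WEBDRIVER_ELEM_CACHE",
    "__webdriverFunc", "__webdriver_script_fn", "__$webdriverAsyncExecutor",
)


def build_probe_script(markers=MARKERS):
    """Build JavaScript that returns the markers visible in the current page."""
    names = json.dumps(list(markers))
    return (
        "const names = " + names + ";\n"
        "const found = names.filter(n => n in window || n in document);\n"
        # navigator.webdriver always exists as a property, so test its value.
        "if (navigator.webdriver) { found.unshift('navigator.webdriver'); }\n"
        "return found;"
    )


def find_markers(driver, markers=MARKERS):
    """Run the probe in the browser and return the list of markers found."""
    return driver.execute_script(build_probe_script(markers))
```

Running `find_markers(driver)` before and after applying the countermeasures in this article gives a quick view of which traces are still visible.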

Of course, these examples are not meant to undermine your confidence. I just hope you won't become complacent after learning a few techniques. Keep an open mind and keep moving forward with a passion for technology. The war between crawlers and anti-crawlers, a war without gunsmoke, is still going on …
Copyright notice
This article was written by [zzzzls~]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204220547251831.html