Anti-crawler (0): Still scraping bare with Selenium? You're being watched! Defeating the webdriver anti-crawler check
2022-04-23 05:17:00 【zzzzls~】
A brief introduction to Selenium
When we fetch a page with requests, the result may differ from what you see in the browser: data that displays normally on the page simply isn't in the requests response. This is because requests only retrieves the raw HTML, while the page in the browser is the result of JavaScript processing the data. That data can come from many sources: it may be loaded via Ajax, or generated by JavaScript through some specific calculation.
There are usually two solutions:
- Dig into the Ajax logic, work out the interface address and how its encrypted parameters are constructed, then reproduce it in Python to build the Ajax requests yourself.
- Simulate a browser and bypass that whole process.
Here we focus on the second approach: crawling with a simulated browser.
Selenium is an automated testing tool that can drive a browser to perform specific actions such as clicking and scrolling, and it can also retrieve the source code of the page as the browser has currently rendered it, achieving "what you see is what you get". For pages dynamically rendered with JavaScript, this kind of scraping is very effective!
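As a minimal sketch (any JavaScript-rendered page works as the target; chromedriver is assumed to be on PATH or resolved automatically), this is roughly what such a scrape looks like: after the browser has rendered the page, page_source returns the post-JavaScript HTML that a plain requests call would never see.

from selenium import webdriver

# chromedriver must be discoverable (on PATH or via Selenium Manager)
driver = webdriver.Chrome()
driver.get("https://www.baidu.com")

# page_source is the DOM after JavaScript has run, i.e. what you actually see in the browser
html = driver.page_source
print(len(html))

driver.quit()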
Anti-crawler detection
However, opening a page with Selenium driving ChromeDriver still differs in certain ways from opening it normally, and many websites now detect Selenium in order to block malicious crawling.
Most of the time, the basic detection principle is to check whether the window.navigator object in the current browser window contains the webdriver property. When the browser is used normally, this property is undefined; once we use Selenium, it is initialized to true. Many websites use a simple piece of JavaScript that checks this property to implement basic anti-Selenium detection.
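A quick way to see this for yourself (a minimal sketch; the expression evaluated is the same one a site's detection script would check):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baidu.com")

# In an unmodified Selenium session this prints True; in a normal browser navigator.webdriver is undefined
print(driver.execute_script("return navigator.webdriver"))

driver.quit()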
At this point we might think of using JavaScript to simply blank out the webdriver property, for example by calling the execute_script method to run the following code:
Object.defineProperty(navigator, "webdriver", {
    get: () => undefined
})
This line of JavaScript really can blank out the webdriver property, but execute_script only runs it after the page has finished loading. That is too late: the site's own scripts have already checked the webdriver property during page rendering, long before our code runs, so this approach has no effect.
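For completeness, this is roughly what that (ineffective) attempt looks like in Python; by the time execute_script runs, the detection script on the page has already seen webdriver as true:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baidu.com")

# Executed only after the page has loaded: too late to fool detection that runs during rendering
driver.execute_script(
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)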
Countering the anti-crawler
For the detection measures described above, we can mainly use the following methods to get around them:
Configure Selenium options
option.add_experimental_option("excludeSwitches", ['enable-automation'])
However, ChromeDriver 79.0.3945.36 fixed the issue that window.navigator.webdriver was undefined in non-headless mode when "enable-automation" was excluded. To keep using this trick, you need to roll Chrome back to a version before 79 and find the matching ChromeDriver version; only then does it work!
Of course, you can also consult the CDP (Chrome DevTools Protocol) documentation and use driver.execute_cdp_cmd to issue CDP commands from Selenium. The following code only needs to run once: as long as you don't close this driver window, no matter how many URLs you open afterwards, the statement is executed before each site's own JS on every page, which hides the webdriver property.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# hide the "Chrome is being controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)
# redefine the webdriver value before any page script runs
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
driver.get('https://www.baidu.com')
In addition, the following options can also remove the webdriver fingerprint:
options = Options()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
Take over an already-open browser
Since a browser opened through Selenium carries these telltale parameters, we can take another route: open a real browser manually first, and then attach Selenium to it and control it from there!
- Use the Chrome DevTools Protocol (CDP) to open a browser; CDP allows clients to inspect and debug the Chrome browser.
(1) Close all open Chrome windows.
(2) Open CMD and enter the following command on the command line:
# Change this path to your local Chrome installation location
# --remote-debugging-port can be any free port
"C:\Program Files(x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
If the path is correct, a new Chrome window will open.
- Use Selenium to connect to this already-open Chrome window:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# The port here must match the one used in the previous step.
# Most other blog posts use 127.0.0.1:9222 here, but in testing it failed to connect;
# localhost:9222 is recommended instead. For the specific reason, see:
# https://www.codenong.com/6827310/
options.add_experimental_option("debuggerAddress", "localhost:9222")
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe", options=options)
driver.get('https://www.baidu.com')
However, this approach has some drawbacks: once the browser has been started this way, browser configuration set through Selenium no longer takes effect, for example --proxy-server and the like. Of course, you can simply add those flags when launching Chrome in the first place, as in the sketch below.
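A minimal sketch of that (the proxy address is only a placeholder):

# Pass extra flags at launch time, since Selenium options won't apply to an already-running Chrome
"C:\Program Files(x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --proxy-server="http://127.0.0.1:8080"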
mitmproxy: the man in the middle
mitmproxy works on a principle somewhat similar to packet-capture tools such as fiddler/charles: acting as a third party, it disguises itself as your browser and sends the request to the server, and the response the server returns is passed on to your browser. By writing a script you can modify the data as it passes through, "cheating" the server on one side and the client on the other.
Some websites use a separate JS file to report the webdriver detection result. With mitmproxy we can intercept that JS file which identifies the webdriver flag and forge the "correct" result.
Reference: Using mitmproxy + Python as an interception proxy
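As an illustration (a minimal sketch: the file name detect.js and the string being replaced are hypothetical placeholders, since the actual detection script varies from site to site), a mitmproxy addon that rewrites such a JS file might look roughly like this, run with mitmproxy -s rewrite_js.py:

# rewrite_js.py -- a hypothetical mitmproxy addon
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Only touch the (hypothetical) detection script
    if "detect.js" in flow.request.pretty_url:
        # Rewrite the check so the script always reports a non-webdriver browser
        flow.response.text = flow.response.text.replace(
            "navigator.webdriver", "false"
        )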
To be continued …
In fact, it's not just webdriver. After Selenium opens the browser, these signature markers will also be present:
webdriver
__driver_evaluate
__webdriver_evaluate
__selenium_evaluate
__fxdriver_evaluate
__driver_unwrapped
__webdriver_unwrapped
__selenium_unwrapped
__fxdriver_unwrapped
_Selenium_IDE_Recorder
_selenium
calledSelenium
_WEBDRIVER_ELEM_CACHE
ChromeDriverw
driver-evaluate
webdriver-evaluate
selenium-evaluate
webdriverCommand
webdriver-evaluate-response
__webdriverFunc
__webdriver_script_fn
__$webdriverAsyncExecutor
__lastWatirAlert
__lastWatirConfirm
__lastWatirPrompt
...
If you don't believe it, try an experiment: open the following site separately with a normal browser, with selenium + Chrome, and with selenium + Chrome headless, and compare the results: https://bot.sannysoft.com/
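A minimal sketch of the Selenium side of that experiment (the screenshot file name and the handful of markers probed here are just examples):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument("--headless")   # uncomment to compare the headless variant

driver = webdriver.Chrome(options=options)
driver.get("https://bot.sannysoft.com/")
driver.save_screenshot("selenium_fingerprint.png")   # compare against visiting the page in a normal browser

# Also probe a few of the markers listed above directly
markers = ["__webdriver_evaluate", "__selenium_unwrapped", "_Selenium_IDE_Recorder"]
found = driver.execute_script(
    "return arguments[0].filter(k => k in window || k in document);", markers
)
print("markers present:", found)

driver.quit()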
Of course, these examples are not meant to undermine your confidence. I just hope you don't become complacent after picking up a few techniques; stay humble and keep moving forward with a passion for technology. The smokeless war between crawlers and anti-crawlers goes on…
Copyright notice
This article was written by [zzzzls~]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204220547251831.html