How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
Whether a Selenium driven Firefox / GeckoDriver session gets detected doesn't depend on any specific GeckoDriver or Firefox version. The websites themselves can inspect the network traffic and the browser client, and can identify the web browser as WebDriver controlled.
As per the documentation of the WebDriver Interface in the latest editor's draft of WebDriver - W3C Living Document, the webdriver-active flag, which is initially set to false, is set to true when the user agent is under remote control, i.e. when controlled through Selenium.
Note that the NavigatorAutomationInformation interface should not be exposed on WorkerNavigator.
So,
webdriver
Returns true if webdriver-active flag is set, false otherwise.
whereas,
navigator.webdriver
Defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example so that alternate code paths can be triggered during automation.
So, the bottom line is:
Selenium identifies itself
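A minimal sketch of how this flag can be observed from a script: the helper below asks the page for navigator.webdriver through execute_script, which is a standard Selenium WebDriver method. The helper name is_webdriver_flag_set and the FakeDriver stand-in (used only so the snippet runs without a browser) are illustrative assumptions, not part of Selenium.

```python
def is_webdriver_flag_set(driver) -> bool:
    """Return True if the page reports navigator.webdriver as true."""
    # execute_script runs JavaScript in the current page and returns the result.
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Hypothetical stand-in for a real driver so the sketch runs without a browser."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # A real driver would evaluate the script inside the browser.
        return self.flag


if __name__ == "__main__":
    print(is_webdriver_flag_set(FakeDriver(True)))
```

With a real session, pass the object returned by selenium.webdriver.Firefox() instead of FakeDriver; under GeckoDriver the check is expected to report true, in line with the specification text quoted above.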
However, some generic approaches to avoid getting detected while web-scraping are as follows:
- One of the first attributes a website can use to identify your script/program is your monitor size, so it is recommended not to use the conventional viewport.
- If you need to send multiple requests to a website, keep changing the User Agent on each request. You can find a detailed discussion in Way to change Google Chrome user agent in Selenium?
- To simulate human-like behavior you may need to slow down the script execution even beyond WebDriverWait and expected_conditions by inducing time.sleep(secs). You can find a detailed discussion in How to sleep webdriver in python for milliseconds
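The viewport and timing suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the WINDOW_SIZES pool, the helper names and the delay bounds are made up for the example, while set_window_size is a standard WebDriver method and time.sleep is the call mentioned above.

```python
import random
import time

# Hypothetical pool of window sizes; choose values that suit your own use case.
WINDOW_SIZES = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]


def pick_window_size(rng=random):
    """Choose a window size at random instead of relying on the default viewport."""
    return rng.choice(WINDOW_SIZES)


def apply_window_size(driver, size):
    """Resize the browser window through the driver's set_window_size method."""
    width, height = size
    driver.set_window_size(width, height)


def human_pause(min_secs=0.5, max_secs=2.0, sleep=time.sleep, rng=random):
    """Sleep for a random interval and return the delay that was used."""
    delay = rng.uniform(min_secs, max_secs)
    sleep(delay)
    return delay
```

A typical call sequence would be apply_window_size(driver, pick_window_size()) right after the session starts, and human_pause() between page interactions. None of this changes navigator.webdriver; it only varies the more superficial signals.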
Can I run Selenium using geckodriver without a webpage detecting marionette?
Getting detected while using FirefoxDriver with Selenium is quite common now, as:
Selenium identifies itself
You can find a detailed discussion in How to make Selenium script undetectable using GeckoDriver and Firefox through Python?
Marionette
As per the documentation, Marionette is the automation driver for Mozilla’s Gecko engine. It can remotely control the UI and the internal JavaScript of a Gecko platform, such as Mozilla Firefox. It can control either the chrome (i.e. menus and functions) or the content (the webpage loaded inside the browsing context), giving a high level of control and the ability to replicate user actions. In addition to performing actions on the browser, Marionette can also read the properties and attributes of the DOM. Marionette shares much of the same API as Selenium/WebDriver, with additional commands to interact with Gecko’s chrome interface. Its goal is to replicate what Selenium does for web content, i.e. to enable the tester to send commands to remotely control a user agent.
We have also discussed in detail Why Firefox requires GeckoDriver? within this thread
Finally, in the discussion Difference between webdriver.firefox.marionette & webdriver.gecko.driver we discussed initializing Firefox sessions using legacy Firefox 47.x browsers and GeckoDriver enabled Firefox >47.x browsers. The conclusion was that when using Firefox browsers > v47.x you have to use GeckoDriver, which relies extensively on marionette. So configuring marionette as false won't help us out. With the latest versions of geckodriver, selenium and firefox, marionette is used by default.
If you still want to initialise a Firefox browsing session without using marionette, you need to configure "marionette" to false as follows:
System.setProperty("webdriver.gecko.driver", "C://path//to//geckodriver.exe");
DesiredCapabilities dc = new DesiredCapabilities();
dc.setCapability("marionette", false);
FirefoxOptions opt = new FirefoxOptions();
opt.merge(dc);
FirefoxDriver driver = new FirefoxDriver(opt);
driver.get("https://stackoverflow.com");
System.out.println("Application opened");
System.out.println("Page Title is : "+driver.getTitle());
driver.quit();
You can find a couple of relevant discussions in:
- org.openqa.selenium.SessionNotCreatedException: Unable to find a matching set of capabilities while initiating Firefox v37 through Selenium v3.11.0
- selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities with Firefox 46 through Selenium
- How can Geckodriver/Firefox work without Marionette? (running python selenium 3 against FF 53)
The other questions:
Can I change the setCapabilities on/off while the driver is running?: The short answer is No, you can't change the capabilities while the session initiated by the webdriver is In Progress, and you can find a couple of detailed discussions in:
- Change ChromeOptions in an existing webdriver
- Set capability on already running selenium webdriver
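Since capabilities are fixed for the lifetime of a session, the usual workaround is to end the session and start a new one with the changed options. A minimal sketch, assuming a caller-supplied factory: restart_with_options and make_driver are hypothetical names introduced for illustration, and only quit() is an actual WebDriver method.

```python
def restart_with_options(driver, make_driver, new_options):
    """Quit the running session and start a fresh one with new options."""
    # Capabilities cannot be changed in place, so the old session is closed first.
    driver.quit()
    # make_driver is any callable that builds a driver from an options object,
    # e.g. lambda opts: selenium.webdriver.Firefox(options=opts)
    return make_driver(new_options)
```

Any state held by the old session (cookies, open tabs, page position) is lost on restart, so this is only practical between independent tasks.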
Is it easier to do this using ChromeDriver?: Again the precise answer is No, ChromeDriver also gets detected and you can find a couple of detailed discussions in:
- Is there a version of selenium that is not detectable ? can selenium be truly undetectable?
- Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot
Outro
Here you can find a detailed discussion on Which Firefox browser versions supported for given Geckodriver version?
How to Conceal WebDriver in Geckodriver from BotD in Java?
When using a Selenium driven, GeckoDriver initiated Firefox browsing context, the webdriver-active flag is set to true when the user agent is under remote control. It is initially false.
where webdriver returns true if the webdriver-active flag is set, false otherwise.
As:
navigator.webdriver Defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example so that alternate code paths can be triggered during automation.
Further, @whimboo in his comments confirmed:
This implementation have to be conformant to this requirement. As such we will not provide a way to circumvent that.
Conclusion
So, the bottom line is:
Selenium identifies itself
and there is no way to conceal the fact that the browser is WebDriver driven.
Recommendations
However, some pundits have suggested different approaches, based on Firefox Profiles and Proxies, which are claimed to help conceal the fact that the Mozilla Firefox browser is WebDriver controlled, as follows:
selenium4 compatible python code
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
profile_path = r'C:\Users\Admin\AppData\Roaming\Mozilla\Firefox\Profiles\s8543x41.default-release'
options = Options()
# Start Firefox with the existing profile; set_preference('profile', ...) would only write an unused preference
options.add_argument('-profile')
options.add_argument(profile_path)
# Route traffic through a local SOCKS proxy (type 1 = manual proxy configuration)
options.set_preference('network.proxy.type', 1)
options.set_preference('network.proxy.socks', '127.0.0.1')
options.set_preference('network.proxy.socks_port', 9050)
options.set_preference('network.proxy.socks_remote_dns', False)
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = Firefox(service=service, options=options)
driver.get("https://www.google.com")
driver.quit()
Potential Solution
A potential solution would be to use the tor browser as follows:
selenium4 compatible python code
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import os
# Start the Tor executable shipped with Tor Browser (expected to provide the SOCKS proxy used below)
torexe = os.popen(r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Tor\tor.exe')
profile_path = r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default'
firefox_options = Options()
# Use the Tor Browser profile; set_preference('profile', ...) would only write an unused preference
firefox_options.add_argument('-profile')
firefox_options.add_argument(profile_path)
firefox_options.set_preference('network.proxy.type', 1)
firefox_options.set_preference('network.proxy.socks', '127.0.0.1')
firefox_options.set_preference('network.proxy.socks_port', 9050)
firefox_options.set_preference('network.proxy.socks_remote_dns', False)
# Drive the Firefox binary bundled with Tor Browser
firefox_options.binary_location = r'C:\Users\username\Desktop\Tor Browser\Browser\firefox.exe'
service = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=service, options=firefox_options)
driver.get("https://www.tiktok.com/")
References
You can find a couple of relevant detailed discussions in
- How to initiate a Tor Browser 9.5 which uses the default Firefox to 68.9.0esr using GeckoDriver and Selenium through Python
- How to connect to Tor browser using Python
- How to use Tor with Chrome browser through Selenium
Is there a version of Selenium WebDriver that is not detectable?
Whether a Selenium driven WebDriver session gets detected doesn't depend on any specific Selenium, Chrome or ChromeDriver version. The websites themselves can inspect the network traffic and the browser client, and can identify the web browser as WebDriver controlled.
However, some generic approaches to avoid getting detected while web-scraping are as follows:
- One of the first attributes a website can use to identify your script/program is your monitor size, so it is recommended not to use the conventional viewport.
- If you need to send multiple requests to a website, keep changing the user-agent on each request. You can find a detailed discussion in Way to change Google Chrome user agent in Selenium?
- To simulate human-like behavior you may need to slow down the script execution even beyond WebDriverWait and expected_conditions by inducing time.sleep(secs). You can find a detailed discussion in How to sleep webdriver in python for milliseconds
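For the user-agent suggestion, a rough sketch of rotating through a pool of strings might look like this. The strings in USER_AGENTS are placeholder examples and the helper names are hypothetical; --user-agent= is the Chrome command-line switch for overriding the user agent.

```python
import itertools

# Hypothetical example strings; substitute user agents that match browsers you actually target.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def user_agent_cycle(agents=USER_AGENTS):
    """Return an endless iterator that walks through the pool in order."""
    return itertools.cycle(agents)


def chrome_user_agent_argument(user_agent):
    """Build the Chrome command-line switch that overrides the user agent."""
    return f"--user-agent={user_agent}"
```

With Selenium, the resulting string would be passed to ChromeOptions.add_argument() before the driver is created, which in practice means one new driver instance per user agent.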
@Antoine Vastel, in his blog post Detecting Chrome Headless, describes several approaches that distinguish a regular Chrome browser from a headless Chrome browser.
User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With Chrome version 59 it has the following value:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36
A check for the presence of Chrome headless can be done through:
if (/HeadlessChrome/.test(window.navigator.userAgent)) {
console.log("Chrome headless detected");
}
Plugins: navigator.plugins returns an array of plugins present in the browser. Typically, on Chrome we find default plugins, such as Chrome PDF viewer or Google Native Client. By contrast, in headless mode, the array returned contains no plugins.
A check for the presence of Plugins can be done through:
if(navigator.plugins.length == 0) {
console.log("It may be Chrome headless");
}
Languages: In Chrome, two JavaScript attributes make it possible to obtain the languages used by the user: navigator.language and navigator.languages. The first one is the language of the browser UI, while the second one is an array of strings representing the user’s preferred languages. However, in headless mode, navigator.languages returns an empty string.
A check for the presence of Languages can be done through:
if(navigator.languages == "") {
console.log("Chrome headless detected");
}
WebGL: WebGL is an API to perform 3D rendering in an HTML canvas. With this API, it is possible to query for the vendor of the graphic driver as well as the renderer of the graphic driver. With a vanilla Chrome on Linux, we can obtain the following values for renderer and vendor: Google SwiftShader and Google Inc. In headless mode, we can obtain Mesa OffScreen, which is the technology used for rendering without using any sort of window system, and Brian Paul, who started the open source Mesa graphics library.
A check for the presence of WebGL can be done through:
var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');
var debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
if(vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
console.log("Chrome headless detected");
}
Not all Chrome headless will have the same values for vendor and renderer. Others keep values that could also be found on non-headless versions. However, Mesa OffScreen and Brian Paul indicate the presence of the headless version.
Browser features: The Modernizr library makes it possible to test whether a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.
A check for the presence of hairline feature can be done through:
if(!Modernizr["hairline"]) {
console.log("It may be Chrome headless");
}
Missing image: The last check on our list, which also seems to be the most robust, comes from the dimensions of the image used by Chrome when an image cannot be loaded. In a vanilla Chrome, the image has a width and height that depend on the zoom of the browser, but are different from zero. In a headless Chrome, the image has a width and a height equal to zero.
A check for the presence of Missing image can be done through:
var body = document.getElementsByTagName("body")[0];
var image = document.createElement("img");
image.src = "http://iloveponeydotcom32188.jg";
image.setAttribute("id", "fakeimage");
body.appendChild(image);
image.onerror = function(){
if(image.width == 0 && image.height == 0) {
console.log("Chrome headless detected");
}
}
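As a rough illustration of how these checks combine, the sketch below applies the user agent, plugins, languages and WebGL tests to a dictionary of values that is assumed to have been collected from the page beforehand (for example through execute_script). The function name headless_signals and the dictionary keys are assumptions made for this example; the thresholds and strings come from the checks described above.

```python
import re


def headless_signals(fingerprint):
    """Return the names of the checks that flag the fingerprint as headless Chrome."""
    signals = []
    # User agent check: headless Chrome advertises itself as HeadlessChrome.
    if re.search(r"HeadlessChrome", fingerprint.get("userAgent", "")):
        signals.append("user_agent")
    # Plugins check: headless mode reports an empty plugin array.
    if fingerprint.get("pluginCount") == 0:
        signals.append("plugins")
    # Languages check: headless mode reports an empty value.
    if "languages" in fingerprint and not fingerprint["languages"]:
        signals.append("languages")
    # WebGL check: vendor and renderer values reported in headless mode.
    if (fingerprint.get("webglVendor") == "Brian Paul"
            and fingerprint.get("webglRenderer") == "Mesa OffScreen"):
        signals.append("webgl")
    return signals
```

Keys that are missing from the dictionary are simply skipped, so a partial fingerprint does not count as a positive signal. As noted above, the plugin check on its own only suggests that the browser may be headless.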
References
You can find a couple of similar discussions in:
- How to bypass Google captcha with Selenium and python?
- How to make Selenium script undetectable using GeckoDriver and Firefox through Python?
tl; dr
- Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
- How does recaptcha 3 know I'm using selenium/chromedriver?
- Selenium and non-headless browser keeps asking for Captcha