Python requests. 403 Forbidden
It seems the page rejects GET
requests that do not identify a User-Agent
. I visited the page with a browser (Chrome) and copied the User-Agent
header of the GET
request (look in the Network tab of the developer tools):
import requests
url = 'http://worldagnetwork.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.content.decode())
# <!doctype html>
# <!--[if lt IE 7 ]><html class="no-js ie ie6" lang="en"> <![endif]-->
# <!--[if IE 7 ]><html class="no-js ie ie7" lang="en"> <![endif]-->
# <!--[if IE 8 ]><html class="no-js ie ie8" lang="en"> <![endif]-->
# <!--[if (gte IE 9)|!(IE)]><!--><html class="no-js" lang="en"> <!--<![endif]-->
# ...
Python requests response 403 forbidden
The code here
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page.text)
Always will get something as the following
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
</div>
<div class="cf-column">
<h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
<p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can
run an anti-virus scan on your device to make sure it is not infected with malware.</p>
The website is protected by CloudFlare. By standard means, there is minimal chance of being able to access the WebSite through automation such as requests or selenium. You are seeing 403 since your client is detected as a robot. There may be some arbitrary methods to bypass CloudFlare that could be found elsewhere, but the WebSite is working as intended. There must be a ton of data submitted through headers and cookies that show your request is valid, and since you are simply submitting only a user agent, CloudFlare is triggered. Simply spoofing another user-agent is not even close to enough to not trigger a captcha, CloudFlare checks for MANY things.
I suggest you look at selenium here since it simulates a real browser, or research guides to (possibly?) bypass Cloudflare with requests.
Update
Found 2 python libraries cloudscraper and cfscrape. Both are not usable for this site since it uses cloudflare v2 unless you pay for a premium version.
python requests: 403 Forbidden Error While Downlaoding Pdf File from www.researchgate.net
Forbidden was caused because of TLS verification failure. I had to change the TLS adapter to make it work.
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util import ssl_
CIPHERS = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:AES256-SHA"
class TLSAdapter(HTTPAdapter):
def __init__(self, ssl_options=0, *args, **kwargs):
self.ssl_options = ssl_options
super().__init__(*args, **kwargs)
def init_poolmanager(self, *args, **kwargs):
context = ssl_.create_urllib3_context(ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options)
self.poolmanager = PoolManager(*args, ssl_context=context, **kwargs)
url = "https://www.researchgate.net/profile/Corina-Florescu/publication/318596725_PositionRank_An_Unsupervised_Approach_to_Keyphrase_Extraction_from_Scholarly_Documents/links/5972182c0f7e9b40168fe63d/PositionRank-An-Unsupervised-Approach-to-Keyphrase-Extraction-from-Scholarly-Documents.pdf"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53",
}
adapter = TLSAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)
with requests.session() as session:
session.mount("https://www.researchgate.net/", adapter)
response = session.get(url, headers=headers)
print(response.status_code) # 200
Related Topics
Circular List Iterator in Python
Locale Date Formatting in Python
How to Keep Track of Class Instances
Permissionerror: [Errno 13] in Python
":=" Syntax and Assignment Expressions: What and Why
What's a Good Rate Limiting Algorithm
Which Python Packages Offer a Stand-Alone Event System
How to Take Column-Slices of Dataframe in Pandas
Pip Uses Incorrect Cached Package Version, Instead of the User-Specified Version
How to Install Both Python 2.X and Python 3.X in Windows
How to Execute Python File in Linux
Curses-Like Library for Cross-Platform Console App in Python
Installing Python Modules on Ubuntu
Python/Ipython Importerror: No Module Named Site
Prevent Sleep Mode Python (Wakelock on Python)
How to Make a Call to an Executable from Python Script