Python Requests: 403 Forbidden

It seems the page rejects GET requests that do not identify a User-Agent. I visited the page in a browser (Chrome) and copied the User-Agent header of its GET request (look in the Network tab of the developer tools):

import requests
url = 'http://worldagnetwork.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.content.decode())

# <!doctype html>
# <!--[if lt IE 7 ]><html class="no-js ie ie6" lang="en"> <![endif]-->
# <!--[if IE 7 ]><html class="no-js ie ie7" lang="en"> <![endif]-->
# <!--[if IE 8 ]><html class="no-js ie ie8" lang="en"> <![endif]-->
# <!--[if (gte IE 9)|!(IE)]><!--><html class="no-js" lang="en"> <!--<![endif]-->
# ...
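If you make several requests to the same site, you can set the header once on a `requests.Session` so every request carries it automatically. A minimal sketch; no network call is made here, since `prepare_request` only builds the request so we can inspect the headers that would be sent:

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it sends.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36"
})

# prepare_request builds the outgoing request without sending it,
# which lets us verify the headers that would go over the wire.
prepared = session.prepare_request(requests.Request("GET", "http://worldagnetwork.com/"))
print(prepared.headers["User-Agent"])
```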

Python requests response 403 forbidden

The following code

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page.text)

will always return something like the following:

 <div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>

<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
</div>

<div class="cf-column">
<h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>

<p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can
run an anti-virus scan on your device to make sure it is not infected with malware.</p>

The website is protected by Cloudflare, so by standard means there is minimal chance of accessing it through automation such as requests or Selenium. You are seeing a 403 because your client is detected as a bot. Methods to bypass Cloudflare can be found elsewhere, but the website is working as intended: a real browser submits a great deal of data through headers and cookies that marks its requests as legitimate, and since you are submitting only a User-Agent, Cloudflare is triggered. Spoofing a user agent alone is nowhere near enough to avoid the CAPTCHA; Cloudflare checks for many things.
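A practical consequence: before parsing a response, it is worth checking whether you received the real page or a Cloudflare challenge. A small illustrative helper (the marker strings are assumptions taken from the challenge page shown above, not an official Cloudflare API):

```python
def looks_like_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare CAPTCHA/challenge page."""
    markers = (
        "cf-section",  # wrapper class seen on the challenge page
        'data-translate="why_captcha_headline"',
        "Completing the CAPTCHA proves you are a human",
    )
    return status_code == 403 and any(m in body for m in markers)

challenge_html = (
    '<div class="cf-section cf-wrapper">'
    '<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>'
    '</div>'
)
print(looks_like_cloudflare_challenge(403, challenge_html))        # True
print(looks_like_cloudflare_challenge(200, "<html>real page</html>"))  # False
```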

I suggest you look at Selenium, since it simulates a real browser, or research guides that may (or may not) bypass Cloudflare with requests.

Update
I found two Python libraries, cloudscraper and cfscrape. Neither is usable for this site, since it uses Cloudflare v2, unless you pay for a premium version.

python requests: 403 Forbidden Error While Downloading PDF File from www.researchgate.net

The 403 Forbidden was caused by a TLS verification failure; I had to mount a custom TLS adapter to make it work.

import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager
from urllib3.util import ssl_

CIPHERS = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:AES256-SHA"

class TLSAdapter(HTTPAdapter):

    def __init__(self, ssl_options=0, *args, **kwargs):
        self.ssl_options = ssl_options
        super().__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Build an SSL context with a restricted cipher list and the
        # requested options (here: disabling TLS 1.0 and 1.1).
        context = ssl_.create_urllib3_context(
            ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options
        )
        self.poolmanager = PoolManager(*args, ssl_context=context, **kwargs)

url = "https://www.researchgate.net/profile/Corina-Florescu/publication/318596725_PositionRank_An_Unsupervised_Approach_to_Keyphrase_Extraction_from_Scholarly_Documents/links/5972182c0f7e9b40168fe63d/PositionRank-An-Unsupervised-Approach-to-Keyphrase-Extraction-from-Scholarly-Documents.pdf"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53",
}
adapter = TLSAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)

with requests.Session() as session:
    # Use the custom adapter only for this host.
    session.mount("https://www.researchgate.net/", adapter)
    response = session.get(url, headers=headers)
    print(response.status_code)  # 200

