File_Get_Contents Returns 403 Forbidden

file_get_contents() gives me 403 Forbidden

This is not a problem in your script; it's a feature of your partner's web server security.

It's hard to say exactly what's blocking you; most likely it's some sort of block against scraping. If your partner has access to his web server's setup, that might help pinpoint the cause.

What you could do is "fake a web browser" by setting the User-Agent header so that your request imitates a standard web browser.

I would recommend cURL for this, and good documentation for it is easy to find.

// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "example.com");

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// pretend to be a regular browser by sending a browser User-Agent string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

// $output contains the output string
$output = curl_exec($ch);

// close curl resource to free up system resources
curl_close($ch);

How to handle 403 error in file_get_contents()?

Personally, I suggest you use cURL instead of file_get_contents. file_get_contents is great for basic, content-oriented GET requests, but it gives you little control over headers, the HTTP request method, timeouts, redirects, and other important details.
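To illustrate that point, here is a hedged sketch (placeholder URL and values, not from the original answer) of the kind of control cURL gives you over headers, the request method, timeouts, and redirects:

$ch = curl_init('https://example.com/api');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,              // return the body as a string
    CURLOPT_CUSTOMREQUEST  => 'GET',             // explicit HTTP method
    CURLOPT_HTTPHEADER     => ['Accept: application/json'],
    CURLOPT_FOLLOWLOCATION => true,              // follow redirects
    CURLOPT_MAXREDIRS      => 5,
    CURLOPT_CONNECTTIMEOUT => 5,                 // seconds to wait for the connection
    CURLOPT_TIMEOUT        => 10,                // overall timeout in seconds
]);
$body = curl_exec($ch);
curl_close($ch);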

Nevertheless, to detect the status code (403, 200, 500, etc.) you can use the get_headers() call or the auto-assigned $http_response_header variable.

$http_response_header is a predefined variable that is updated on each file_get_contents call.

The following code extracts the status code (403, 200, etc.) from it directly:

preg_match( "#HTTP/[0-9\.]+\s+([0-9]+)#", $http_response_header[0], $match);
$statusCode = intval($match[1]);
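Alternatively, a minimal sketch (using a placeholder URL) of checking the status with get_headers(), which performs its own request and returns only the headers:

$headers = get_headers('https://example.com');
// $headers[0] is the status line, e.g. "HTTP/1.1 403 Forbidden"
preg_match('#HTTP/[0-9.]+\s+([0-9]+)#', $headers[0], $match);
$statusCode = intval($match[1]);

if ($statusCode === 403) {
    // handle the forbidden case here
}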

For more information and the contents of these variables, please check the official documentation:

$http_response_header — HTTP response headers

get_headers — Fetches all the headers sent by the server in response to an HTTP request

(Better Alternative) cURL

A warning about $http_response_header (from php.net):

Note that the HTTP wrapper has a hard limit of 1024 characters for the header lines. Any HTTP header received that is longer than this will be ignored and won't appear in $http_response_header. The cURL extension doesn't have this limit.
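If you do need the raw response headers with cURL, here is a minimal sketch (placeholder URL, not from the original answer) that collects them with CURLOPT_HEADERFUNCTION and reads the status code afterwards:

$headers = [];
$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// collect each response header line as it arrives; the 1024-character limit does not apply here
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$headers) {
    $headers[] = trim($line);
    return strlen($line);   // tell cURL how many bytes of the header line were handled
});
$body = curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // e.g. 403
curl_close($ch);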

file_get_contents returns 403 forbidden

This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.

It could be that the server blocks PHP scripts to prevent scraping, or blocks your IP because you have made too many requests.

You should probably talk to the administrator of the remote server.

Is there a way to get round a 403 error with php file_get_contents?

You need to add a User-Agent header to the request, which you can do with a stream context:

$context = stream_context_create(
    array(
        'http' => array(
            'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        ),
    )
);

You could also use the user_agent option:

$context = stream_context_create(
    array(
        'http' => array(
            'user_agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        ),
    )
);

Both of the above examples should work, and you should then be able to get the contents using:

$content = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000', false, $context);

echo $content;

This could of course also be tested using curl from the command line. Notice that we are setting our own User-Agent header:

curl --verbose -H 'User-Agent: YourApplication/1.0' 'https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000'

It might also be worth knowing that the default User-Agent sent by curl seems to be blocked, so when using curl you need to set your own with the -H flag.

file_get_contents returns 403 forbidden with user agent - PHP

This website has 3 anti-bot systems:

  1. Riskified.
  2. Forter.
  3. Cloudflare.

They are used to prevent DoS/DDoS attacks, crawling tasks, and so on. Basically, you can't easily crawl this site with a simple request.

To bypass them you need to simulate or use a real browser. You can use Selenium or Playwright.

I will show you an example of crawling this website with Playwright and Python.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.webkit.launch(headless=True)
    baseurl = "https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795"
    page = browser.new_page()
    page.goto(baseurl)
    title = page.wait_for_selector("//a[@data-test='product-brand']")
    name = page.wait_for_selector("//span[@data-test='product-name']")
    price = page.wait_for_selector("//span[@data-test='product-price']")
    print("Title: " + title.text_content())
    print("Name: " + name.text_content())
    print("Price: " + price.text_content())
    browser.close()

I hope I have been able to help you.

403 error when using file_get_contents()

I've managed to fix it by adding the following code...

ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0)');

...as per this answer.
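For context, a minimal sketch of the same fix in use (the URL here is a placeholder, not from the original answer):

// make file_get_contents send a browser-like User-Agent for all stream requests
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0)');

$content = file_get_contents('https://example.com/');
if ($content !== false) {
    echo $content;
}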

file_get_contents() gets 403 from api.github.com every time

This happens because GitHub requires you to send a User-Agent header. It doesn't need to be anything specific; this will do:

$opts = [
    'http' => [
        'method' => 'GET',
        'header' => [
            'User-Agent: PHP'
        ]
    ]
];

$context = stream_context_create($opts);
$content = file_get_contents("https://api.github.com/zen", false, $context);
var_dump($content);

The output is:

string(35) "Approachable is better than simple."

