Python Requests.Get Always Get 404

Web servers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out what the differences are between the request Python sends and the request the other client sends.

That means you need to:

  • Record all aspects of the working request
  • Record all aspects of the failing request
  • Try out what changes you can make to make the failing request more like the working request, and minimise those changes.

I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
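For example, a quick way to see exactly which headers requests sends is to have httpbin echo them back (a minimal sketch; the /headers endpoint simply reflects the request headers it received):

import requests

# httpbin echoes the request back, so you can compare what requests sends
# with what your browser sends for the same URL.
resp = requests.get('https://httpbin.org/headers')
print(resp.json()['headers'])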

For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:

  • Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from the same address. requests sets this one for you.
  • Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
  • Connection: leave this to the client to manage
  • Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser does); see the sketch after this list.
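A minimal sketch of that cookie handling (the login URL and form fields here are hypothetical, not from any specific site):

import requests

session = requests.Session()
# Log in first; the session stores any cookies the server sets.
session.post('https://example.com/login',               # hypothetical URL
             data={'user': 'me', 'password': 'secret'}) # hypothetical fields
# Subsequent requests send those cookies automatically, like a browser would.
resp = session.get('https://example.com/members')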

Everything else is fair game, but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.

In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the User-Agent header to almost any other value already works:

>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>

Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
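A minimal sketch of the requests-html approach (it downloads a headless Chromium on first use):

from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('https://rent.591.com.tw')
r.html.render()     # runs the page's scripts in headless Chromium
print(r.html.html)  # the HTML after JavaScript has executed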

The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
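A sketch of requesting that endpoint directly; the site may also expect cookies or tokens from the initial page load, so this may need adjusting:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Custom'  # avoid the Python blacklist shown above
session.get('https://rent.591.com.tw')    # pick up any cookies first
resp = session.get(
    'https://rent.591.com.tw/home/search/rsList',
    params={'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1},
)
print(resp.json())  # assuming the endpoint returns JSON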

Next, well-built sites will use security best practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies, or otherwise extract the extra information a server expects to be passed from one request to another.
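A hypothetical sketch of that GET-then-POST dance (the URL and the csrf_token field name are illustrative only, not from any specific site):

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

session = requests.Session()
page = session.get('https://example.com/form')           # GET the form first
token = BeautifulSoup(page.text, 'html.parser').find(
    'input', {'name': 'csrf_token'})['value']            # hypothetical field name
session.post('https://example.com/form',
             data={'csrf_token': token, 'field': 'value'})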

Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.

Python requests.get(URL) returns 404 error when using URL with dot

Strangely, it seems that if you send the User-Agent header, even with an empty value, the server responds with a 200:

>>> requests.get('https://finance.yahoo.com/quote/AFLT.ME', headers={'User-Agent': ''})
<Response [200]>

Edit: The same issue was reported here: https://stackoverflow.com/a/68259438/9835872

404 status code while making an HTTP request via Python's requests library, although the page loads fine in a browser

The website you mentioned is checking for a "User-Agent" header in the request. You can fake the "User-Agent" by passing a dict of custom headers to your requests.get(...) call. That makes the request look like it is coming from an actual browser, and you'll receive the response.

For example:

>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

>>> # Make the request with the "User-Agent" header
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200

>>> response.text  # returns the website content

Python requests.get showing 404 while URL does exist

Without even waiting for your test, I'm pretty confident I know what your bug is.

"I put this URL manually in the function call and it works fine, but if I read the file and call the function directly with that URL, it gives me an error. I have put 3-4 checks while reading the file, and the URL is coming from the file perfectly. I even tried to print the URL inside the called function, and I'm receiving it in the function too. I still have no clue what is happening."

Most likely you're reading the URL with something like for line in file: or file.readline or some other function that preserves newlines. So, what you actually end up with is not this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'

… but this:

url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'

The latter will be escaped by requests into something that's a perfectly good URL for a resource that doesn't exist, hence the 404 error.

The best way to check this is to print(repr(url)) instead of print(url). This will also reveal other possible problems, like embedded nonprintable characters. It won't find everything, such as Unicode characters that look like a . but actually aren't, but it's a good first test. (And if that doesn't find it, for a second test, copy and paste from the output, quotes and all, into your test script.)
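For example, with the URL above, the stray newline is invisible with print() but obvious with repr():

>>> url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'
>>> print(url)
http://www.leboncoin.fr/montres_bijoux/671762293.htm

>>> print(repr(url))
'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'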

If this is the problem, the fix is simple:

url = url.rstrip()
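Or, applied at the point where you read the file (a minimal sketch; 'urls.txt' is a hypothetical filename):

import requests

with open('urls.txt') as f:
    for line in f:
        url = line.rstrip()  # drop the trailing newline
        response = requests.get(url)
        print(url, response.status_code)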

Simple GET keeps returning 404 while in the browser it works perfectly

The website doesn't like HTTP(S) requests coming from Python code. By default, requests sets the following request headers:

{
    'User-Agent': 'python-requests/2.19.1',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive'
}
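You can verify what your installed version sends with requests.utils.default_headers(); the exact values vary by version:

>>> import requests
>>> requests.utils.default_headers()
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}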

If you set another, less obvious User-Agent, it should work fine. For example:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
result = requests.get('https://www.transfermarkt.co.uk', headers=headers)

API Requests throwing a 404 error due to bad URL, URL is being changed automatically, what is going on?

The url parameter in the requests.get call is probably incorrect; an API endpoint rarely ends in a trailing slash.
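If you suspect the URL is being rewritten, note that requests follows redirects by default; a quick diagnostic sketch (the endpoint here is hypothetical):

import requests

response = requests.get('https://api.example.com/v1/items/')
print(response.history)  # e.g. [<Response [301]>] if the server redirected
print(response.url)      # the final URL that was actually fetched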


