Python requests.get always gets 404
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
- Record all aspects of the working request
- Record all aspects of the failing request
- Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
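You can also experiment without any network round-trip at all: `requests` lets you build a `PreparedRequest` and inspect exactly which headers it would transmit, defaults included. A minimal sketch (the httpbin URL is just a placeholder target):

```python
import requests

# Prepare (but don't send) a request to see exactly which headers requests
# would transmit, including its automatic defaults. Compare this against
# what a working client (browser, curl) sends to the same endpoint.
req = requests.Request('GET', 'https://httpbin.org/headers',
                       headers={'User-Agent': 'Custom'})
prepared = requests.Session().prepare_request(req)
print(dict(prepared.headers))
```

The explicit `User-Agent` overrides the `python-requests/...` default, while the other session defaults (`Accept`, `Accept-Encoding`, `Connection`) are merged in automatically.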
For `requests`, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
- `Host`: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. `requests` sets this one.
- `Content-Length` and `Content-Type`: for POST requests, these are usually set from the arguments you pass to `requests`. If they don't match, alter the arguments you pass in (but watch out with `multipart/*` requests, which use a generated boundary recorded in the `Content-Type` header; leave generating that to `requests`).
- `Connection`: leave this to the client to manage.
- `Cookies`: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a `requests.Session()` object and that you are logged in (supply credentials the same way the browser did).
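As a minimal sketch of the cookie point: a `requests.Session()` keeps a cookie jar that is replayed on every subsequent request, the same way a browser would. The cookie name and domain below are hypothetical:

```python
import requests

session = requests.Session()
# Cookies set by any response land in session.cookies automatically; you can
# also seed the jar by hand (cookie name and domain here are made up):
session.cookies.set('sessionid', 'abc123', domain='example.com')
print(dict(session.cookies))
# Every later session.get()/session.post() to that domain now carries this
# cookie, so log in with session.post(...) first and reuse the same session.
```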
Everything else is fair game, but if `requests` has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the `User-Agent` header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting `Python`; setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that `requests` is not a browser. `requests` is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your `requests` results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with `requests` as needed. If all else fails, use a project like `requests-html`, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to `https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1`; take that into account if you are trying to scrape data from this site.
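You can reproduce that AJAX call with `requests` directly. The sketch below only prepares the request; the parameter values are copied from the URL above, and the response format is something you'd have to inspect yourself:

```python
import requests

# Recreate the AJAX request the site makes; parameter values are taken
# from the URL above. Actually sending it is left commented out.
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
req = requests.Request('GET', 'https://rent.591.com.tw/home/search/rsList',
                       params=params,
                       headers={'User-Agent': 'Custom'}).prepare()
print(req.url)  # the fully encoded endpoint URL
# response = requests.Session().send(req)  # then response.json(), presumably
```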
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
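A sketch of that GET-then-POST flow: fetch the form page, extract the token the server embedded, and send it back with the POST. The field name `csrf_token`, the URLs, and the stand-in HTML below are all hypothetical; inspect the real form to find the actual names.

```python
import re
import requests

# Typical CSRF flow: GET the form first, pull out the embedded token,
# then include it in the POST. A Session keeps the matching cookie.
session = requests.Session()
# html = session.get('https://example.com/form').text
html = '<input type="hidden" name="csrf_token" value="deadbeef">'  # stand-in
token = re.search(r'name="csrf_token" value="([^"]+)"', html).group(1)
print(token)
# session.post('https://example.com/handler', data={'csrf_token': token})
```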
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and consider that you might be blocked more effectively if you continue to scrape the site anyway.
Python requests.get(URL) returns 404 error when using URL with dot
Strange: it seems that if one sends the `User-Agent` header, even with an empty value, the site responds with a 200:
>>> requests.get('https://finance.yahoo.com/quote/AFLT.ME', headers={'User-Agent': ''})
<Response [200]>
Edit: The same issue was reported here: https://stackoverflow.com/a/68259438/9835872
404 status code while making HTTP request via Python's requests library. However page is loading fine in browser
The website you mentioned is checking for the `User-Agent` in the request's headers. You can fake the `User-Agent` in your request by passing a `dict` object with custom headers in your `requests.get(..)` call. That will make it look like the request is coming from an actual browser, and you'll receive the response.
For example:
>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> response = requests.get(url, headers=headers)  # request made with the "User-Agent" header
>>> response.status_code
200
>>> response.text  # the website content
Python requests.get showing 404 while URL does exist
Without even waiting for your test, I'm pretty confident I know what your bug is.
I put this url manually in function call it works fine but if I read that file and directly call function with that url, give me error. I have put 3-4 checks while reading file, url is perfectly coming form the file even I tried to print that url inside the called function and I'm receiving that url in function too. still have no clue what is happening ?
Most likely you're reading the URL with something like `for line in file:` or `file.readline`, or some other method that preserves newlines. So what you actually end up with is not this:
url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm'
… but this:
url = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'
The latter will be escaped by `requests` into something that's a perfectly good URL for a resource that doesn't exist, hence the 404 error.
The best way to check this is to print `repr(url)` instead of `print(url)`. This will also reveal other possible problems, like embedded non-printable characters. It won't find everything, such as Unicode characters that look like `.` but actually aren't, but it's a good first test. (And if that doesn't find it, for a second test, copy and paste from the output, quotes and all, into your test script.)
If this is the problem, the fix is simple:
url = url.rstrip()
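A quick way to see (and fix) the stray newline, using the URL from above as the example:

```python
# Simulate a line read from a file: the trailing newline comes along for free.
line = 'http://www.leboncoin.fr/montres_bijoux/671762293.htm\n'
print(repr(line))    # repr() exposes the '\n' that print() hides
url = line.rstrip()  # strips trailing whitespace, newline included
print(repr(url))
```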
Simple GET keeps returning 404 while in browser works perfectly
The website doesn't like HTTP(S) requests coming from Python code. By default, `requests` sets the following request headers:
{
'User-Agent': 'python-requests/2.19.1',
'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*',
'Connection': 'keep-alive'
}
If you set another, less obvious `User-Agent`, it should work fine. For example:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
result = requests.get('https://www.transfermarkt.co.uk', headers=headers)
API Requests throwing a 404 error due to bad URL, URL is being changed automatically, what is going on?
The `url` parameter in the `requests.get` call is probably incorrect; an API endpoint rarely ends in a forward slash.