UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3

Use a percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the percent-encoded URL above by pointing the browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

and then copying the encoded URL supplied by the browser back into the text editor. However, you can also generate a percent-encoded URL programmatically:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
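The snippet above only touches the path. As a minimal sketch, the same steps can be wrapped in a function that also encodes the query string. The function name percent_encode_url and the choice of safe="=&" for the query are assumptions for illustration, not part of the original answer, and the function assumes the input has not been percent-encoded already.

```python
from urllib import parse


def percent_encode_url(url):
    """Percent-encode the path and query of url; scheme, host and fragment are left alone."""
    scheme, netloc, path, query, fragment = parse.urlsplit(url)
    path = parse.quote(path)                # '/' is kept by default
    query = parse.quote(query, safe="=&")   # keep the key=value&key=value structure
    return parse.urlunsplit((scheme, netloc, path, query, fragment))


link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
print(percent_encode_url(link))
```

Because % itself is not in the safe set, calling this on a URL that is already encoded would encode it a second time.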

How to deal with the 'ascii' codec can't encode character '\xe9' error?

I think this is a solution...

The problem is that the URL you start with:

'https://www.elections.on.ca/content/dam/NGW/sitecontent/2022/results/Vote%20Totals%20From%20Official%20Tabulation%20-%20Orléans%20076.xlsx'

is already partly URL-quoted (spaces are replaced by %20) but still contains a non-ASCII character, the é in Orléans.

The solution from this question will help us, but just applying urllib.parse.quote(...) also encodes the existing % signs, so the spaces end up twice-encoded as %2520. That is why you get a 404 when requesting the processed URL.

So first we need to unquote the URL (i.e. %20 -> " "), then quote it again. This time the accented character is quoted too, the spaces are encoded only once, and it should work.

Try this:

path = urllib.parse.quote(urllib.parse.unquote(link['href']))
url = "https://www.elections.on.ca" + path

The result we get is:

https://www.elections.on.ca/content/dam/NGW/sitecontent/2022/results/Vote%20Totals%20From%20Official%20Tabulation%20-%20Orl%C3%A9ans%20076.xlsx

...should work now!
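To see the double-encoding effect in isolation, here is a small sketch. The shortened path in href and the helper name requote are made up for illustration; only urllib.parse.quote and urllib.parse.unquote come from the answer above.

```python
from urllib.parse import quote, unquote

# Hypothetical, shortened path: spaces already encoded, the accented character not.
href = "/results/Vote%20Totals%20-%20Orléans%20076.xlsx"


def requote(path):
    """Unquote first, then quote, so existing %XX escapes are not encoded twice."""
    return quote(unquote(path))


naive = quote(href)      # the '%' of each '%20' becomes '%25'
fixed = requote(href)

print(naive)
print(fixed)
```

Applying requote to its own output gives the same string again, so it is safe to run on paths that may or may not be encoded. One caveat: an escaped slash (%2F) inside a path segment would come back as a plain / after the round trip.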

UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' python 3

The problem here is not the encoding itself but passing a correctly encoded URL to urllib.request.

You need to quote the url as follows:

import urllib.request
import urllib.parse

imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]

for link in imglinks:
    link = urllib.parse.quote(link, safe=':/')  # <- here
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)

This way your © symbol is encoded as %C2%A9, as the web server expects.

The safe parameter is specified to keep quote from also encoding the : after http.

It is up to you to modify the code to save the file with the correct original filename. ;)
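One possible way to recover the original filename, as a sketch: take the last path segment of the encoded URL and unquote it. The helper name filename_from_url is hypothetical, and nothing is downloaded here.

```python
import posixpath
from urllib.parse import quote, unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of url with percent-escapes decoded."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


raw = "http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"
encoded = quote(raw, safe=':/')          # what gets passed to urlretrieve
filename = filename_from_url(encoded)    # the name with the © restored
```

The encoded URL goes to urllib.request.urlretrieve and the decoded filename is used for the local file, assuming the local filesystem accepts non-ASCII names.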

Python UnicodeEncodeError: 'ascii' codec can't encode character in position 0: ordinal not in range(128)

I will answer my own question. I found a duplicate question: stackoverflow.com/questions/9942594/

But for simplicity, here is an elegant solution that works well with my use case:

def safe_str(obj):
    try:
        return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    return ""

safe_str(u'\u2013')

Or simply use:

u'\u2013'.encode('ascii', 'ignore')
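Note that 'ignore' silently drops every character that ASCII cannot represent, and that encode returns bytes rather than str. The sketch below compares the standard error handlers on a sample string. The helper to_ascii and the sample text are illustrative only.

```python
text = "2010\u20132020 caf\u00e9"   # contains an en dash and an e with acute accent


def to_ascii(s, errors="ignore"):
    """Return an ASCII-only str; errors selects how non-ASCII characters are handled."""
    return s.encode("ascii", errors).decode("ascii")


for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    print(handler, "->", to_ascii(text, handler))
```

Which handler fits depends on whether losing characters is acceptable: 'ignore' and 'replace' lose information, while the other two keep it in an escaped form.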

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 23: ordinal not in range(128)

urllib.request expects a properly percent-encoded URL.

In this case, the properly escaped URL is:

imageUrl = 'https://www.residentadvisor.net/images/labels/3000%C2%B0records.jpg'

If you're dealing with potentially poorly encoded URLs, one library that helps you encode them properly is yelp_uri, through its yelp_uri.encoding.recode_uri function. Full disclosure: I have contributed to this library.

I used the following code to get the properly encoded url:

from yelp_uri.encoding import recode_uri
imageUrl = recode_uri(imageUrl)

Unicode String in urllib.request

Use urllib.parse.quote:

>>> urllib.parse.quote('bär')
'b%C3%A4r'

>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
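By default quote leaves / unescaped, which is fine for a whole path but not for a single segment that may itself contain a slash. A minimal sketch, assuming the token should always stay one path segment; the function name token_url and the example.com base are placeholders, not from the original answer.

```python
from urllib.parse import quote, urljoin

BASE = "https://example.com/tts/de/token/"   # placeholder base URL


def token_url(base, token):
    """Append token to base as a single, fully percent-encoded path segment."""
    return urljoin(base, quote(token, safe=""))


print(token_url(BASE, "bär"))
print(token_url(BASE, "AC/DC"))
```

With safe="" the slash in the second token is encoded as %2F instead of being treated as a path separator.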

UTF-8 encoding issue with Python 3

You are passing a string that contains non-ASCII characters to urllib.request.urlopen. Such a string isn't a valid URI (it is a valid IRI, or Internationalized Resource Identifier, though).

You need to make the IRI a valid URI before passing it to urlopen. The specifics depend on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.

Since your problem is exclusively due to the path containing Unicode characters, and assuming your IRI is stored in the variable iri, you can fix it using the following:

import urllib.parse
import urllib.request

split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2]) # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)

urllib.request.urlopen(url).read()

However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and has automatic IRI handling.
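If you stay with urllib and the domain can also contain non-ASCII characters, the two rules above can be combined. The following is a rough sketch, not a complete IRI implementation: the function name iri_to_uri is made up, any user:password part of the URL is dropped, an absolute URL is assumed, and Python's built-in "idna" codec follows the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an absolute IRI to a URI: Punycode for the host, percent-encoding for the rest."""
    parts = urlsplit(iri)
    # Encode the host label by label with the built-in IDNA codec.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path)                  # '/' is kept by default
    query = quote(parts.query, safe="=&")     # keep key=value&key=value structure
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))


print(iri_to_uri("http://bücher.example/café"))
```

The result is plain ASCII and can be passed to urllib.request.urlopen.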


