UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3
Use a percent-encoded URL:
link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'
I found the percent-encoded URL above by pointing the browser at
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
loading the page, then copying the encoded URL from the browser's address bar back into the text editor. However, you can also generate a percent-encoded URL programmatically:
from urllib import parse
link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))
which yields
http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
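The steps above can be wrapped in a small helper. This is a minimal sketch: the function name percent_encode_path is hypothetical, and it assumes the path is the only component with non-ASCII characters and contains no percent escapes yet.

```python
from urllib import parse

def percent_encode_path(url):
    """Percent-encode the path component of url, leaving the other parts untouched."""
    parts = parse.urlsplit(url)
    # quote() turns non-ASCII characters into UTF-8 percent escapes
    # and leaves '/' alone by default.
    return parse.urlunsplit(parts._replace(path=parse.quote(parts.path)))

encoded = percent_encode_path(
    'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
)
print(encoded)
```

Because urlsplit returns a named tuple, _replace swaps only the path and keeps the scheme, host, query and fragment as they were.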
How to deal with the 'ascii' codec can't encode character '\xe9' error?
I think this is a solution...
The problem is that the URL you start with:
'https://www.elections.on.ca/content/dam/NGW/sitecontent/2022/results/Vote%20Totals%20From%20Official%20Tabulation%20-%20Orléans%20076.xlsx'
is already URL-quoted (e.g. spaces are replaced by %20), but it still contains non-ASCII characters, here in Orléans.
So the solution from this question will help us, but just applying urllib.parse.quote(...) encodes the spaces twice, as %2520. That is why you get a 404 when requesting the processed URL.
So first we need to unquote the URL (i.e. %20 -> " "), then quote it again. This time the accented character will be quoted too, and it should work.
Try this:
path = urllib.parse.quote(urllib.parse.unquote(link['href']))
url = "https://www.elections.on.ca" + path
The result we get is:
https://www.elections.on.ca/content/dam/NGW/sitecontent/2022/results/Vote%20Totals%20From%20Official%20Tabulation%20-%20Orl%C3%A9ans%20076.xlsx
...should work now!
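To see the difference between the two approaches, here is a small sketch that needs no network access. The href value is an assumption: it stands in for link['href'] and is taken from the path of the URL above.

```python
from urllib.parse import quote, unquote

# Assumed stand-in for link['href']: partly quoted, but still contains 'é'
href = ('/content/dam/NGW/sitecontent/2022/results/'
        'Vote%20Totals%20From%20Official%20Tabulation%20-%20Orléans%20076.xlsx')

# Quoting directly also escapes '%' as '%25', so '%20' becomes '%2520'
double_encoded = quote(href)

# Unquoting first restores the raw characters, so each one is encoded exactly once
normalized = quote(unquote(href))

print(double_encoded)
print(normalized)
```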
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' python 3
The problem here is not the encoding itself but passing a correctly encoded URL to urllib.request.
You need to quote the URL as follows:
import urllib.request
import urllib.parse

imglinks = ["http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg"]
for link in imglinks:
    link = urllib.parse.quote(link, safe=':/')  # <- here
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)
This way your © symbol is encoded as %C2%A9, as the web server expects.
The safe parameter is specified to prevent quote from also modifying the : after http.
It is up to you to modify the code to save the file under its original filename. ;)
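One way to recover the original filename is to unquote the last path segment of the encoded link. This is only a sketch, and the helper name original_filename is hypothetical:

```python
import urllib.parse

def original_filename(url):
    """Return the last path segment of url with its percent escapes decoded."""
    path = urllib.parse.urlsplit(url).path
    return urllib.parse.unquote(path.rsplit('/', 1)[-1])

quoted = urllib.parse.quote(
    "http://www.katytrailweekly.com/Files/MalibuPokeMatt_©Marple_449-EDITED_15920174118.jpg",
    safe=':/',
)
name = original_filename(quoted)
print(name)
```

The resulting name can then be passed as the second argument to urllib.request.urlretrieve instead of the quoted filename.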
Python UnicodeEncodeError: 'ascii' codec can't encode character in position 0: ordinal not in range(128)
I will answer my own question. I found a duplicate question: stackoverflow.com/questions/9942594/
But for simplicity, here is an elegant solution that works well for my use case:
def safe_str(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    return ""

safe_str(u'\u2013')
Or simply use:
u'\u2013'.encode('ascii', 'ignore')
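Note that in Python 3 the encode call returns bytes, so a decode is needed to get a str back. The sketch below compares the standard error handlers; the sample text is made up for illustration.

```python
text = '\u2013 dash'  # hypothetical sample: an en dash followed by ASCII text

# 'ignore' silently drops characters that ASCII cannot represent
dropped = text.encode('ascii', 'ignore').decode('ascii')

# 'replace' substitutes a question mark for each such character
replaced = text.encode('ascii', 'replace').decode('ascii')

# 'backslashreplace' keeps the information as a visible escape sequence
escaped = text.encode('ascii', 'backslashreplace').decode('ascii')

print(repr(dropped), repr(replaced), repr(escaped))
```

Which handler fits depends on whether losing the character is acceptable; 'backslashreplace' is the only one of the three that lets you see what was there.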
UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 23: ordinal not in range(128)
urllib.request expects a properly URL-escaped URL.
In this case, the properly escaped url is:
imageUrl = 'https://www.residentadvisor.net/images/labels/3000%C2%B0records.jpg'
If you're dealing with potentially poorly encoded urls, one library which helps you encode them properly is yelp_uri.encoding.recode_uri. Full disclosure: I have contributed to this library.
I used the following code to get the properly encoded url:
from yelp_uri.encoding import recode_uri
imageUrl = recode_uri(imageUrl)
Unicode String in urllib.request
Use urllib.parse.quote:
>>> urllib.parse.quote('bär')
'b%C3%A4r'
>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
... urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
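By default quote leaves / unescaped, which matters if the token itself can contain a slash. A small sketch, assuming a made-up token 'AC/DC bär' that must stay a single path segment:

```python
import urllib.parse

token = 'AC/DC bär'  # hypothetical token containing a slash and a space

# The default safe='/' would keep the slash and split the token into two segments
default_quoted = urllib.parse.quote(token)

# safe='' escapes the slash as well, so the token stays one segment
segment = urllib.parse.quote(token, safe='')

url = urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/', segment)
print(url)
```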
UTF-8 encoding issue with Python 3
You are passing a string which contains non-ASCII characters to urllib.request.urlopen, which isn't a valid URI (it is a valid IRI, or Internationalized Resource Identifier, though).
You need to make the IRI a valid URI before passing it to urlopen. The specifics of this depend on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.
Since your problem is exclusively due to the path containing Unicode characters, assuming your IRI is stored in the variable iri, you can fix it using the following:
import urllib.parse
import urllib.request
split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2]) # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)
urllib.request.urlopen(url).read()
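If the domain may contain non-ASCII characters as well, both parts can be handled together. The following is a rough sketch rather than a complete IRI converter: the name iri_to_uri and the example address are hypothetical, userinfo in the netloc is ignored, and it assumes the input has no percent escapes yet (otherwise they would be encoded twice).

```python
import urllib.parse

def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (sketch; drops any userinfo)."""
    parts = urllib.parse.urlsplit(iri)
    # Python's built-in 'idna' codec produces the Punycode form of the host
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host if parts.port is None else f'{host}:{parts.port}'
    path = urllib.parse.quote(parts.path)
    # Keep '=' and '&' so the query string structure survives
    query = urllib.parse.quote(parts.query, safe='=&')
    fragment = urllib.parse.quote(parts.fragment)
    return urllib.parse.urlunsplit((parts.scheme, netloc, path, query, fragment))

uri = iri_to_uri('http://bücher.example/käse?q=é')
print(uri)
```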
However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and has automatic IRI handling.