Python - 'Ascii' Codec Can't Decode Byte

Python - 'ascii' codec can't decode byte

"你好".encode('utf-8')

encode converts a unicode object to a str object. But here you have invoked it on a str object (because you don't have the u prefix). So Python has to convert the str to a unicode object first, which means it does the equivalent of

"你好".decode().encode('utf-8')

But that implicit decode uses the default ASCII codec, and it fails because the string contains bytes that aren't valid ASCII. That's why you get a complaint about not being able to decode.
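In other words, prefixing the literal with u gives Python a unicode object to start with, so there is nothing to decode and the call succeeds. A quick Python 2 session illustrating the difference (assuming a UTF-8 terminal):

>>> u"你好".encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> "你好".encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)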

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

This should fix your problem:

full_name, email = [unicode(x, 'utf-8') for x in [full_name, email]]

logger.debug(u'__call__ with full_name={}, email={}'.format(full_name, email))

The problem was that the codec Python 2 uses for these implicit conversions defaults to ASCII, which only supports 128 characters. Decoding explicitly as UTF-8 fixes the problem.
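You can confirm that ASCII is the default in a stock Python 2 interpreter:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'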

Disclaimer: this could be wrong on specifics, as I code in Python 3 only. I learned all this in about 5 minutes.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

So, I managed to solve my issue.

  1. I figured out that active_agents.values(...."first_name", "last_name").order_by('-total_paid_transaction_value_last_month')
    retrieved a dictionary whose keys and values were already unicode (because of the way it was configured in models.py, with Django 1.11 and Python 2.7), so the serialization step was fine.
    It is indeed true that the final result that went to the template looked like 'C\xc4\x83t\xc4\x83lin'. The error came from \xc4.
  2. In order to fix it in the template, I just did this:
    {{ agent.full_name.decode("utf-8") }}, which gave me the right result: Cătălin Pintea

Thanks @BoarGules. It was true that d['last_name'] and d['first_name'] were already unicode, so when I concatenated them I had to use u" " as the separator.
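For reference, the concatenation then looks roughly like this (d stands for the per-agent dictionary mentioned above; the exact names are an assumption):

full_name = d['first_name'] + u" " + d['last_name']  # u" " keeps the whole expression unicode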

UnicodeDecodeError: 'ascii' codec can't decode byte (Microsoft API)

I can't reproduce your problem with the following (simplified but runnable) code snippet:

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The above runs fine without exception in Python 2.7.17.

However, I can reproduce the UnicodeError with the following modified version (note the u prefix before the second string literal):

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText

Or with this one:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The unicode_literals future import has the effect that all string literals in the module are treated as if you had prefixed them with u.
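You can see that effect directly in an interactive Python 2 session:

>>> from __future__ import unicode_literals
>>> type("abc")
<type 'unicode'>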

The problem here is implicit coercion:
First you encode u"波构" from type unicode to type str explicitly using UTF-8.
But then the string formatting with % coerces it back to unicode, because if one of the operands is unicode, the other one has to be too.
The literal u'{"ReferenceText":"%s"}' is unicode, and therefore Python attempts to automatically convert the value of referenceText from str to unicode as well.

Apparently, automatic conversion happens with .decode('ascii') behind the scenes, not with .decode('utf8') or some other codec.
And of course, this fails miserably:

>>> u"波构".encode('utf-8')
'\xe6\xb3\xa2\xe6\x9e\x84'
>>> u"波构".encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

One solution would be to defer manual encoding to a later stage, in order to avoid implicit coercion:

# -*- coding: utf-8 -*-
referenceText = u"波构"
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText
pronAssessmentParamsJson = pronAssessmentParamsJson.encode('utf-8')
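If you then inspect the result in the interpreter, you end up with a plain UTF-8 byte string, and no implicit ASCII decode ever happens:

>>> pronAssessmentParamsJson
'{"ReferenceText":"\xe6\xb3\xa2\xe6\x9e\x84"}'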

However, since you are obviously trying to serialise JSON, you should really be doing this:

>>> import json
>>> json.dumps({'ReferenceText': u"波构"})
'{"ReferenceText": "\\u6ce2\\u6784"}'

Otherwise you'll soon run into trouble if referenceText contains e.g. quotes or newline characters.
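A quick illustration of that point (the sample text is made up): json.dumps escapes such characters for you, whereas the manual % formatting would splice them in verbatim and produce invalid JSON.

>>> json.dumps({'ReferenceText': u'say "hi"\nplease'})
'{"ReferenceText": "say \\"hi\\"\\nplease"}'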

How to fix UnicodeDecodeError: 'ascii' codec can't decode byte?

I finally fixed my code. I am surprised by how easy the fix looks, but it took me a long time to get there, and I saw so many people puzzled by the same problem that I decided to post my answer.

Adding this small function before passing names for further cleaning solved my problem.

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

SpaCy still thinks that £59bn is a PERSON, but that's OK with me; I can deal with it later in my code.
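As a side note on errors='ignore': with no explicit encoding, unicode() falls back to ASCII and simply drops the bytes it cannot decode, which is why the £ sign disappears from the cleaned list further down:

>>> unicode('\xc2\xa359bn', errors='ignore')
u'59bn'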

The working code:

from __future__ import unicode_literals  # must come before any other statement

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip ' and s characters from both ends of each name
    myset = list(set(new_names))  # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__':
    main()

which gives me this with no errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

Querying DB in Python3 throws ascii codec can't decode byte error

To those in the future who are stuck on the same issue I had: the database I was connecting to was in fact using SQL_ASCII, and I had missed one step when setting up my connection in order to use UTF-8 in my environment. Below is the one line added after the connection is made:

try:
    conn = psycopg2.connect(host=db_host, user=db_username, password=db_password, database=db_name)
    conn.set_client_encoding("utf-8")  # the added line: make the client use UTF-8
except psycopg2.Error:
    raise  # handle connection errors however is appropriate for your application

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

You are encoding to UTF-8, then re-encoding to UTF-8. Python can only do this if it first decodes again to Unicode, but it has to use the default ASCII codec:

>>> u'ñ'
u'\xf1'
>>> u'ñ'.encode('utf8')
'\xc3\xb1'
>>> u'ñ'.encode('utf8').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Don't keep encoding; defer encoding to UTF-8 to the last possible moment. In the meantime, concatenate Unicode values instead.

You can use str.join() (or, rather, unicode.join()) here to concatenate the three values with dashes in between:

nombre = u'-'.join([fabrica, sector, unidad])
return nombre.encode('utf-8')

but even encoding here might be too early.

Rule of thumb: decode the moment you receive the value (if not Unicode values supplied by an API already), encode only when you have to (if the destination API does not handle Unicode values directly).
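A minimal sketch of that rule of thumb, with made-up variable names and sample bytes (Python 2):

raw = 'f\xc3\xa1brica'                           # bytes as received from some external source
fabrica = raw.decode('utf-8')                    # decode the moment you receive the value
sector = u'norte'                                # values created inside the program stay unicode
unidad = u'7'
nombre = u'-'.join([fabrica, sector, unidad])    # work purely with unicode in between
salida = nombre.encode('utf-8')                  # encode only when a byte-oriented API requires it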


