How to Fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

How to fix UnicodeDecodeError: 'ascii' codec can't decode byte?

I finally fixed my code. I am surprised how simple it looks, but it took me so long to get there, and I saw so many people puzzled by the same problem that I decided to post my answer.

Adding this small function before passing the names on for further cleaning solved my problem.

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

spaCy still thinks that £59bn is a PERSON, but that's OK with me; I can deal with it later in my code.
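The reason the £ disappears from £59bn in the cleaned list is that errors='ignore' simply drops the bytes it cannot decode as ASCII. A quick Python 2 illustration, using the str that spaCy returned for that entity:

>>> name = '\xc2\xa359bn'            # UTF-8 bytes for u'£59bn'
>>> unicode(name, errors='ignore')   # default codec is ASCII; undecodable bytes are dropped
u'59bn'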

The working code:

from __future__ import unicode_literals  # must come before any other import

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True)
                    for s in soup.find_all('div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip ' and s characters from both ends
    myset = list(set(new_names))                # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__':
    main()

which gives me this with no errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

UnicodeDecodeError: 'ascii' codec can't decode byte (Microsoft API)

I can't reproduce your problem with the following (simplified but runnable) code snippet:

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The above runs fine without exception in Python 2.7.17.

However, I can reproduce the UnicodeError with the following modified version (note the u prefix before the second string literal):

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText

Or with this one:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The unicode_literals directive has the effect that all string literals are treated as if you prefixed them with u.
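You can see the effect in a Python 2 session:

>>> type('abc')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('abc')
<type 'unicode'>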

The problem here is implicit coercion:
First you encode u"波构" from type unicode to type str explicitly using UTF-8.
But then the string formatting with % coerces it back to unicode, because if one of the operands is unicode, the other one has to be too.
The literal u'{"ReferenceText":"%s"}' is unicode, and therefore Python attempts to automatically convert the value of referenceText from str to unicode as well.

Apparently, automatic conversion happens with .decode('ascii') behind the scenes, not with .decode('utf8') or some other codec.
And of course, this fails miserably:

>>> u"波构".encode('utf-8')
'\xe6\xb3\xa2\xe6\x9e\x84'
>>> u"波构".encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

One solution would be to defer manual encoding to a later stage, in order to avoid implicit coercion:

# -*- coding: utf-8 -*-
referenceText = u"波构"
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText
pronAssessmentParamsJson = pronAssessmentParamsJson.encode('utf-8')

However, since you are obviously trying to serialise JSON, you should really be doing this:

>>> import json
>>> json.dumps({'ReferenceText': u"波构"})
'{"ReferenceText": "\\u6ce2\\u6784"}'

Otherwise you'll soon run into trouble if referenceText contains e.g. quotes or newline characters.
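For instance, json.dumps takes care of the escaping that the %-formatting approach would get wrong (the value here is just a made-up example):

>>> json.dumps({'ReferenceText': u'He said "hi"\nBye'})
'{"ReferenceText": "He said \\"hi\\"\\nBye"}'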

How to fix UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 when importing matplotlib.pyplot?

Make sure your filename and the folders on its path do not contain any non-ASCII characters; that will usually clear the error. (This problem does not come up often, and the matplotlib team is currently focused on fixing bugs for Python 3 only, since Python 2 will soon be deprecated.) If that doesn't help, as a last resort you can try adding:

import sys
reload(sys)                      # re-exposes setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')

import matplotlib.pyplot as plt
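If you are unsure whether a path is the culprit, here is a minimal Python 2 sketch (assuming the code is run as a script file) that checks the working directory and the script's own path for non-ASCII bytes:

import os

for path in (os.getcwd(), os.path.abspath(__file__)):
    try:
        path.decode('ascii')  # raises UnicodeDecodeError if any non-ASCII byte is present
    except UnicodeDecodeError:
        print "non-ASCII characters in:", path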

How to interpret this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 164: ordinal not in range(128)

I found a way to solve this: open the file with an explicit encoding.

f = open(file, encoding='utf-8', mode='r+')   # for reading and writing

or

f = open(file, encoding='utf-8', mode='w')    # for writing only

It worked.
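Note that the encoding argument of the built-in open() only exists in Python 3. On Python 2 the equivalent is io.open(); a minimal sketch (the filename is just a placeholder):

import io

f = io.open('yourfile.txt', encoding='utf-8', mode='r')
text = f.read()   # unicode, already decoded from UTF-8
f.close()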

UnicodeDecodeError: 'ascii' codec can't decode byte in Textranking code

Please try the following and see if it works for you.

import networkx as nx
import numpy as np
import sys

reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
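If you would rather avoid the reload(sys) / setdefaultencoding() hack, a sketch of an alternative (assuming the "QC" file is UTF-8 encoded) is to decode the text once while reading and pass unicode into textrank():

import io

with io.open("QC", encoding="utf-8") as fp:
    txt = fp.read()       # unicode, already decoded
sents = textrank(txt)
print sents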

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 7: ordinal not in range(128)

I suspect that your temporary directory contains a non-ASCII character (one outside the 7-bit range) somewhere in the actual path.

If you are on Windows, make sure that the environment variable %TEMP% is set to a directory whose full path only uses characters in the ranges A-Z, a-z and 0-9: no accented characters, symbols, etc.
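To check quickly whether this is the problem, here is a minimal Python 2 sketch that prints any non-ASCII characters in the temporary directory path:

import tempfile

temp_dir = tempfile.gettempdir()   # honours %TEMP% on Windows
print temp_dir
print "non-ASCII characters:", [c for c in temp_dir if ord(c) > 127]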

Note

On Windows 10 and Python 2.7.13 I have just tested and got:

> pip install django
Collecting django
Downloading Django-1.11.6-py2.py3-none-any.whl (6.9MB)
100% |################################| 7.0MB 152kB/s
Requirement already satisfied: pytz in c:\python27\lib\site-packages (from django)
Installing collected packages: django
Successfully installed django-1.11.6
> pip --version
pip 9.0.1 from c:\python27\lib\site-packages (python 2.7)

BUT

> mkdir Témp
> set TEMP=.\Témp
> pip install django
Collecting django
Exception:
Traceback (most recent call last):
File "c:\python27\lib\site-packages\pip\basecommand.py", line 215, in main
status = self.run(options, args)
File "c:\python27\lib\site-packages\pip\commands\install.py", line 335, in run
wb.build(autobuilding=True)
File "c:\python27\lib\site-packages\pip\wheel.py", line 749, in build
self.requirement_set.prepare_files(self.finder)
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 380, in prepare_files
ignore_dependencies=self.ignore_dependencies))
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 620, in _prepare_file
session=self.session, hashes=hashes)
File "c:\python27\lib\site-packages\pip\download.py", line 821, in unpack_url
hashes=hashes
File "c:\python27\lib\site-packages\pip\download.py", line 659, in unpack_http_url
hashes)
File "c:\python27\lib\site-packages\pip\download.py", line 880, in _download_http_url
file_path = os.path.join(temp_dir, filename)
File "c:\python27\lib\ntpath.py", line 85, in join
result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)

