How to fix UnicodeDecodeError: 'ascii' codec can't decode byte?
I finally fixed my code. I am surprised how simple the fix looks, but it took me a long time to get there, and I saw so many people puzzled by the same problem that I decided to post my answer.
Adding this small function before passing the names on for further cleaning solved my problem:
def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames
SpaCy still thinks that £59bn is a PERSON, but that's OK with me; I can deal with it later in my code.
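The reason the £ sign disappears is that errors='ignore' silently drops every byte the ascii codec cannot decode. A quick sketch of the same behaviour, shown here with Python 3's bytes.decode, which is the closest equivalent of Python 2's unicode(name, errors='ignore'):

```python
# b'\xc2\xa3' is the UTF-8 encoding of the pound sign
raw = b'\xc2\xa359bn'

# Decoding as ASCII with errors='ignore' silently drops the two
# non-ASCII bytes, so the pound sign vanishes entirely.
print(raw.decode('ascii', errors='ignore'))  # -> 59bn

# Decoding with the correct codec keeps the character.
print(raw.decode('utf-8'))  # -> £59bn
```

This is why decoding with the right codec (here UTF-8) is usually preferable to errors='ignore': the latter destroys data instead of translating it.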
The working code:
from __future__ import unicode_literals
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()
    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True)
                    for s in soup.find_all('div', {'class': 'story-body__inner'})]
    text = ''.join(article_soup)
    return text
# using spacy
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip leading/trailing ' and s characters
    myset = list(set(new_names))  # remove duplicates
    return myset
def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__':
    main()
which gives me this with no errors:
names: ['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May',
'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr
Clegg', 'Theresa May'] [u'Mr Clegg', u'Brexit', u'Nick Clegg',
u'59bn', u'Theresa May']
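One caveat with the cleaning step above: str.strip("'s") treats its argument as a set of characters to remove from both ends, not as a literal suffix, so names that end in s get truncated. A small sketch of the difference (strip_possessive is a hypothetical helper, not part of the original code):

```python
# strip("'s") removes any leading/trailing ' and s characters,
# so a trailing "s" is stripped even without an apostrophe.
print("Boris's".strip("'s"))  # -> Bori  (the final s of "Boris" is lost)

# A hypothetical suffix-aware alternative:
def strip_possessive(name):
    # remove a literal trailing 's, and nothing else
    return name[:-2] if name.endswith("'s") else name

print(strip_possessive("Boris's"))      # -> Boris
print(strip_possessive("Theresa May"))  # -> Theresa May (unchanged)
```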
UnicodeDecodeError: 'ascii' codec can't decode byte (microsoft API)
I can't reproduce your problem with the following (simplified but runnable) code snippet:
# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText
The above runs fine without exception in Python 2.7.17.
However, I can reproduce the UnicodeDecodeError with the following modified version (note the u prefix on the second string literal):
# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText
Or with this one:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText
The unicode_literals directive has the effect that all string literals are treated as if you had prefixed them with u.
The problem here is implicit coercion: first you encode u"波构" from type unicode to type str explicitly, using UTF-8. But then the string formatting with % coerces it back to unicode, because if one of the operands is unicode, the other one has to be too. The literal u'{"ReferenceText":"%s"}' is unicode, and therefore Python attempts to automatically convert the value of referenceText from str back to unicode as well. Apparently, this automatic conversion happens with .decode('ascii') behind the scenes, not with .decode('utf8') or some other codec.
And of course, this fails miserably:
>>> u"波构".encode('utf-8')
'\xe6\xb3\xa2\xe6\x9e\x84'
>>> u"波构".encode('utf-8').decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
One solution would be to defer manual encoding to a later stage, in order to avoid implicit coercion:
# -*- coding: utf-8 -*-
referenceText = u"波构"
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText
pronAssessmentParamsJson = pronAssessmentParamsJson.encode('utf-8')
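As an aside, Python 3 avoids this particular trap by never implicitly decoding: interpolating a bytes value into a str simply embeds its repr, and concatenating the two raises a TypeError outright, so the mistake surfaces immediately instead of failing only on non-ASCII input. A quick sketch of the Python 3 behaviour:

```python
# Python 3: %-formatting a bytes object into a str does not decode it;
# it embeds the repr of the bytes, which is almost never what you want.
encoded = "波构".encode('utf-8')
result = '{"ReferenceText":"%s"}' % encoded
print(result)  # contains b'...' rather than the decoded text

# Concatenating str and bytes fails loudly instead:
try:
    '{"ReferenceText":"' + encoded + '"}'
except TypeError as exc:
    print("TypeError:", exc)
```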
However, since you are obviously trying to serialise JSON, you should really be doing this:
>>> import json
>>> json.dumps({'ReferenceText': u"波构"})
'{"ReferenceText": "\\u6ce2\\u6784"}'
Otherwise you'll soon run into trouble if referenceText contains e.g. quotes or newline characters.
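To illustrate that last point: manual %-formatting produces broken JSON as soon as the value contains a quote, while json.dumps escapes it correctly (the sample value here is made up):

```python
import json

value = 'he said "hi"'

# Manual interpolation yields syntactically invalid JSON,
# because the inner quotes are not escaped:
broken = '{"ReferenceText":"%s"}' % value
# json.loads(broken) would raise a ValueError here.

# json.dumps escapes the inner quotes properly:
ok = json.dumps({'ReferenceText': value})
print(ok)
print(json.loads(ok)['ReferenceText'] == value)  # -> True
```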
How to fix UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 when importing matplotlib.pyplot?
Make sure your filename and folder path do not contain any non-ASCII characters; that usually clears the error. Note that the matplotlib team is currently focused on fixing bugs in Python 3 only, as Python 2 will soon be deprecated. If the above doesn't work, as a last resort you can try adding:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import matplotlib.pyplot as plt
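A quick way to check whether a path is the culprit is to look for non-ASCII characters in it (find_non_ascii is a hypothetical helper, for illustration only):

```python
def find_non_ascii(path):
    # return every character in the path outside the 7-bit ASCII range
    return [c for c in path if ord(c) > 127]

print(find_non_ascii('C:\\Users\\kevin\\plots'))  # -> [] (safe)
print(find_non_ascii('C:\\Users\\José\\plots'))   # -> ['é']
```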
How to interpret this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 164: ordinal not in range(128)
I found a way to solve this: open the file with an explicit encoding:
f = open(file, encoding='utf-8', mode="r+")
f = open(file, encoding='utf-8', mode="w")
It worked.
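The reason this works: without the encoding argument, Python falls back on a locale-dependent (often ASCII) default codec, which fails on the first non-ASCII byte; passing encoding='utf-8' makes the decoding explicit. A self-contained sketch using a temporary file (note that in Python 2 you would need io.open for the encoding parameter):

```python
import os
import tempfile

# Write some UTF-8 text (including a non-ASCII character) to a temp file.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('café, 3 visitors')

# Reading it back with an explicit encoding decodes it correctly.
with open(path, encoding='utf-8', mode='r') as f:
    print(f.read())  # -> café, 3 visitors
```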
UnicodeDecodeError: 'ascii' codec can't decode byte in Textranking code
Please see if the following works for you.
import networkx as nx
import numpy as np
import sys
reload(sys)
sys.setdefaultencoding('utf8')
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
fp = open("QC")
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
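The core of the TextRank code above is: turn sentences into tf-idf vectors, multiply the matrix by its own transpose to get a sentence-similarity graph, then run PageRank on that graph to score sentences by centrality. A minimal numpy-only sketch of the PageRank step, using a made-up 3x3 similarity matrix in place of normalized * normalized.T:

```python
import numpy as np

# Hypothetical symmetric similarity matrix for 3 sentences
# (in the answer above this comes from normalized * normalized.T).
S = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.4],
              [0.2, 0.4, 0.0]])

def pagerank(S, damping=0.85, iters=100):
    n = S.shape[0]
    # column-normalize so each column sums to 1 (transition probabilities)
    M = S / S.sum(axis=0)
    scores = np.full(n, 1.0 / n)
    # standard power iteration with a damping factor
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (M @ scores)
    return scores

scores = pagerank(S)
print(scores)           # one score per sentence, summing to ~1
print(scores.argmax())  # -> 1 (sentence 1 is most central here)
```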
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 7: ordinal not in range(128)
I suspect that your temporary directory contains a non-ASCII character in its actual path.
If you are on Windows, make sure that the environment variable %TEMP% is set to a directory whose full path only uses characters in the ranges A-Za-z0-9 (no accented characters, symbols, etc.).
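You can verify which directory is being used with Python's tempfile module, which honours %TEMP% on Windows; has_non_ascii below is a hypothetical helper for illustration:

```python
import tempfile

def has_non_ascii(path):
    # True if any character in the path is outside the 7-bit ASCII range
    return any(ord(c) > 127 for c in path)

print(tempfile.gettempdir())            # the directory pip would use
print(has_non_ascii('C:\\Users\\Témp'))  # -> True ('é' is non-ASCII)
print(has_non_ascii('C:\\Temp'))         # -> False
```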
Note
On windows 10 and python 2.7.13 I have just tested and got:
> pip install django
Collecting django
Downloading Django-1.11.6-py2.py3-none-any.whl (6.9MB)
100% |################################| 7.0MB 152kB/s
Requirement already satisfied: pytz in c:\python27\lib\site-packages (from django)
Installing collected packages: django
Successfully installed django-1.11.6
> pip --version
pip 9.0.1 from c:\python27\lib\site-packages (python 2.7)
BUT
> mkdir Témp
> set TEMP=.\Témp
> pip install django
Collecting django
Exception:
Traceback (most recent call last):
File "c:\python27\lib\site-packages\pip\basecommand.py", line 215, in main
status = self.run(options, args)
File "c:\python27\lib\site-packages\pip\commands\install.py", line 335, in run
wb.build(autobuilding=True)
File "c:\python27\lib\site-packages\pip\wheel.py", line 749, in build
self.requirement_set.prepare_files(self.finder)
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 380, in prepare_files
ignore_dependencies=self.ignore_dependencies))
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 620, in _prepare_file
session=self.session, hashes=hashes)
File "c:\python27\lib\site-packages\pip\download.py", line 821, in unpack_url
hashes=hashes
File "c:\python27\lib\site-packages\pip\download.py", line 659, in unpack_http_url
hashes)
File "c:\python27\lib\site-packages\pip\download.py", line 880, in _download_http_url
file_path = os.path.join(temp_dir, filename)
File "c:\python27\lib\ntpath.py", line 85, in join
result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)