UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
The error occurs because the dictionary contains a non-ASCII character that cannot be encoded or decoded. One simple way to avoid it is to encode such strings with the encode() function, as follows (where a is the string containing the non-ASCII character):
a.encode('utf-8').strip()
error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Python tries to convert a byte array (a bytes object, which it assumes to be a UTF-8-encoded string) to a Unicode string (str). This is, of course, a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it along these lines:
with open(path, 'rb') as f:
    contents = f.read()
The b in the mode specifier passed to open() states that the file shall be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
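As a minimal sketch of the difference, the snippet below writes a small sample file whose first byte is 0xff, reads it back in binary mode, and only then tries to decode it. The sample content (a UTF-16 little-endian byte order mark followed by "hi") and all variable names are assumptions made for illustration; the file from the original question is unknown.

```python
import os
import tempfile

# Hypothetical sample: 0xff 0xfe is the UTF-16 little-endian byte order mark,
# which is one way a file can end up starting with byte 0xff.
sample = b"\xff\xfe" + "hi".encode("utf-16-le")

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample)

# Binary mode: the result is bytes and no decoding is attempted.
with open(path, "rb") as f:
    contents = f.read()
os.remove(path)

# Decoding is now an explicit step that can be handled or retried.
utf8_ok = True
failing_position = None
try:
    contents.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_ok = False
    failing_position = exc.start  # index of the first offending byte

# Works here only because we know how the sample was built.
decoded = contents.decode("utf-16")
```

Reading as bytes never raises a decoding error; the decode step is where you choose, or discover, the right encoding.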
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file
It seems the file is not encoded in UTF-8. Could you try opening the file using io.open with latin-1 encoding instead?
from textblob import TextBlob
import io
# dummy variables initialization
pos_correct = 0
pos_count = 0
with io.open("positive.txt", encoding='latin-1') as f:
    for line in f.read().split('\n'):
        analysis = TextBlob(line)
        if analysis.sentiment.polarity > 0:
            pos_correct += 1
        pos_count += 1
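To see why latin-1 makes the error go away, here is a small sketch on a hypothetical byte string containing 0x97 (the sample text is an assumption, not taken from positive.txt). latin-1 maps every byte value to a code point, so it never raises, but that does not guarantee the text comes out as intended: if the file was actually written as windows-1252, the same byte would be a dash.

```python
raw = b"good \x97 very good"  # hypothetical line containing byte 0x97

utf8_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    utf8_failed = True  # 0x97 cannot start a UTF-8 sequence

as_latin1 = raw.decode("latin-1")  # 0x97 becomes the control character U+0097
as_cp1252 = raw.decode("cp1252")   # 0x97 becomes U+2014, a dash

# latin-1 accepts all 256 byte values, which is why it never raises.
all_bytes = bytes(range(256)).decode("latin-1")
```

If the decoded text shows odd invisible characters where punctuation should be, trying cp1252 instead of latin-1 may be worth it.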
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information
The error is happening because line is a byte string and you're calling encode() on it. encode() only makes sense if the string is Unicode, so Python first tries to convert it to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is neither valid ASCII nor a valid UTF-8 start byte, so the conversion fails.
0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode String types are a handy Python feature that allows you to decode encoded Strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes: no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.
import io
import json
with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)
The io module also provides universal newlines. This means \r\n sequences are detected as newlines, so you don't have to watch for them.
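A short sketch of both points, written for Python 3 (where io.open is the same function as the built-in open). The file below is a made-up two-line sample containing byte 0x80 and Windows line endings, not the Twitter data from the question.

```python
import io
import os
import tempfile

# Hypothetical sample: byte 0x80 plus \r\n line endings, written as raw bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x80 5\r\nok\r\n")

# Text mode decodes while reading and translates \r\n to \n.
with io.open(path, "r", encoding="windows-1252") as f:
    lines = list(f)
os.remove(path)

# lines[0] starts with U+20AC (the euro sign) and ends with a plain "\n".
first_char = lines[0][0]
```

No call to .decode() is needed anywhere: every line that comes out of the file object is already text.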
Error reading file -- 'utf' can't decode byte 0xff in position 45: invalid start byte
This code assumes that a single send in the sender matches a single recv in the recipient. This assumption is wrong for TCP: TCP is only an unstructured byte stream, not a structured message transport that would preserve message boundaries across send/recv.
This means that the initial data = s.recv(1024) in the recipient might include not only the filename but also parts of the music file. It is then a mix of the UTF-8-encoded filename (multi-byte characters) followed by the binary music data (bytes). Calling filename = data.decode() on this will successfully decode the initial filename, but it will continue decoding past the end of the filename and thus treat the binary music data as UTF-8-encoded multi-byte characters too. This leads to the observed decoding error.
The fix is to clearly mark where the filename ends and the binary data starts, then decode only the filename as text and treat the rest as bytes. A common approach is to prefix the filename with its length so that it is clear where it ends. Another approach is to add a \0 at the end of the filename (a zero byte never appears in UTF-8-encoded text except as the NUL character itself, which is invalid in filenames) and split the incoming data on this delimiter.
Apart from that, the later data.decode() when reading the music data is plain wrong, since there is no matching encode() on the sender side. And there should not be one, since this is binary data, i.e. already bytes.
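The length-prefix approach described above could look roughly like the sketch below. The 4-byte big-endian length header and the helper names (frame_filename, recv_exact, recv_filename) are assumptions chosen for illustration, not part of the original code; sender and recipient would have to agree on the same framing.

```python
import socket
import struct


def frame_filename(filename: str) -> bytes:
    """Prefix the UTF-8 filename with its byte length (4 bytes, big-endian)."""
    encoded = filename.encode("utf-8")
    return struct.pack("!I", len(encoded)) + encoded


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived.

    TCP may split or merge sends, so one recv() is never enough on its own.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_filename(sock: socket.socket) -> str:
    """Read the length header, then decode only the filename bytes as text."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length).decode("utf-8")
```

The sender would call sock.sendall(frame_filename(name)) and then send the file bytes unmodified. The recipient calls recv_filename(sock) first and treats everything after it as raw bytes, never calling decode() on it.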
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c
http://docs.python.org/howto/unicode.html#the-unicode-type
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
Note: errors='ignore' strips out the characters in question and returns the string without them, while errors='replace' puts U+FFFD (the replacement character) in their place.
For me this is the ideal case, since I use it as protection against non-ASCII input, which my application does not allow.
Alternatively, use the open method from the codecs module to read in the file:
import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    contents = fdata.read()  # bytes that are not valid UTF-8 are dropped
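In Python 3 the same error handlers are available on bytes.decode() and on the built-in open(). A small sketch with a made-up byte string containing 0x9c (the sample text is an assumption for illustration):

```python
raw = b"caf\x9c latte"  # hypothetical input with one invalid UTF-8 byte

replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped

strict_failed = False
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    strict_failed = True
```

Both handlers lose information, so they fit best where, as in the case above, the undecodable input is unwanted anyway.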