UnicodeDecodeError: 'utf8' Codec Can't Decode Byte 0xa5 in Position 0: Invalid Start Byte

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

The error occurs because the dictionary contains a non-ASCII character that cannot be encoded or decoded. One simple way to avoid it is to encode such strings with the encode() function, as follows (where a is the string containing the non-ASCII character):

a.encode('utf-8').strip()

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Python tries to convert a byte sequence (a bytes object that it assumes holds UTF-8-encoded text) to a Unicode string (str). This conversion is a decoding step that follows UTF-8 rules. While decoding, it encounters a byte that is not allowed in UTF-8-encoded text (namely 0xff at position 0).

Since you did not provide any code, we can only guess at the rest.

From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it along these lines:

with open(path, 'rb') as f:
    contents = f.read()

The b in the mode argument of open() means the file is treated as binary, so contents stays a bytes object. No decoding is attempted this way.
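As a minimal sketch of the difference, assuming Python 3 and a throwaway temporary file (the sample bytes and variable names are made up for illustration), a binary-mode read succeeds where a UTF-8 text-mode read of the same file fails:

```python
import os
import tempfile

# Hypothetical sample data: 0xff can never start a UTF-8 sequence.
raw = b"\xff\xfehello"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode with UTF-8 decoding fails on the very first byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode returns the bytes untouched; no decoding is attempted.
    with open(path, "rb") as f:
        contents = f.read()
finally:
    os.remove(path)
```

Reading as bytes only postpones the question: if the data is text, it still has to be decoded later with the encoding it was actually written in.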

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file

It seems the file is not encoded in UTF-8. Could you try opening the file using io.open with latin-1 encoding instead?

from textblob import TextBlob
import io

# dummy variables initialization
pos_correct = 0
pos_count = 0

with io.open("positive.txt", encoding='latin-1') as f:
    for line in f.read().split('\n'):
        analysis = TextBlob(line)
        if analysis.sentiment.polarity > 0:
            pos_correct += 1
        pos_count += 1
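One reason latin-1 makes the error go away is that it assigns a character to every possible byte value, so decoding cannot fail. That does not guarantee the result is what the author of the file intended. A small sketch, assuming Python 3 and made-up sample bytes:

```python
# Hypothetical bytes that are not valid UTF-8.
data = b"caf\xe9 \x97 bar"

# latin-1 maps every byte 0x00-0xff to the code point with the same value,
# so this call never raises, whatever the input.
as_latin1 = data.decode("latin-1")

# cp1252 interprets 0x97 as an em dash (U+2014); latin-1 yields the
# invisible C1 control character U+0097 for the same byte.
as_cp1252 = data.decode("cp1252")

# The same bytes are rejected by the UTF-8 codec.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

If the file came from a Windows program, cp1252 may be the closer guess; the two encodings differ only in the range 0x80 to 0x9f.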

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information.

The error happens because line is a byte string (str) and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert it to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is not valid ASCII and not a valid UTF-8 start byte, so the conversion fails.

0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
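To check how a single byte is treated by different codecs, something like the following can help. It is only a sketch for Python 3, and try_decode is a hypothetical helper name, not part of any library:

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

byte = b"\x80"

# Decode the same byte with three codecs and collect the outcomes.
results = {enc: try_decode(byte, enc) for enc in ("ascii", "utf-8", "cp1252")}
```

Here results maps both "ascii" and "utf-8" to None, while "cp1252" yields the euro sign (U+20AC).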

The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that lets you decode encoded strings once and forget about the encoding until you need to write or transmit the data.

Use the io module to open the file in text mode and decode the file as it is read. No more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

import io
import json

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)

The io module also provides universal newlines. This means \r\n sequences are detected as newlines, so you don't have to watch for them.
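A short sketch of both behaviours together, assuming Python 3 (where io.open is the same function as the built-in open) and a temporary file with made-up contents:

```python
import io
import os
import tempfile

# Hypothetical sample: two lines with Windows line endings, cp1252-encoded.
raw = "caf\u00e9\r\nna\u00efve\r\n".encode("windows-1252")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline=None (the default) enables universal newlines in text mode,
    # so each \r\n is translated to \n while the bytes are decoded.
    with io.open(path, "r", encoding="windows-1252") as f:
        lines = list(f)
finally:
    os.remove(path)
```

Each element of lines is already a decoded string ending in a plain \n, so no separate .decode() or \r handling is needed.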

Error reading file -- 'utf' can't decode byte 0xff in position 45: invalid start byte

This code is assuming that a single send in the sender matches a single recv in the recipient. This assumption is wrong for TCP: TCP is only an unstructured byte stream and not a structured message transport which would preserve message boundaries over send/recv.

This means that the initial data = s.recv(1024) in the recipient might include not only the filename but also parts of the music file. The buffer is then a mix of the UTF-8-encoded filename (possibly with multi-byte characters) followed by binary music data (bytes). Calling filename = data.decode() on it decodes the filename correctly at first, but it keeps going past the end of the filename and treats the binary music data as UTF-8 text too. This leads to the observed decoding error.

The fix is to clearly mark where the filename ends and the binary data starts, then decode only the filename as text and treat the rest as bytes. A common approach is to prefix the filename with its length so that it is clear where it ends. Another approach is to add a \0 at the end of the filename (a zero byte never occurs in UTF-8-encoded text except as the NUL character itself, which is invalid in filenames) and split the incoming data on this delimiter.

Apart from that, the later data.decode() when reading the music data is plain wrong, since there is no matching encode() on the sender side. And there should not be one, since this is binary data, i.e. already bytes.
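The length-prefix approach described above could be sketched as follows. This is an illustration under assumptions, not the original poster's code: the function names, the 4-byte big-endian length field, and the convention that the sender closes the connection to mark the end of the file are all choices made for this example.

```python
import socket
import struct

def send_file(sock, filename, payload):
    """Send a 4-byte big-endian length, the UTF-8 filename, then the raw bytes."""
    name = filename.encode("utf-8")
    sock.sendall(struct.pack("!I", len(name)) + name + payload)

def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer than requested."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before %d bytes arrived" % n)
        buf += chunk
    return buf

def recv_file(sock):
    """Decode only the filename as text; return the file data as bytes."""
    (name_len,) = struct.unpack("!I", recv_exact(sock, 4))
    filename = recv_exact(sock, name_len).decode("utf-8")
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # sender closed the connection: end of file data
            break
        chunks.append(chunk)
    return filename, b"".join(chunks)

# Local demonstration over a connected socket pair (made-up name and data).
a, b = socket.socketpair()
send_file(a, "s\u00f6ng.mp3", b"\xff\xfb\x90\x00 not text")
a.close()
received_name, received_data = recv_file(b)
b.close()
```

Only the bytes covered by the length prefix are ever passed to decode(); the music data never goes through a text codec on either side.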

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: The second form strips out (ignores) the characters in question and returns the string without them.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which my application does not allow.
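The unicode() built-in exists only in Python 2. A rough Python 3 counterpart, shown here as a sketch with made-up input bytes, passes the same errors argument to bytes.decode():

```python
# Hypothetical input; 0x9c is not a valid UTF-8 start byte.
data = b"abc\x9cdef"

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
replaced = data.decode("utf-8", errors="replace")

# 'ignore' silently drops the bad byte.
ignored = data.decode("utf-8", errors="ignore")
```

Both handlers lose information, so they fit best where the offending bytes are unwanted anyway, as in the use case described above.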

Alternatively, use the open function from the codecs module to read the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    # Undecodable bytes are dropped while reading.
    contents = fdata.read()

