Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Python is trying to convert a byte sequence (a bytes object that it assumes holds UTF-8-encoded text) into a Unicode string (str). That conversion is a decoding according to UTF-8 rules. During decoding it encounters a byte sequence that is not allowed in UTF-8-encoded data, namely the 0xff at position 0.

Since you did not provide any code to look at, we can only guess at the rest.

From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it along these lines:

with open(path, 'rb') as f:
    contents = f.read()

The b in the mode argument of open() says the file is to be treated as binary, so contents remains a bytes object. No decoding is attempted this way.
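To make the difference concrete, here is a minimal sketch. It assumes a throwaway file (the name sample.bin and its contents are invented for the demonstration) whose first byte is 0xff, and compares a text-mode read with a binary-mode read.

```python
import os
import tempfile

# Hypothetical demo file whose first byte is 0xff, which is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"\xff\xfehello")

# Text mode with UTF-8 decoding fails on the very first byte.
text_mode_failed = False
failing_position = None
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    text_mode_failed = True
    failing_position = exc.start  # index of the offending byte

# Binary mode performs no decoding, so the read succeeds and yields bytes.
with open(path, "rb") as f:
    contents = f.read()

print(text_mode_failed, failing_position, type(contents).__name__)
```

Reading as bytes only postpones the question of which encoding the data is in; if you need text later, you still have to call contents.decode() with the right codec.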

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

The error occurs because the dictionary contains a non-ASCII character that cannot be encoded or decoded. One simple way to avoid it is to encode such strings with the encode() method, as follows (where a is the string containing the non-ASCII character):

a.encode('utf-8').strip()
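For orientation, a small Python 3 sketch of the two directions involved. The sample values are made up: encode() turns text into bytes, decode() turns bytes back into text, and the errors argument of decode() controls what happens when a byte such as 0xa5 is not valid UTF-8.

```python
a = "niño"                            # text containing a non-ASCII character
encoded = a.encode("utf-8")           # str -> bytes
roundtrip = encoded.decode("utf-8")   # bytes -> str; works because the bytes are valid UTF-8

# Bytes that are not valid UTF-8 raise UnicodeDecodeError by default.
# With errors="replace", each undecodable byte becomes U+FFFD instead.
lenient = b"\xa5 100".decode("utf-8", errors="replace")

print(encoded, roundtrip, lenient)
```

Replacing bad bytes hides the problem rather than solving it, so this is mainly useful for inspecting data whose encoding is still unknown.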

Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os

def all_encodings():
    modnames = set([modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        continue
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))

which yields

Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ

Therefore, try changing

with open('my_file.csv', 'r', newline='') as csvfile:

so that it names one of those encodings, such as:

with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
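As a self-contained check of that suggestion, the sketch below writes a small CSV in Mac Roman and reads it back with the csv module. The file contents are invented for the demonstration, and whether your real file is actually Mac Roman remains a guess.

```python
import csv
import os
import tempfile

# Hypothetical CSV: in Mac Roman the letter ñ is stored as the single byte 0x96.
raw = "name,language\r\nAna,Español\r\n".encode("mac_roman")
path = os.path.join(tempfile.mkdtemp(), "my_file.csv")
with open(path, "wb") as f:
    f.write(raw)

# Passing the encoding explicitly lets the text layer decode 0x96 correctly.
with open(path, "r", encoding="mac_roman", newline="") as csvfile:
    rows = list(csv.reader(csvfile))

print(rows)
```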

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte error in python while reading a csv file

Without seeing the binary content of the file it is difficult to guess the actual encoding, but UTF-8, with or without a BOM (byte order mark), cannot start with 0xFF.

If it starts with 0xFF, that suggests it is probably little-endian UTF-16 or UTF-32, which are the only Unicode serialisations whose byte order mark starts with 0xFF.

https://en.wikipedia.org/wiki/Byte_order_mark gives some explanation.

It is also possible that it is a Persian-specific character set. National character sets should be avoided when generating your source CSV files if a Unicode option is available.
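One way to test the BOM hypothesis is to look at the first few bytes of the file. The following is a rough sketch, not a full encoding detector: the helper name guess_encoding_from_bom is made up here, and it only recognises the BOM constants from the standard codecs module.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes (FF FE)
# as the UTF-16 LE BOM, so the longer UTF-32 marks are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding_from_bom(head):
    """Return a codec name if `head` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

# Demo data: the "utf-16" codec writes a BOM before the encoded text.
sample = "a,b\n1,2\n".encode("utf-16")
guess = guess_encoding_from_bom(sample[:4])
print(guess, repr(sample.decode(guess)))
```

For a real file you would read the first four bytes in 'rb' mode and pass them to the helper. A result of None means only that no BOM was found, not that the file is UTF-8.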

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file

It seems the file is not encoded in UTF-8. Could you try opening the file using io.open with latin-1 encoding instead?

from textblob import TextBlob
import io

# dummy variables initialization
pos_correct = 0
pos_count = 0

with io.open("positive.txt", encoding='latin-1') as f:
    for line in f.read().split('\n'):
        analysis = TextBlob(line)
        if analysis.sentiment.polarity > 0:
            pos_correct += 1
        pos_count += 1
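A caveat worth illustrating: latin-1 never raises, because every byte value maps to some code point, but that does not prove the file really is latin-1. In the sketch below the sample bytes are invented, and the idea that the file might be Windows-1252 is only an assumption. It shows how the byte 0x97 from the error message decodes under each codec.

```python
raw = b"wait \x97 what"

# latin-1 maps 0x97 to the invisible control character U+0097.
as_latin1 = raw.decode("latin-1")

# Windows-1252 maps the same byte to an em dash (U+2014).
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1), repr(as_cp1252))
```

If the text contains odd invisible characters after decoding as latin-1, trying encoding='cp1252' in the open() call may give a more plausible result.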

