Correctly Reading Text from a Windows-1252 (cp1252) File in Python

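CP1252 cannot represent ā (a with macron); your input contains the similar-looking character â (a with circumflex, byte 0xE2). In Python 2.x, repr just displays an ASCII-only representation of a unicode string, which is why the character shows up as the escape \xe2: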

>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
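To check for yourself which of the two characters Windows-1252 can hold, you can simply try to encode them. This is a minimal Python 3 sketch; the helper name can_encode is made up for illustration and is not part of any library.

```python
def can_encode(text, encoding="cp1252"):
    """Return True if every character in text exists in the given encoding.

    can_encode is a hypothetical helper written for this example.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("J\u00e2nis"))  # a with circumflex (U+00E2) -> True
print(can_encode("J\u0101nis"))  # a with macron (U+0101) -> False
```

If the name really should contain ā, the file was probably produced in a different encoding (or the character was already lost before the file was written).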

How to read ® character from Windows-1252 file and write to UTF-8 file

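Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â and ® is the single byte 0xAE. The two-byte sequence 0xC2 0xAE is how UTF-8 encodes ®, so the file is most likely UTF-8 already.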

However, you should just use

of.write(line)

since encoding properly is the whole reason you're using codecs in the first place.
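As a rough illustration of the same idea in Python 3, where the built-in open() takes an encoding argument and codecs is not needed: decode on read, write plain str, and let the output file encode. The function name transcode and the file names are assumptions made for this sketch.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and re-encoding on write."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
            for line in src:
                dst.write(line)  # write str; the file object does the encoding


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")    # hypothetical file names
dst_path = os.path.join(workdir, "out.txt")

# In genuine Windows-1252 data, the registered sign is the single byte 0xAE.
with open(src_path, "wb") as f:
    f.write(b"Acme\xae\n")

transcode(src_path, dst_path, "cp1252")

with open(dst_path, "rb") as f:
    result = f.read()
print(result)  # b'Acme\xc2\xae\n'
```

Note that decoding the UTF-8 bytes b'\xc2\xae' as cp1252 yields the two characters 'Â®', which is the usual symptom of picking the wrong source encoding.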

Python 3 Default Encoding cp1252

According to What’s New In Python 3.0,

There is a platform-dependent default encoding […] In many cases, but not all, the system default is UTF-8; you should never count on this default.

and

PEP 3120: The default source encoding is now UTF-8.

In other words, Python opens source files as UTF-8 by default, but any interaction with the filesystem will depend on the environment. It's strongly recommended to use open(filename, encoding='utf-8') to read a file.

Another change is that b'bytes'.decode() and 'str'.encode() with no argument use utf-8 instead of ascii.

Python 3.6 changes some more defaults:

PEP 529: Change Windows filesystem encoding to UTF-8

PEP 528: Change Windows console encoding to UTF-8

But the default encoding for open() is still whatever Python manages to infer from the environment.

Python 3.7 added an opt-in UTF-8 mode (enabled with -X utf8 or the PYTHONUTF8=1 environment variable) in which the locale encoding is ignored and UTF-8 is used everywhere (except for the specific cases where Windows uses UTF-16, I suppose). See PEP 540 and the corresponding Issue 29240.
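If you are unsure what your own environment does, you can ask Python. The sketch below only reports the various defaults; the function name describe_defaults is invented for this example, and sys.flags.utf8_mode is only present on Python 3.7 and later, hence the getattr.

```python
import locale
import sys


def describe_defaults():
    """Collect the encodings Python falls back to when none is given."""
    return {
        # Used by str.encode() and bytes.decode() with no argument.
        "str/bytes default": sys.getdefaultencoding(),
        # Used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Locale encoding, which open() falls back to outside UTF-8 mode.
        "locale preferred": locale.getpreferredencoding(False),
        # Nonzero when UTF-8 mode is on; None on versions before 3.7.
        "utf8_mode flag": getattr(sys.flags, "utf8_mode", None),
    }


defaults = describe_defaults()
for name, value in defaults.items():
    print(f"{name}: {value}")
```

Whatever this prints on your machine, passing encoding= explicitly to open() remains the portable choice.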

Handling Encoding Errors in a UTF-8 File with Python3

Using the original data from your answer, you've got mojibake from a double-encode. You need a double-decode to translate it properly.

>>> s = b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'
>>> s.decode('utf8').encode('latin1').decode('cp1252')
'# ::snt That’s what we’re with…You’re not sittin’ there in a back alley and sayin’ hey what do you say, five bucks?\n'

The data is valid UTF-8, but decoding it yields code points (such as U+0092 and U+0085) whose values are really Windows-1252 byte values. The .encode('latin1') step converts those code points 1:1 back to bytes, because Latin-1 corresponds exactly to the first 256 code points of Unicode; the resulting bytes can then be decoded correctly as Windows-1252.
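Wrapped up as a function, the double-decode might look like the sketch below. The name fix_cp1252_mojibake is hypothetical, and the function assumes the input has exactly this kind of damage: it will raise if the text contains characters above U+00FF or bytes that Windows-1252 leaves undefined.

```python
def fix_cp1252_mojibake(raw):
    """Undo 'cp1252 bytes read as Latin-1, then saved as UTF-8' damage."""
    text = raw.decode("utf-8")               # code points like U+0092, U+0085
    original_bytes = text.encode("latin-1")  # 1:1 back to bytes 0x92, 0x85
    return original_bytes.decode("cp1252")   # the intended characters


sample = b"That\xc2\x92s what we\xc2\x92re with\xc2\x85"
print(fix_cp1252_mojibake(sample))  # That’s what we’re with…
```

Because the repair is only valid for text that was actually double-encoded, it is worth applying it to a sample first and checking the result by eye before converting a whole file.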

Python 3 chokes on CP-1252/ANSI reading

Position 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it maps to the U+0081 HIGH OCTET PRESET (HOP) control character. I can reproduce your error in Python 3.1 like this:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

or with an actual file:

>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:

>>> open('test.txt', encoding='latin-1').read()
'\x81\n'

Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
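To make the difference concrete, here is a small comparison of the two codecs on the same bytes. The sample bytes are made up for illustration; errors="replace" is one possible way to get past the undefined 0x81 while still decoding as cp1252, at the cost of turning that byte into U+FFFD.

```python
# 0x93 and 0x94 are curly quotes in Windows-1252; 0x81 is unassigned there.
data = b"\x93quoted\x94 \x81"

# Latin-1 never fails, but maps 0x80-0x9F to invisible C1 control characters.
as_latin1 = data.decode("latin-1")

# cp1252 gives the smart quotes; the undefined byte becomes U+FFFD.
as_cp1252 = data.decode("cp1252", errors="replace")

print(ascii(as_latin1))
print(ascii(as_cp1252))
```

So decoding as Latin-1 silences the error, but any smart quotes in the file come out as control characters rather than the punctuation the author intended.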

How to convert cp1252 to UTF-8 when export csv file using python

The UnicodeEncodeError you are facing occurs when you write the data to the CSV output file.
As the error message tells us, Python uses a "charmap" codec which doesn't support the characters contained in your data.
This usually happens when you open a file without specifying the encoding parameter on a Windows machine.

In the attached code document (comment link), snippet no. 10, we can see that this is the case.
You wrote:

with open('wongnai.csv', 'w', newline='') as record:
    fieldnames = ...

In this case, Python uses a platform-dependent default encoding, which is usually some 8-bit encoding on Windows machines.
Specify a codec that supports all of Unicode, and writing the file should succeed:

with open('wongnai.csv', 'w', newline='', encoding='utf16') as record:
    fieldnames = ...

You can also use "utf8" or "utf32" instead of "utf16", of course.
UTF-8 is very popular for saving files in Unix environments and on the Internet, but if you are planning to open the CSV file with Excel later on, you might have trouble getting the application to display the data properly.
A more Windows-proof (but technically non-standard) solution is to use "utf-8-sig", which writes a byte order mark (the bytes EF BB BF) at the beginning of the file to help Windows programs recognize that it's UTF-8.
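A minimal sketch of the "utf-8-sig" variant follows. The field names and the sample row are invented for illustration (your real fieldnames will differ), and the file is written to a temporary directory rather than next to your script.

```python
import csv
import os
import tempfile

# Hypothetical data standing in for the scraped records.
fieldnames = ["name", "rating"]
rows = [{"name": "ส้มตำ", "rating": "4.5"}]

path = os.path.join(tempfile.mkdtemp(), "wongnai.csv")

# utf-8-sig writes the byte order mark once, at the start of the file.
with open(path, "w", newline="", encoding="utf-8-sig") as record:
    writer = csv.DictWriter(record, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf'

# Reading with utf-8-sig strips the mark again, so the header stays clean.
with open(path, newline="", encoding="utf-8-sig") as f:
    rows_back = [dict(row) for row in csv.DictReader(f)]
print(rows_back == rows)  # True
```

If you read such a file back with plain "utf-8" instead, the first header name will start with an invisible U+FEFF character, so use "utf-8-sig" on both sides.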


