Correctly reading text from a Windows-1252 (cp1252) file in Python
CP1252 cannot represent ā; your input contains the similar character â. repr
just displays an ASCII representation of a unicode string in Python 2.x:
>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
How to read ® character from Windows-1252 file and write to UTF-8 file
Your file is not in Windows-1252 if 0xC2 should represent the ®
character; in Windows-1252, 0xC2 is Â.
However, you should just use
of.write(line)
since encoding properly is the whole reason you're using codecs
in the first place.
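A minimal sketch of that transcoding loop, assuming Python 3's built-in open() (the filenames here are placeholders):

```python
# Create a small sample file in Windows-1252 (0xAE is the ® byte).
with open('input.txt', 'wb') as f:
    f.write('Caf\u00e9 \u00ae\n'.encode('cp1252'))

# Decode cp1252 on read, encode UTF-8 on write; of.write(line) is all
# that's needed, because the file objects handle the codecs themselves.
with open('input.txt', encoding='cp1252') as inf, \
        open('output.txt', 'w', encoding='utf-8') as of:
    for line in inf:
        of.write(line)
```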
Python 3 Default Encoding cp1252
According to What’s New In Python 3.0,
There is a platform-dependent default encoding […] In many cases, but not all, the system default is UTF-8; you should never count on this default.
and
PEP 3120: The default source encoding is now UTF-8.
In other words, Python opens source files as UTF-8 by default, but any interaction with the filesystem will depend on the environment. It's strongly recommended to use open(filename, encoding='utf-8')
to read a file.
Another change is that b'bytes'.decode()
and 'str'.encode()
with no argument use UTF-8 instead of ASCII.
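For example, in Python 3 these round-trip through UTF-8 without any explicit encoding argument:

```python
# str.encode() and bytes.decode() default to UTF-8 in Python 3.
# U+00E2 (â) encodes to the two bytes \xc3\xa2 in UTF-8.
assert 'Jânis'.encode() == b'J\xc3\xa2nis'
assert b'J\xc3\xa2nis'.decode() == 'Jânis'
```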
Python 3.6 changes some more defaults:
PEP 529: Change Windows filesystem encoding to UTF-8
PEP 528: Change Windows console encoding to UTF-8
But the default encoding for open()
is still whatever Python manages to infer from the environment.
It appears that 3.7 will add an (opt-in!) mode where the locale encoding from the environment is ignored, and everything is UTF-8 all the time (except for specific cases where Windows uses UTF-16, I suppose). See PEP 540 and the corresponding Issue 29240.
Handling Encoding Errors in a UTF-8 File with Python3
Using the original data from your answer, you've got mojibake from a double-encode. You need a double-decode to translate it properly.
>>> s = b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'
>>> s.decode('utf8').encode('latin1').decode('cp1252')
'# ::snt That’s what we’re with…You’re not sittin’ there in a back alley and sayin’ hey what do you say, five bucks?\n'
The data is actually UTF-8 encoded, but after the first decode the offending code points are really Windows-1252
byte values that were read as Latin-1. The .encode('latin1')
converts those code points 1:1 back to bytes, since the latin1
encoding is exactly the first 256 code points of Unicode; the result can then be decoded correctly as Windows-1252.
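Wrapped as a tiny helper (the name fix_mojibake is just for illustration), the round trip looks like this; it only works when every intermediate code point is below 256 and assigned in cp1252:

```python
def fix_mojibake(raw: bytes) -> str:
    """Undo a cp1252-bytes-read-as-Latin-1 double encode."""
    return raw.decode('utf8').encode('latin1').decode('cp1252')

# \xc2\x92 decodes to U+0092, which Windows-1252 maps to the ’ quote.
print(fix_mojibake(b'That\xc2\x92s what we\xc2\x92re'))  # That’s what we’re
```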
Python 3 chokes on CP-1252/ANSI reading
Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding
argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
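If stray bytes like 0x81 are expected and you just want decoding to survive, the standard errors parameter (accepted by both decode() and open()) is another option besides switching to Latin-1:

```python
data = b'smart quote: \x93hi\x94, stray byte: \x81\n'

# Strict cp1252 decoding raises on the unassigned 0x81...
try:
    data.decode('cp1252')
except UnicodeDecodeError:
    pass

# ...but errors='replace' substitutes U+FFFD and keeps going;
# 0x93/0x94 still decode to the cp1252 smart quotes.
print(data.decode('cp1252', errors='replace'))
```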
How to convert cp1252 to UTF-8 when export csv file using python
The UnicodeEncodeError
you are facing occurs when you write the data to the CSV output file.
As the error message tells us, Python uses a "charmap" codec which doesn't support the characters contained in your data.
This usually happens when you open
a file without specifying the encoding parameter on a Windows machine.
In the attached code document (comment link), snippet no. 10, we can see that this is the case.
You wrote:
with open('wongnai.csv', 'w', newline='') as record:
fieldnames = ...
In this case, Python uses a platform-dependent default encoding, which is usually some 8-bit encoding on Windows machines.
Specify a codec that supports all of Unicode, and writing the file should succeed:
with open('wongnai.csv', 'w', newline='', encoding='utf16') as record:
fieldnames = ...
You can also use "utf8" or "utf32" instead of "utf16", of course.
UTF-8 is very popular for saving files in Unix environments and on the Internet, but if you are planning to open the CSV file with Excel later on, you might have trouble getting the application to display the data properly.
A more Windows-proof (but technically non-standard) solution is to use "utf-8-sig", which writes a byte order mark (BOM) at the beginning of the file to help Windows programs recognize it as UTF-8.
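A sketch of that BOM-friendly variant, with sample field names and data modeled loosely on the snippet above:

```python
import csv

rows = [{'name': 'Dow Jones', 'review': '“great”'}]  # sample data

# utf-8-sig writes the BOM first, so Excel recognizes the file as UTF-8.
with open('wongnai.csv', 'w', newline='', encoding='utf-8-sig') as record:
    writer = csv.DictWriter(record, fieldnames=['name', 'review'])
    writer.writeheader()
    writer.writerows(rows)
```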