Encoding Cp-1252 as Utf-8

How to convert cp1252 to UTF-8 when exporting a CSV file using Python

The UnicodeEncodeError you are facing occurs when you write the data to the CSV output file.
As the error message tells us, Python uses a "charmap" codec which doesn't support the characters contained in your data.
This usually happens when you open a file without specifying the encoding parameter on a Windows machine.
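
The failure can be reproduced without touching any file. The sketch below is only an illustration: the Thai sample string is hypothetical (the original data is not shown), and cp1252 is assumed as the Windows default, which depends on the machine's locale and Python configuration.

```python
import locale

# Encoding that open() falls back to when no encoding argument is given.
# On many Western-locale Windows machines this is cp1252 (an assumption here).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Hypothetical sample text; any character outside cp1252 behaves the same way.
sample = "ร้านอาหาร"

error_message = None
try:
    sample.encode("cp1252")
except UnicodeEncodeError as exc:
    # The message names the "charmap" codec, as in the reported error.
    error_message = str(exc)
    print(error_message)
```

Encoding the same string with "utf-8" instead succeeds, because UTF-8 can represent every Unicode character.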

In the attached code document (comment link), snippet no. 10, we can see that this is the case.
You wrote:

with open('wongnai.csv', 'w', newline='') as record:
    fieldnames = ...

In this case, Python uses a platform-dependent default encoding, which is usually some 8-bit encoding on Windows machines.
Specify a codec that supports all of Unicode, and writing the file should succeed:

with open('wongnai.csv', 'w', newline='', encoding='utf16') as record:
    fieldnames = ...

You can also use "utf8" or "utf32" instead of "utf16", of course.
UTF-8 is very popular for saving files in Unix environments and on the Internet, but if you are planning to open the CSV file with Excel later on, you might have some trouble getting the application to display the data properly.
A more Windows-proof (but technically non-standard) solution is to use "utf-8-sig", which writes a byte order mark (BOM) at the beginning of the file to help Windows programs recognize it as UTF-8.
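
As a minimal sketch of the "utf-8-sig" variant: the field names, the row, and the temporary output path below are all made up for illustration, since the original fieldnames are elided in the snippet.

```python
import csv
import os
import tempfile

# Hypothetical data; the real field names are not shown in the question.
rows = [{"name": "ครัวไทย", "rating": "4.5"}]
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# "utf-8-sig" writes the BOM first, then encodes everything as UTF-8.
with open(path, "w", newline="", encoding="utf-8-sig") as record:
    writer = csv.DictWriter(record, fieldnames=["name", "rating"])
    writer.writeheader()
    writer.writerows(rows)

# Read the raw bytes back to inspect the start of the file.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])
```

When reading such a file back in Python, opening it with encoding="utf-8-sig" strips the BOM again, so it does not end up in the first header name.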

Windows-1252 to UTF-8 encoding

How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
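
A validity check of this kind could look roughly like the following in Python. The function name is_valid_utf8 is made up for this sketch, and a True result is still only suggestive, since pure ASCII data passes as well.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways: only the UTF-8 bytes pass the check.
utf8_bytes = "café".encode("utf-8")
cp1252_bytes = "café".encode("cp1252")
print(is_valid_utf8(utf8_bytes), is_valid_utf8(cp1252_bytes))
```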

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
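
The same idea can be sketched in Python rather than C#. The helper name to_utf8 and the fallback order (try UTF-8 first, then assume Windows-1252) are assumptions of this sketch, not something the recode tool does.

```python
def to_utf8(data: bytes) -> bytes:
    """Return the data as UTF-8, guessing Windows-1252 if it is not valid UTF-8."""
    try:
        # "utf-8-sig" also accepts input without a BOM and drops one if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Heuristic fallback: treat the bytes as Windows-1252.
        # Bytes the codec leaves unassigned become U+FFFD.
        text = data.decode("cp1252", errors="replace")
    return text.encode("utf-8")


legacy = "café €".encode("cp1252")
converted = to_utf8(legacy)
print(converted.decode("utf-8"))
```

Input that is already valid UTF-8 comes back unchanged apart from a leading BOM being removed, so running the conversion twice does not double-encode the text.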

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.


