How to Convert a File to UTF-8 in Python

How to convert a file to utf-8 in Python?

You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576  # or some other desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
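If you would rather not hand-roll the copy loop, shutil.copyfileobj does the same chunked read/write internally. A minimal sketch, assuming the same placeholder names as above:

import codecs
import shutil

with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        # copies in chunks of BLOCKSIZE, just like the explicit loop above
        shutil.copyfileobj(sourceFile, targetFile, BLOCKSIZE)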


Python 3 unicode to utf-8 on file

What Notepad calls "Unicode" is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To convert to UTF-8, you could use:

with open('log.txt', encoding='utf16') as f:
    data = f.read()
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)

Note that many Windows editors expect a UTF-8 signature (BOM) at the beginning of the file and may otherwise assume ANSI. ANSI is really the locale's default code page: on US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
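For example, a variant of the writing step above using that codec (a minimal sketch; 'utf-8-sig' simply prepends the EF BB BF signature before the encoded text):

with open('utf8.txt', 'w', encoding='utf-8-sig') as f:
    f.write(data)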

Python reading from a file and saving to utf-8

Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):

with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
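To see what platform-dependent default would be used if you omitted the encoding parameter, you can query it directly:

import locale

# The encoding open() falls back on when no encoding argument is given
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8' or 'cp1252'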

If you are still using Python 2, or need Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:

import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)

Fastest way to convert file from latin1 to utf-8 in python

You could use blocks larger than one line, and do binary I/O -- each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O there):

BLOCKSIZE = 1024*1024
# Write to a separate output file; opening the same path for writing
# would truncate it before it could be read.
with open(tmpfile, 'rb') as inf:
    with open(outfile, 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            if not data:
                break
            converted = data.decode('latin1').encode('utf-8')
            ouf.write(converted)

The byte-by-byte parsing implied by by-line reading, the line-end conversion (not an issue on Linux), and the codecs.open-style encoding-decoding should be part of what's slowing you down. This approach is also portable (like yours is), since control characters such as \n need no translation between these codecs on any OS.

This only works for input codecs that have no multibyte characters, so a block boundary can never split a character in half; 'latin1' is one of those (it does not matter whether the output codec has multibyte characters or not).

Try different block sizes to find the sweet spot performance-wise, depending on your disk, filesystem and available RAM.
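One way to find that sweet spot is a rough timing loop. The sketch below wraps the loop above in a hypothetical convert helper; 'input.txt' and 'output.txt' are placeholder names:

import time

def convert(src, dst, blocksize):
    with open(src, 'rb') as inf:
        with open(dst, 'wb') as ouf:
            while True:
                data = inf.read(blocksize)
                if not data:
                    break
                ouf.write(data.decode('latin1').encode('utf-8'))

for blocksize in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    start = time.perf_counter()
    convert('input.txt', 'output.txt', blocksize)
    print(blocksize, time.perf_counter() - start)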


Convert string of unknown encoding to UTF-8

"Träume groß" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.

A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:

print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))

gives as expected:

"Träume groß"

But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
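To see how the mojibake arises in the first place, here is the round trip as a sketch:

original = 'Träume groß'
mangled = original.encode('utf8').decode('cp1252')   # 'TrÃ¤ume groÃŸ'
restored = mangled.encode('cp1252').decode('utf8')   # 'Träume groß' again
print(mangled, '->', restored)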

How to convert ISO-8859-1 to UTF-8 using Python 3.7.4

decode is a member of the bytes type:

>>> help(bytes.decode)
Help on method_descriptor:

decode(self, /, encoding='utf-8', errors='strict')
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.

So inputText needs to be of type bytes, not str:

>>> inputText = b"\xC4pple"
>>> inputText.decode('iso-8859-1')
'Äpple'
>>> inputText.decode('iso-8859-1').encode('utf8')
b'\xc3\x84pple'

Note that the result of decode is type str and of encode is type bytes.
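The errors parameter from the help text above controls what happens when the bytes do not fit the chosen codec. For example, decoding these ISO-8859-1 bytes as UTF-8 fails under the default 'strict' handler, but the softer handlers behave like this:

>>> b"\xC4pple".decode('utf-8', errors='replace')
'�pple'
>>> b"\xC4pple".decode('utf-8', errors='ignore')
'pple'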

Convert utf-16 to utf-8 using python

One option is to convert the file in binary mode. Note that iterating a UTF-16 file line by line in binary mode would split the two bytes of each newline character across lines, so it is safer to decode the whole contents at once:

with open(r'D:\_apps\aaa\output\srcfile', 'rb') as source_file, \
        open(r'D:\_apps\aaa\output\destfile', 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

Or you could open the files with your desired encoding:

with open(r'D:\_apps\aaa\output\srcfile', 'r', encoding='utf-16') as source_file, \
        open(r'D:\_apps\aaa\output\destfile', 'w+', encoding='utf-8') as dest_file:
    for line in source_file:
        dest_file.write(line)
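Of the two, the second variant is usually the clearer choice: text mode handles the decoding and encoding for you and normalizes line endings, while the binary variant passes the transcoded bytes through untouched.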

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1 (backslash, x, c, 3 and so on), not the two bytes 0xC3 0xA1. Those are 8 bytes, and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
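As a quick sketch of that happy path (Python 3 syntax, reusing the f2 file name from above): once the file contains the real character rather than an escape sequence, reading it back is unremarkable:

with open('f2', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')   # the actual character, not backslash escapes

with open('f2', encoding='utf-8') as f:
    print(f.read())        # Capitán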

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from str to bytes, and decode from bytes to str. unicode_escape needs to start from a bytes object in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
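If the one-liner is hard to follow, here it is one hop at a time (the intermediate values are shown in the comments):

s = 'Capit\\xc3\\xa1n\n'            # 15 characters: literal backslash escapes
b = s.encode('ascii')               # b'Capit\\xc3\\xa1n\n' - same text as bytes
u = b.decode('unicode_escape')      # 'Capit\xc3\xa1n\n' - real code points now
raw = u.encode('latin-1')           # b'Capit\xc3\xa1n\n' - the raw UTF-8 bytes
text = raw.decode('utf-8')          # 'Capitán\n'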

