How to convert a file to utf-8 in Python?
You can use the codecs module, like this:
import codecs

BLOCKSIZE = 1048576  # or some other desired size in bytes

with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
EDIT: added the BLOCKSIZE parameter to control file chunk size.
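The snippet above can be wrapped into a small reusable function; a minimal sketch (the function name, and the use of tempfile for the demo, are my own additions):

```python
import codecs
import os
import tempfile

BLOCKSIZE = 1048576

def convert_to_utf8(source, target, source_encoding):
    """Re-encode `source` (read as `source_encoding`) into `target` as UTF-8."""
    with codecs.open(source, "r", source_encoding) as source_file:
        with codecs.open(target, "w", "utf-8") as target_file:
            while True:
                contents = source_file.read(BLOCKSIZE)
                if not contents:
                    break
                target_file.write(contents)

# Demo: write a latin-1 file and convert it.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")
with open(src, "wb") as f:
    f.write("café\n".encode("latin-1"))
convert_to_utf8(src, dst, "latin-1")
with open(dst, "rb") as f:
    print(f.read())  # the UTF-8 bytes: b'caf\xc3\xa9\n'
```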
Python 3 unicode to utf-8 on file
What Notepad considers Unicode is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt', encoding='utf16') as f:
    data = f.read()
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
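You can see both signatures directly by inspecting the raw bytes; a quick sketch (the file names are illustrative):

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path16 = os.path.join(tmpdir, "log.txt")
path8 = os.path.join(tmpdir, "utf8.txt")

# A "Unicode" file as Notepad writes it: UTF-16 with a BOM
# (b'\xff\xfe' on a little-endian machine).
with open(path16, "w", encoding="utf16") as f:
    f.write("hello")
print(open(path16, "rb").read()[:2])

# utf-8-sig prepends the UTF-8 signature EF BB BF for editors that expect it.
with open(path8, "w", encoding="utf-8-sig") as f:
    f.write("hello")
print(open(path8, "rb").read()[:3])  # b'\xef\xbb\xbf'
```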
Python reading from a file and saving to utf-8
Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
If still using Python 2, or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:
import io

with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
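To see which default your platform would use if you omitted encoding, you can check it directly:

```python
import locale

# The encoding open() uses when no encoding= argument is given (Python 3).
default = locale.getpreferredencoding(False)
print(default)  # e.g. 'UTF-8' on most Linux systems, 'cp1252' on US Windows
```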
Fastest way to convert file from latin1 to utf-8 in python
You could use blocks larger than one line, and do binary I/O -- each might speed things up a bit (though on Linux binary I/O won't help, as it's identical to text I/O):
BLOCKSIZE = 1024 * 1024

with open(tmpfile, 'rb') as inf:
    # The output must be a different path from the input; opening the
    # same file for writing would truncate it before it is read.
    with open(outfile, 'wb') as ouf:
        while True:
            data = inf.read(BLOCKSIZE)
            if not data:
                break
            converted = data.decode('latin1').encode('utf-8')
            ouf.write(converted)
The byte-by-byte parsing implied by line-by-line reading, line-end conversion (not on Linux;-), and codecs.open-style encoding-decoding should be part of what's slowing you down. This approach is also portable (like yours is), since control characters such as \n need no translation between these codecs anyway (in any OS).
This only works for input codecs that have no multibyte characters, but latin1 is one of those (it does not matter whether the output codec has such characters or not).
Try different block sizes to find the sweet spot performance-wise, depending on your disk, filesystem and available RAM.
Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.
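If the source encoding can contain multibyte characters (UTF-16, or UTF-8 itself), a chunk boundary may split a character. One way around that is an incremental decoder, which buffers an incomplete sequence until the next chunk arrives; a sketch under that assumption (the function and file names are mine):

```python
import codecs
import os
import tempfile

BLOCKSIZE = 1024 * 1024

def recode(src_path, dst_path, src_encoding, blocksize=BLOCKSIZE):
    """Convert src_path from src_encoding to UTF-8 in fixed-size chunks,
    tolerating multibyte characters split across chunk boundaries."""
    decoder = codecs.getincrementaldecoder(src_encoding)()
    with open(src_path, 'rb') as inf, open(dst_path, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(decoder.decode(data).encode('utf-8'))
        # Flush any bytes still buffered in the decoder.
        ouf.write(decoder.decode(b'', final=True).encode('utf-8'))

# Demo with a deliberately tiny block size to force mid-character splits.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "src.txt")
dst = os.path.join(tmpdir, "dst.txt")
with open(src, "wb") as f:
    f.write("Träume groß".encode("utf-16"))
recode(src, dst, "utf-16", blocksize=3)
print(open(dst, "rb").read().decode("utf-8"))  # Träume groß
```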
Convert string of unknown encoding to UTF-8
"Träume groß"
is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:
print('"Träume groß"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
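The round trip is easy to demonstrate; this is how such mojibake typically comes about in the first place:

```python
text = "Träume groß"

# The damage: UTF-8 bytes were decoded with the wrong codec (cp1252).
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)  # TrÃ¤ume groÃŸ

# The workaround: reverse the wrong step, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Träume groß
```

Note that this reversal is not always possible: a few byte values (e.g. 0x81, 0x8D) are undefined in cp1252, so some UTF-8 sequences cannot survive the wrong decode at all.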
How to convert ISO-8859-1 to UTF-8 using Python 3.7.4
decode is a member of the bytes type:
>>> help(bytes.decode)
Help on method_descriptor:

decode(self, /, encoding='utf-8', errors='strict')
    Decode the bytes using the codec registered for encoding.

    encoding
      The encoding with which to decode the bytes.
    errors
      The error handling scheme to use for the handling of decoding errors.
      The default is 'strict' meaning that decoding errors raise a
      UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
      as well as any other name registered with codecs.register_error that
      can handle UnicodeDecodeErrors.
So inputText needs to be of type bytes, not str:
>>> inputText = b"\xC4pple"
>>> inputText.decode('iso-8859-1')
'Äpple'
>>> inputText.decode('iso-8859-1').encode('utf8')
b'\xc3\x84pple'
Note that the result of decode is type str, and the result of encode is type bytes.
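The errors parameter mentioned in the help text matters whenever the bytes are not valid in the chosen encoding; for example:

```python
data = b"\xc4pple"  # 0xC4 is 'Ä' in ISO-8859-1, but not valid UTF-8 here

# 'strict' (the default) raises UnicodeDecodeError:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)

# 'replace' substitutes U+FFFD; 'ignore' drops the bad byte:
print(data.decode("utf-8", errors="replace"))  # �pple
print(data.decode("utf-8", errors="ignore"))   # pple
```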
Convert utf-16 to utf-8 using python
An option is to convert the file line by line:

with open(r'D:\_apps\aaa\output\srcfile', 'rb') as source_file, \
        open(r'D:\_apps\aaa\output\destfile', 'w+b') as dest_file:
    for line in source_file:
        dest_file.write(line.decode('utf-16').encode('utf-8'))

Note, however, that iterating a binary file splits on the raw b'\n' byte, which can land in the middle of a UTF-16 code unit, so this is only reliable when each resulting chunk still decodes cleanly on its own.
Or you could open the files with your desired encodings:

with open(r'D:\_apps\aaa\output\srcfile', 'r', encoding='utf-16') as source_file, \
        open(r'D:\_apps\aaa\output\destfile', 'w+', encoding='utf-8') as dest_file:
    for line in source_file:
        dest_file.write(line)
Unicode (UTF-8) reading and writing to files in Python
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
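In either version, the straightforward path is to let the codec do the work when writing; then the bytes on disk are the UTF-8 encoding of the character (3.x shown; the file name is illustrative, and newline='' is passed to avoid platform newline translation in the demo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "f2")

# Write the real character; the utf-8 codec produces the two bytes C3 A1.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("Capitán\n")

with open(path, "rb") as f:
    print(f.read())  # b'Capit\xc3\xa1n\n'
```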