Write to UTF-8 File in Python

Write to UTF-8 file in Python

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
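For completeness, a minimal Python 3 sketch of the utf-8-sig approach (a temporary path stands in for the filename): the codec emits the BOM on the first write, so you never handle it yourself.

```python
import codecs
import os
import tempfile

# "utf-8-sig" writes the UTF-8 BOM for you; no need to write U+FEFF by hand.
path = os.path.join(tempfile.mkdtemp(), "lol")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("hello")

with open(path, "rb") as f:
    raw = f.read()

print(raw[:3] == codecs.BOM_UTF8)  # True: the file starts with EF BB BF
```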

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1 (8 bytes), and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs bytes as input in order to process the escape sequences (in the other direction, it adds them); it will then treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
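If this comes up more than once, the 3.x chain can be wrapped in a small helper; the function name here is our own invention:

```python
def unescape_utf8_literal(s):
    """Turn text containing literal \\xNN escapes into the string
    those bytes encode as UTF-8."""
    # ascii: literal escape text -> bytes; unicode_escape: escapes -> the
    # characters 0xC3/0xA1; latin-1: those characters -> the original
    # UTF-8 bytes; utf-8: bytes -> the real text.
    return (s.encode('ascii')
             .decode('unicode_escape')
             .encode('latin-1')
             .decode('utf-8'))

print(unescape_utf8_literal('Capit\\xc3\\xa1n\n'))  # Capitán
```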

Python reading from a file and saving to utf-8

Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):

with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with open(filename, 'w', encoding='utf8') as f:
    f.write(text)

If you are still on Python 2, or need Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:

import io

with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)

Writing to a .txt file (UTF-8), python

The short way (Python 2; note that calling .encode('utf-8') on a byte string implicitly decodes it as ASCII first, so decode with the source file's real encoding before re-encoding):

file('file2.txt', 'w').write(file('file.txt').read().decode('your-source-encoding').encode('utf-8'))

The long way:

data = file('file.txt').read().decode('your-source-encoding')
... process data as a unicode string ...
data = data.encode('utf-8')
file('file2.txt', 'w').write(data)

And using codecs explicitly (data must be a unicode string here, not an already-encoded str):

import codecs
codecs.getwriter('utf-8')(file('/tmp/bla3', 'w')).write(data)
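For Python 3, where the file() builtin no longer exists, open() does the decoding and encoding itself. A sketch of the same conversion, with latin-1 standing in for the source file's actual encoding and a temporary file as sample input:

```python
import os
import tempfile

src_path = os.path.join(tempfile.mkdtemp(), 'file.txt')
dst_path = src_path + '.utf8'

# Create a sample Latin-1 source file to convert.
with open(src_path, 'w', encoding='latin-1') as f:
    f.write('Capitán')

with open(src_path, encoding='latin-1') as src:
    data = src.read()          # decoded to str on read

with open(dst_path, 'w', encoding='utf-8') as dst:
    dst.write(data)            # encoded to UTF-8 on write

with open(dst_path, 'rb') as f:
    print(f.read())            # b'Capit\xc3\xa1n'
```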

How can I create a file with utf-8 in Python?

An empty file is always binary.

$ touch /tmp/foo
$ file -i /tmp/foo
/tmp/foo: inode/x-empty; charset=binary

Put something in it and everything is fine.

$ cat > /tmp/foo 
Rübe
Möhre
Mähne
$ file -i /tmp/foo
/tmp/foo: text/plain; charset=utf-8

Python will do the same as cat.

with open("/tmp/foo", "w") as f:
    f.write("Rübe\n")

Check it:

$ cat /tmp/foo
Rübe
$ file -i /tmp/foo
/tmp/foo: text/plain; charset=utf-8

Edit:

Using Python 2.7, you must encode a Unicode string yourself:

with open("/tmp/foo", "w") as f:
    f.write(u"Rübe\n".encode("UTF-8"))
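Under Python 3 the manual .encode() is unnecessary: pass the encoding to open() and write the str directly. A small runnable sketch (using a temporary path instead of /tmp/foo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'foo')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Rübe\n')          # the str is encoded to UTF-8 on write

with open(path, 'rb') as f:
    raw = f.read()
print(raw)                     # b'R\xc3\xbcbe\n'
```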

Writing to txt file in UTF-8 - Python

The issue here is that you're calling open() on a file without specifying the encoding. As noted in the Python documentation, the default encoding is platform dependent. That's probably why you're seeing different results in Windows and MacOS.

Assuming that the file itself was actually encoded in UTF-8, just specify that when reading the file:

original = open(str(last_uploaded.document), 'r', encoding="utf-8")

How to convert a file to utf-8 in Python?

You can use the codecs module, like this:

import codecs

BLOCKSIZE = 1048576  # or some other desired size, in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
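In Python 3 the built-in open() does the same job as codecs.open here, with the encoding passed as a keyword argument (note that read(n) on a text-mode file counts characters, not bytes). A runnable sketch, with latin-1 standing in for "your-source-encoding" and a temporary file as sample input:

```python
import os
import tempfile

BLOCKSIZE = 1048576  # characters per read() for text-mode files

src_path = os.path.join(tempfile.mkdtemp(), 'src.txt')
dst_path = src_path + '.utf8'

# Sample Latin-1 input to convert.
with open(src_path, 'w', encoding='latin-1') as f:
    f.write('Mähne\n' * 3)

with open(src_path, 'r', encoding='latin-1') as sourceFile:
    with open(dst_path, 'w', encoding='utf-8') as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
```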

why can't I save my file as utf-8 format

Your exception is thrown in 'saveText', but I can't see how you implemented it, so I'll try to reproduce the error and then suggest a fix.

In 'getUrl' you return a decoded string ( .decode('utf-8') ), and my guess is that in 'saveText' you forgot to encode it before writing to the file.

Reproducing the error

Trying to reproduce the error, I did this:

# Python 2: string with non-ASCII chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText might look if the encode step is forgotten:
f = open('test', mode='w')
f.write(s)  # writing a unicode object to a byte-oriented file
f.close()

This gives a similar exception:

---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-36-1309da3ad975> in <module>()
5 # Encode before write
6 f = open('test', mode='w')
----> 7 f.write(s)
8 f.close()

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Two ways of fixing

You can do either:

# String with unicode chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText could be:
# Encode before write
f = open('test', mode='w')
f.write(s.encode('utf-8'))
f.close()

or you can try writing the file using the module 'codecs':

import codecs

# String with unicode chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText could be:
f = codecs.open('test', encoding='utf-8', mode='w')
f.write(s)
f.close()

Hope this helps.

Python codec error during file write with UTF-8 string

I understand you want two things:

  • a way to write arbitrary Unicode characters to a file, and
  • Python 2/3 compatibility.

Using open('out1.txt','w') violates both:

  • The output text stream is opened with a default encoding, which happens to be CP-1252 on your platform (apparently Windows). This codec supports only a subset of Unicode; for example, it lacks all emoji.
  • The open function differs considerably between Python versions. In Python 3, it is the io.open function, which offers a lot of flexibility, such as specifying a text encoding. In Python 2, the returned file handle processes 8-bit strings rather than Unicode strings (text).
  • There's also a portability issue of which you might not be aware: the default encoding for IO is platform dependent, i.e. people running your code might see a different default depending on OS and localisation.

You can avoid all this with io.open('out1.txt', 'w', encoding='utf8'):

  • Use an encoding that supports all characters needed. Using the detected input encoding should work, unless processing introduces characters outside the supported range. Using one of the UTF codecs will always work, with UTF-8 being the most widely used for text files. Note that some Windows apps (like Notepad) tend not to understand UTF-8. There is the utf-8-sig codec that supports writing UTF-8 w/ BOM that makes Windows apps recognize files encoded in UTF-8. That codec also removes the UTF-8 BOM signature from the input stream if present when used for reading.
  • The io module was backported to Python 2.7. This generally qualifies as Py2/3 compatible, since support for versions <= 2.6 ended quite some time ago.
  • Be explicit about the encoding used whenever opening text files. There might be scenarios where the platform-dependent default encoding makes sense, but usually you want control.
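As a runnable sketch of those points, here is io.open writing a character that CP-1252 cannot represent (a temporary path stands in for out1.txt):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out1.txt')
text = u'caf\xe9 \U0001F600\n'   # é plus an emoji outside CP-1252

# io.open is Python 3's open(); on Python 2.7 it behaves the same way.
with io.open(path, 'w', encoding='utf8') as f:
    f.write(text)

with io.open(path, 'r', encoding='utf8') as f:
    print(f.read() == text)      # True: the text round-trips intact
```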

Side note:
You mention a simple heuristic for detecting the input codec.
If there's really no way to obtain this information, you should consider using chardet.


