Working with UTF-8 Encoding in Python Source

Enforcing Python source encoding as UTF-8

I'm not sure I know exactly what compatibility issues you mean, but you seem to be conflating two separate issues. First, when you actually type characters into your source file, they are encoded using a certain encoding, which is determined by your text editor and/or operating system settings. Second, when Python reads your source file, it interprets what it finds according to a certain encoding, and that is what your # -*- coding: ... -*- declaration tells it.

Just because you write # -*- coding: utf-8 -*- at the top of your file doesn't mean your file actually is in UTF-8. That encoding declaration does not "enforce" anything; it just tells Python to assume that the file is in UTF-8.

As a parallel, imagine receiving a document that says at the top "This document is written in Croatian". Upon reading this, you might go get a Croatian dictionary to help you understand the document. However, just because it says that at the top doesn't mean the document actually is in Croatian; anyone can take a document written in Albanian or some other language and write "This document is written in Croatian" at the top --- and in fact, they might do so, if they were unfamiliar with both languages and didn't know how to tell the difference.

Similarly, if you use a text editor that isn't Unicode-aware, it may blithely insert bytes that are not valid UTF-8 into the file, even though you wrote "coding: utf-8" at the top. This will cause problems if you later try to run the file, because Python will think it is in UTF-8 even though it really isn't.
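To make the mismatch concrete, here is a minimal sketch. The file contents and the name saved_bytes are made up for illustration; the bytes stand in for a file whose first line claims UTF-8 but which an editor actually saved as Latin-1.

```python
# Hypothetical file contents: the first line claims UTF-8, but the
# editor wrote the text out as Latin-1, so 'é' became the single byte 0xE9.
saved_bytes = "# -*- coding: utf-8 -*-\nname = 'é'\n".encode("latin-1")

# Decoding with the declared encoding fails, because a lone 0xE9 byte
# followed by an ASCII quote is not a valid UTF-8 sequence.
try:
    saved_bytes.decode("utf-8")
    matches_declaration = True
except UnicodeDecodeError:
    matches_declaration = False

# Decoding with the encoding the editor really used works fine.
as_latin1 = saved_bytes.decode("latin-1")

print(matches_declaration)
```

The declaration is just text in the file; only the bytes the editor writes decide whether the claim is true.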

UTF-8 is still the best encoding to use. Just make sure that your editor is set up so that it really is saving your files as UTF-8.

It's also possible that if someone else gets your code and makes modifications, they could be using an editor that's not using UTF-8, which would likewise cause problems if their editor put non-UTF-8 stuff into the file. This means that if you're sharing code with other people (e.g., you're part of a team developing software), you should all agree on an encoding and use it consistently. It is conceivable that you could be part of an organization that has a policy of using some encoding other than UTF-8 (say, Latin-1), in which case you'd have to set your editor to use that encoding. However, more and more, organizations big enough to share code extensively among different people are realizing that everyone should always be using UTF-8 all the time.
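If a team does settle on UTF-8, a small check can catch files that drift. This is only a sketch: the function name find_non_utf8_files is hypothetical, and it assumes that "the bytes decode cleanly as UTF-8" is a good enough test for your project.

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return the .py files under root whose bytes do not decode as UTF-8."""
    offenders = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            offenders.append(path)
    return offenders
```

Something like find_non_utf8_files(".") could be run before committing; an empty list means every Python file under the current directory is valid UTF-8.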

(Someone who downloads your code off the internet and tries to modify it can also run into the same encoding problems, but if your file is in UTF-8 and has the UTF-8 encoding declaration, then it's self-documenting. If someone else messes it up with another encoding, that's their own fault for not paying attention. You only need to worry about such problems insofar as you actually care about collaborating with others; you can't and shouldn't worry about the myriad mistakes that random people on the internet might make if they come across your code.)

When to use utf8 as a header in py files

Use it wherever your code contains characters outside ASCII, such as:

ă

Without the declaration, the Python 2 interpreter will complain that it doesn't understand the character.

Usually this comes up when you define string constants.

Example: save the following as x.py

print 'ă'

Then start a Python console and import the module:

>>> import x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "x.py", line 1
SyntaxError: Non-ASCII character '\xc4' in file x.py on line 1,
but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
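For comparison, here is a sketch of the same file with the declaration added. It assumes your editor really does save x.py as UTF-8, and the name LETTER is only illustrative. The parenthesized print form runs under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# x.py, assumed to be saved as UTF-8 by the editor.
# The declaration must sit on the first or second line of the file.
LETTER = 'ă'  # a non-ASCII constant, now accepted by the Python 2 parser

print(LETTER)
```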

Why doesn't Python recognize my utf-8 encoded source file?

The encoding your terminal is using doesn't support that character:

>>> '\xdf'.encode('cp866')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xdf' in position 0: character maps to <undefined>

Python is handling it just fine; it's your output encoding that cannot handle it.
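A short sketch of the same point in Python 3: the string itself is fine, and only the step that converts it to a particular output encoding can fail. The variable names are illustrative.

```python
text = '\xdf'  # the character ß as a Python 3 str

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = text.encode('utf-8')

# cp866 has no ß, so a strict encode raises UnicodeEncodeError.
try:
    text.encode('cp866')
    cp866_ok = True
except UnicodeEncodeError:
    cp866_ok = False

# With errors='replace', the unsupported character is swapped for '?'.
fallback = text.encode('cp866', errors='replace')

print(utf8_bytes, cp866_ok, fallback)
```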

You can try running chcp 65001 in the Windows console to switch to the UTF-8 code page; chcp is the Windows command-line command for changing code pages.

My terminal, on OS X (using UTF-8), can handle it just fine:

>>> print('\xdf')
ß

Should I use encoding declaration in Python 3?

Because the default source encoding in Python 3 is UTF-8, you only need to use that declaration when you deviate from the default, or if you rely on other tools (like your IDE or text editor) to make use of that information.

In other words, as far as Python is concerned, only when you want to use an encoding that differs do you have to use that declaration.
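One way to see which encoding Python 3 will assume for a file is tokenize.detect_encoding from the standard library. The helper name source_encoding and the sample byte strings below are assumptions made for illustration.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# No declaration: the default applies.
default = source_encoding(b"x = 1\n")

# A declaration that deviates from the default.
declared = source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")

print(default, declared)
```

Here default comes back as utf-8, and declared comes back as iso-8859-1, which is the normalized name tokenize reports for latin-1.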

Other tools, such as your editor, can support similar syntax, which is why the PEP 263 specification allows for considerable flexibility in the syntax (it must be a comment, the text coding must be there, followed by either a : or = character and optional whitespace, followed by a recognised codec).
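The Python language reference expresses this rule as the regular expression coding[=:]\s*([-\w.]+), matched against a comment on the first or second line of the file. A rough sketch of how that accepts several comment styles follows; the helper name declared_codec and the sample lines are mine, and this simplified check does not verify that the named codec actually exists.

```python
import re

# The pattern given in the Python language reference for encoding declarations.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_codec(line):
    """Return the codec named in a comment line, or None if there is none."""
    if not line.lstrip().startswith("#"):
        return None  # the declaration must be a comment
    match = CODING_RE.search(line)
    return match.group(1) if match else None


emacs_style = declared_codec("# -*- coding: utf-8 -*-")
vim_style = declared_codec("# vim: set fileencoding=latin-1 :")
plain_style = declared_codec("# coding=cp1252")
not_a_comment = declared_codec("coding = 'utf-8'")
```

The Emacs-style, Vim-style, and bare forms all match because the pattern only looks for the text coding followed by : or =.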

Note that the declaration only applies to how Python reads the source code. It doesn't apply to executing that code, so it has no effect on how printing, opening files, or any other I/O operations translate between bytes and Unicode. For more details on Python, Unicode, and encodings, I strongly urge you to read the Python Unicode HOWTO, or the very thorough Pragmatic Unicode talk by Ned Batchelder.
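Since the source declaration does not cover I/O, the encoding for a file has to be given where the file is opened. A minimal Python 3 sketch, with a made-up file name in a temporary directory:

```python
import os
import tempfile

text = 'ß'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')  # hypothetical file name

    # Choose the encoding explicitly instead of relying on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # The bytes on disk are the UTF-8 encoding of the text.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading back with the same encoding recovers the original string.
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

print(raw)
```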
