Correct Way to Define Python Source Code Encoding

Correct way to define Python source code encoding

From the Python docs:

"If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration"

"The recommended forms of this expression are

# -*- coding: <encoding-name> -*-

which is recognized also by GNU Emacs, and

# vim:fileencoding=<encoding-name>

which is recognized by Bram Moolenaar’s VIM."

So, you can put pretty much anything before the "coding" part, but stick to "coding" (with no prefix) if you want to be 100% python-docs-recommendation-compatible.

More specifically, you need to use whatever is recognized by Python and the specific editing software you use (if it needs/accepts anything at all). E.g. the coding form is recognized (out of the box) by GNU Emacs but not Vim (yes, without a universal agreement, it's essentially a turf war).
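To see which comment styles the quoted pattern accepts, here is a rough sketch that applies the regular expression from the docs to a few candidate lines. The helper name `declared_encoding` and the sample lines are made up for illustration, and the real interpreter applies extra rules (only the first two lines, codec lookup) that this sketch skips.

```python
import re

# Pattern as quoted from the docs above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(comment_line):
    """Return the encoding named in a comment line, or None (hypothetical helper)."""
    # Only comment lines can carry an encoding declaration.
    if not comment_line.lstrip().startswith("#"):
        return None
    match = CODING_RE.search(comment_line)
    return match.group(1) if match else None


samples = [
    "# -*- coding: utf-8 -*-",       # Emacs-style form
    "# vim:fileencoding=latin-1",    # Vim-style form; "fileencoding" ends in "coding"
    "# encoding: utf-8",             # any prefix before "coding" is tolerated
    "# no declaration here",
]

for line in samples:
    print(repr(line), "->", declared_encoding(line))
```

The Vim form matches only because "fileencoding" happens to end in "coding", which is the point made above about prefixes.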

What is the default encoding method for code assumed by Python interpreter?

Without any explicit encoding declaration, the assumed encoding for your source code will be

  • ascii for Python 2.x
  • utf-8 for Python 3.x

See PEP 0263 and "Using source code encoding" for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.

So the default encoding assumed for source code depends directly on the version of the Python interpreter, and it is not configurable.
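If you want to check what Python 3 would assume for a given file, the standard library's `tokenize.detect_encoding` does the detection for you. This is a small sketch for Python 3 only; the helper name `source_encoding` and the sample byte strings are just for illustration.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use for this source (hypothetical helper)."""
    # detect_encoding expects a readline callable that yields bytes.
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# No declaration: Python 3 falls back to its default.
print(source_encoding(b"x = 1\n"))

# Explicit declaration on the first line (the name is normalized by tokenize).
print(source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
```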


Note that the source code encoding is something entirely different from dealing with non-ASCII characters as part of your data in strings.

There are two distinct cases where you may encounter non-ASCII characters:

  • As part of your program's data, at runtime
  • As part of your source code (and since you can't have non-ASCII characters in identifiers in Python 2, that usually means hard-coded string data in your source code or comments).

The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.

So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:

with open('data.txt') as f:
    for line in f:
        # do something with `line`
        pass

But it doesn't contain any non-ASCII characters in the source code, so it doesn't need an encoding declaration at the top of the file. It will, however, need to properly decode line if it wants to turn it into unicode. In Python 2, simply doing unicode(line) uses the system default encoding, which is ascii (a different setting from the default source encoding, although it happens to be ascii as well). So to explicitly decode the string as utf-8 you'd need to do line.decode('utf-8').
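One way to avoid decoding each line by hand is to let the file object do it. The sketch below uses `io.open` with an explicit encoding, which exists in both Python 2.6+ and Python 3; the helper name `read_lines` and the temporary file contents are assumptions made for the demo, and it presumes the data really is UTF-8.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines as decoded (unicode) strings."""
    with io.open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Demo: write the UTF-8 bytes for "Bär" to a temporary file, then read them back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(b"B\xc3\xa4r\n")
    lines = read_lines(path)
finally:
    os.remove(path)

print(lines)
```

Note that the source above stays pure ASCII, so it needs no encoding declaration even though the data does not.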


This code however does contain non-ASCII characters directly in its source code:

TEST_DATA = 'Bär'    # <--- non-ASCII character on this line
print TEST_DATA

And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:

SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

So assuming your text editor is configured to save files in utf-8, you'd need to put the line

# -*- coding: utf-8 -*-

at the top of the file for Python to interpret the source code correctly.

My advice, however, would be to generally avoid putting non-ASCII characters in your source code, precisely because it then depends on your and your co-workers' editor and terminal settings whether it will be written and read correctly.

Instead you can use escaped strings to safely enter non-ASCII characters in your code:

TEST_DATA = 'B\xc3\xa4r'
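As a quick check that the escaped form carries the same text, here is a sketch that runs on Python 3 (and on Python 2.6+, thanks to the explicit `b` and `u` prefixes). The variable names other than TEST_DATA's value are made up for illustration.

```python
# ASCII-only source: the UTF-8 bytes of "Bär", written with escapes.
TEST_DATA_BYTES = b'B\xc3\xa4r'

# Decoding the bytes as UTF-8 gives the three-character text.
TEST_DATA_TEXT = TEST_DATA_BYTES.decode('utf-8')

# The same text written directly as a unicode literal with a code point escape.
SAME_TEXT = u'B\u00e4r'

print(len(TEST_DATA_BYTES), len(TEST_DATA_TEXT), TEST_DATA_TEXT == SAME_TEXT)
```

The byte string has four bytes while the decoded text has three characters, because the umlaut takes two bytes in UTF-8.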

Defining unicode variables in Python

No, Python 2 only supports ASCII names. From the language reference:

identifier ::=  (letter|"_") (letter | digit | "_")*
letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"
digit      ::=  "0"..."9"

Compare that to the much longer Python 3 version, which does have full Unicode names.
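The difference can be sketched in a few lines of Python 3. The regular expression below is my own translation of the Python 2 grammar quoted above (it ignores keywords), and `is_py2_identifier` is a hypothetical helper; `str.isidentifier()` is the built-in Python 3 check.

```python
import re

# ASCII-only rule from the Python 2 grammar: (letter|"_") (letter | digit | "_")*
PY2_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def is_py2_identifier(name):
    """Return True if name fits the Python 2 identifier grammar (hypothetical helper)."""
    return PY2_IDENTIFIER.match(name) is not None


# "bär" is written with an escape so this source file stays pure ASCII.
for name in ["bar", "_bar2", "2bar", "b\u00e4r"]:
    print(repr(name), is_py2_identifier(name), name.isidentifier())
```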

The practical problem the PEPs solve is that before, if a byte over 127 appeared in a source file (say inside a unicode string), Python had no way of knowing which character was meant, as the file could have been in any encoding. Now it's interpreted as UTF-8 by default (in Python 3), and that can be changed by adding such a header.


