Any Gotchas Using Unicode_Literals in Python 2.6

Any gotchas using unicode_literals in Python 2.6?

The main source of problems I've had working with unicode strings is mixing UTF-8 encoded byte strings with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

The output of running python one.py is:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

In this example, two.name is a UTF-8 encoded byte string (not unicode), since two.py did not import unicode_literals, while name in one.py is a unicode string. When you mix the two, Python tries to decode the byte string (assuming it's ASCII), convert it to unicode, and fails. It would work if you did print name + two.name.decode('utf-8').
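Putting that fix into one.py, for instance:

# encoding: utf-8
from __future__ import unicode_literals
import two

name = 'helló wörld from one'
# two.name is a UTF-8 byte string; decode it explicitly before mixing:
print name + two.name.decode('utf-8')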

The same thing can happen if you encode a string and try to mix them later.
For example, this works:

# encoding: utf-8
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

DEBUG: <html><body>helló wörld</body></html>

But after adding from __future__ import unicode_literals it does NOT:

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
print 'DEBUG: %s' % html

Output:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

It fails because 'DEBUG: %s' is now a unicode string, and therefore Python tries to decode html. A couple of ways to fix the print are either doing print str('DEBUG: %s') % html or print 'DEBUG: %s' % html.decode('utf-8').
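Here is the failing script with both workarounds spelled out (a sketch; in practice you'd pick one of the two prints):

# encoding: utf-8
from __future__ import unicode_literals
html = '<html><body>helló wörld</body></html>'
if isinstance(html, unicode):
    html = html.encode('utf-8')
# Option 1: keep the format string as bytes, so no implicit decode happens:
print str('DEBUG: %s') % html
# Option 2: bring html back to unicode before formatting:
print 'DEBUG: %s' % html.decode('utf-8')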

I hope this helps you understand the potential gotchas when using unicode strings.

unicode_literals and type()

six.b is written under the assumption that you won't use unicode_literals (and that you'll pass a string literal to it, as the documentation states), so the Python 2 implementation is simply def b(s): return s, since a Python 2 string literal is already a byte string.

Either don't use unicode_literals in this module, or use (as a comment suggests) str(name). In Python 3, that is a no-op. In Python 2, it silently converts the unicode string to a byte string using the default encoding (ASCII, unless someone has changed it), so ASCII-only names are fine.
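To see why this matters, here is a sketch of what the interaction looks like (the 'payload' value is hypothetical):

# six.b on Python 2 is effectively an identity function:
#     def b(s):
#         return s
from __future__ import unicode_literals
import six

data = six.b('payload')   # unicode on Python 2 (not bytes!), bytes on Python 3
name = str('MyClass')     # byte string on Python 2, a no-op on Python 3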

What's the preferred way to include unicode in python source files?

I think the most common way I've used (in Python 2) is:

# coding: utf-8

text = u'résumé'
  • The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
  • If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object.

from __future__ import unicode_literals changes the parsing mode of the program; I think you'd need to be more aware of the difference between text & binary data. (Something that, if you ask me, most programmers are not good at.)

In large projects, it might be confusing to have the parsing mode change for just one file, so it's probably better as an all-files-or-no-files decision, so you don't need to check each file's header. If you're on Python 2, the default is probably off unless you're also targeting Python 3. If you're targeting Python 2.5 or older¹, then it's not an option.

Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, but exceedingly rarely; if the author of such a commit doesn't review his code adequately, code review should catch this. (The diff will be painfully obvious.) It is not worth supporting these people: Unicode is here to stay; track them down and fix their setup. Of note, vim handles Unicode just fine.

¹You should upgrade.

Tracking down implicit unicode conversions in Python 2

You can register a custom encoding which prints a message whenever it's used:

Code in ourencoding.py:

import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:

def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:

originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')

If you import this file at the start of your script, you'll see warnings for implicit conversions:

#!python2
# coding: utf-8

import ourencoding

print("test 1")
a = "hello " + u"world"

print("test 2")
a = "hello ☺ " + u"world"

print("test 3")
b = u" ".join(["hello", u"☺"])

print("test 4")
c = unicode("hello ☺")

Output:

test 1
test 2
Default encoding used
  File "test.py", line 10, in <module>
    a = "hello ☺ " + u"world"
test 3
Default encoding used
  File "test.py", line 13, in <module>
    b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
  File "test.py", line 16, in <module>
    c = unicode("hello ☺")

It's not perfect, as test 1 shows: if the converted string contains only ASCII characters, you sometimes won't see a warning.

How do I get compatible type() behaviour in python 2 & 3 with unicode_literals?

Don't use builtins.str(), use the plain str that comes with your Python version:

>>> from __future__ import unicode_literals
>>> type(str('MyClass'), (object,), {})
<class '__main__.MyClass'>

This works in both Python 2 and 3. If the future.builtins module has replaced the built-in str in your namespace, import the original explicitly from the __builtin__ module:

try:
    # Python 2
    from __builtin__ import str as builtin_str
except ImportError:
    # Python 3
    from builtins import str as builtin_str

MyClass = type(builtin_str('MyClass'), (object,), {})

What __future__ features should I import in Python v2.6.2?

Well, even if there wasn't documentation, __future__ is also a regular module that has some info about itself:

>>> import __future__
>>> __future__.all_feature_names
['nested_scopes', 'generators', 'division', 'absolute_import', 'with_statement', 'print_function', 'unicode_literals']
>>> __future__.unicode_literals
_Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 131072)
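Each _Feature also records when it became optional and when it becomes mandatory, which you can query directly:

>>> __future__.unicode_literals.getOptionalRelease()
(2, 6, 0, 'alpha', 2)
>>> __future__.unicode_literals.getMandatoryRelease()
(3, 0, 0, 'alpha', 0)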

Python 2.6 has most of the features already enabled, so choose from division, print_function, absolute_import and unicode_literals.

And no, import __future__ won't work as you think. It's only magic when you use the from __future__ import something form as the first statement in the file. See the docs for more.
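For example, a correct header for a 2.6 file that wants all the optional behaviour looks like this (the encoding comment doesn't count as a statement, so it may come first):

# encoding: utf-8
from __future__ import division, print_function, absolute_import, unicode_literals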

Of course, no matter how much you import from __future__, you will get different behavior in 3.x.

Use unicode strings instead of regular strings? (Python 2.7)

But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

There are two types of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).

bytestring = b'abc'
unicode_text = u'abc'

The type of string created by the 'abc' literal depends on the Python version and on the presence of the from __future__ import unicode_literals import. Without the import, on Python 2 the 'abc' literal creates a bytestring; otherwise it creates a Unicode string.
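You can watch the literal change type in an interactive Python 2 session (future imports take effect for subsequent input):

>>> type('abc')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('abc')
<type 'unicode'>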

Add the encoding declaration at the top of your Python source file if you use non-ASCII characters in string literals, e.g.: # -*- coding: utf-8 -*-.

So when I get a text input, I don't need to use unicode()?

If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).

And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
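A minimal sketch of that decode-early, encode-late pattern (the utf-8 encoding and the sock object are assumptions for illustration):

unicode_text = bytestring.decode('utf-8')      # decode as soon as the bytes arrive
# ... all processing inside the program works on unicode_text ...
sock.sendall(unicode_text.encode('utf-8'))     # encode only at the output boundary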

Use bytestrings to work with binary data: an image, compressed content, etc. Use Unicode strings to work with text in Python.

To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
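For example (the filename and encoding here are assumptions):

import io

with io.open('data.txt', encoding='utf-8') as f:
    text = f.read()    # unicode on both Python 2 and 3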

What character encoding to use when you receive your Unicode text via the network may depend on the corresponding protocol; e.g., the charset can be specified in the Content-Type HTTP header:

    text = data.decode(response.headers.getparam('charset'))

You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using the subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details in this answer on bytes when printing the output of an external command).
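For instance, on Python 3, universal_newlines=True makes subprocess decode the output for you using the locale's preferred encoding (a sketch; the ls command is just an example):

from subprocess import check_output

out = check_output(['ls'], universal_newlines=True)   # text, not bytes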


