Python Str VS Unicode Types

Python str vs unicode types

unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1...).

Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.

On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte

Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.

Python2: what do str.encode() and unicode.decode() do?

TL;DR Using unicode.decode and str.encode means you aren't using the right types to represent your data. The methods on the equivalent types in Python 3 don't even exist.


A unicode value is a single Unicode code point: an integer interpreted as a particular character. A str, on the other hand, is a sequence of bytes.

For example, à is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.

The unicode.encode method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of each code point as a single byte string.

>>> ua.encode('utf-8')
'\xc3\xa0'

str.decode takes a byte string and attempts to produce the equivalent Unicode value.

>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'

(u'\xe0' is equivalent to u'à').


As for your errors: Python 2 doesn't enforce a strict separation between how unicode and str are used. It doesn't really make sense to encode a str if it is already an encoded value, and it doesn't make sense to decode a unicode value because it's not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that in Python 3, there are two types: bytes is a string of bytes (corresponding to Python 2 str), and str is a Unicode string (corresponding to Python 2 unicode). The "nonsensical" methods don't even exist in Python 3:

>>> bytes.encode
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'

So your attempts that raised Unicode*Error exceptions before now would simply raise an AttributeError.

If you are stuck supporting Python 2, just follow these rules:

  • unicode is for text
  • str is for binary data
  • unicode.encode produces a str value
  • str.decode produces a unicode value
  • If you find yourself trying to call str.encode, you are using the wrong type.
  • If you find yourself trying to call unicode.decode, you are using the wrong type.

String literal Vs Unicode literal Vs unicode type object - Memory representation

Which encoding technique used to represent in memory? utf-8?

You can try the following:

ThisisNotUnicodeString.decode('utf-8')

If you get a result, it's UTF-8, otherwise it's not.

If you want to get the UTF-16 representation of the string, you should first decode it, and then encode with UTF-16 scheme:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16')

So basically, you can decode and encode the given string from/to UTF-8/UTF-16, because all characters can be represented in both schemes.

ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')

How do I check if a string is unicode or ascii?

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
if isinstance(s, str):
print "ordinary string"
elif isinstance(s, unicode):
print "unicode string"
else:
print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.

Python returns string is both str and unicode type

In IronPython, the str type and the unicode type are the same object. The .NET string type is unicode.

What is a unicode string?

Update: Python 3

In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).

Here is the code from the question, updated for Python 3:

>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True

Working with files:

>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
>>> # here line is a str object

>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
>>> # here line is a bytes object

Historical answer: Python 2

In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.

UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = unistring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.

byte string vs. unicode string. Python

No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.

A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings use more than one byte per character to represent many characters.

bytes objects give you access to the underlying bytes. str objects have the encode method that takes a string representing an encoding and returns the bytes object that represents the string in that encoding. bytes objects have the decode method that takes a string representing an encoding and returns the str that results from interpreting the byte as a string encoded in the the given encoding.

For example:

>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'

We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.

Related reading:

  • Python Unicode Howto (from the official documentation)

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder

Unicode vs UTF-8 confusion in Python / Django?

what is a "Unicode string" in Python? Does that mean UCS-2?

Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

How can I compare a unicode type to a string in python?

You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']

Note that in your first example, us is not a UTF-8 string; it is unicode data, the json library has already decoded it for you. A UTF-8 string on the other hand, is a sequence encoded bytes. You may want to read up on Unicode and Python to understand the difference:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

On Python 2, your expectation that your test returns True would be correct, you are doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]


Related Topics



Leave a reply



Submit