Python str vs unicode types
`unicode` is meant to handle text. Text is a sequence of code points, which may be bigger than a single byte. Text can be encoded in a specific encoding to represent it as raw bytes (e.g. `utf-8`, `latin-1`, ...).

Note that `unicode` is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.

On the contrary, `str` in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of `unicode` as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via `str`.

Note: In Python 3, `unicode` was renamed to `str` and there is a new `bytes` type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
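The same distinctions carry over to Python 3, where the text type is `str` and encoding always yields `bytes`. A minimal sketch (using the `\xe0` escape for à):

```python
# 'à' is code point U+00E0: one code point, regardless of encoding
s = '\xe0'
assert len(s) == 1                    # a single code point
assert len(s.encode('utf-8')) == 2    # UTF-8 needs two bytes for it
assert len(s.encode('latin-1')) == 1  # latin-1 needs only one byte
```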
Note that using `str` you have lower-level control over the single bytes of a specific encoding representation, while using `unicode` you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8 isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid Unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
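The same mangling can be reproduced in Python 3, where the byte/text split is explicit. A sketch using the same string:

```python
raw = 'àèìòù'.encode('utf-8')       # b'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
broken = raw.replace(b'\xa8', b'')  # strips the continuation byte of 'è'
try:
    broken.decode('utf-8')          # no longer valid UTF-8
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
# At the text level you can only remove whole code points, so the
# result is always valid text:
cleaned = 'àèìòù'.replace('è', '')
```

Here `still_valid` ends up `False`: removing a single byte left a dangling lead byte, which byte-level operations happily allow but text-level operations cannot produce.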
Python2: what do str.encode() and unicode.decode() do?
TL;DR: Using `unicode.decode` and `str.encode` means you aren't using the right types to represent your data. These methods don't even exist on the equivalent types in Python 3.

A `unicode` value is a sequence of Unicode code points: integers interpreted as particular characters. A `str`, on the other hand, is a sequence of bytes.

For example, à is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.

The `unicode.encode` method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of those code points as a single byte string:
>>> ua = u'à'
>>> ua.encode('utf-8')
'\xc3\xa0'
`str.decode` takes a byte string and attempts to produce the equivalent Unicode value:

>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'

(`u'\xe0'` is equivalent to `u'à'`.)
As for your errors: Python 2 doesn't enforce a strict separation between how `unicode` and `str` are used. It doesn't really make sense to encode a `str` if it is already an encoded value, and it doesn't make sense to decode a `unicode` value because it's not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that in Python 3 there are two types: `bytes` is a string of bytes (corresponding to Python 2's `str`), and `str` is a Unicode string (corresponding to Python 2's `unicode`). The "nonsensical" methods don't even exist in Python 3:
>>> bytes.encode
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'
So your attempts that previously raised `Unicode*Error` exceptions would now simply raise an `AttributeError`.
If you are stuck supporting Python 2, just follow these rules:

- `unicode` is for text
- `str` is for binary data
- `unicode.encode` produces a `str` value
- `str.decode` produces a `unicode` value
- If you find yourself trying to call `str.encode`, you are using the wrong type.
- If you find yourself trying to call `unicode.decode`, you are using the wrong type.
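If you do have to accept both types at an API boundary, these rules can be wrapped in small conversion helpers. A minimal sketch (the helper names `to_text` and `to_bytes` are my own, written against Python 3's `str`/`bytes` pair):

```python
def to_text(value, encoding='utf-8'):
    # Bytes are decoded; text passes through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    # Text is encoded; bytes pass through unchanged.
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)
```

Calling `to_text` on text, or `to_bytes` on bytes, is a no-op, so either can be applied defensively without worrying about double conversion.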
String literal Vs Unicode literal Vs unicode type object - Memory representation
Which encoding technique is used to represent the string in memory? UTF-8?

You can try the following:

ThisisNotUnicodeString.decode('utf-8')

If you get a result, it's UTF-8; otherwise it's not. If you want to get the UTF-16 representation of the string, you should first decode it, and then encode it with the UTF-16 scheme:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16')

So basically you can decode and encode the given string from/to UTF-8/UTF-16, because all characters can be represented in both schemes:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
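In Python 3 the same round trip reads as follows; a sketch with a made-up byte string standing in for the input:

```python
data = 'héllo'.encode('utf-8')  # raw UTF-8 bytes standing in for the input
text = data.decode('utf-8')     # succeeds, so the bytes were valid UTF-8
utf16 = text.encode('utf-16')   # re-encode the text as UTF-16
round_trip = utf16.decode('utf-16').encode('utf-8')
assert round_trip == data       # the full cycle is lossless
```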
How do I check if a string is unicode or ascii?
In Python 3, all strings are sequences of Unicode characters. There is a `bytes` type that holds raw bytes.

In Python 2, a string may be of type `str` or of type `unicode`. You can tell which using code like this:
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
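A Python 3 analogue of the same check might look like this (a sketch; it returns the label instead of printing it, and tests `bytes` rather than `unicode`):

```python
def whatisthis(s):
    # In Python 3 the two string types are bytes and str.
    if isinstance(s, bytes):
        return "byte string"
    elif isinstance(s, str):
        return "unicode string"
    else:
        return "not a string"
```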
Python returns string is both str and unicode type
In IronPython, the `str` type and the `unicode` type are the same object. The .NET string type is Unicode.
What is a unicode string?
Update: Python 3
In Python 3, Unicode strings are the default. The type `str` is a collection of Unicode code points, and the type `bytes` is used for representing collections of 8-bit integers (often interpreted as ASCII characters).
Here is the code from the question, updated for Python 3:
>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True
Working with files:
>>> f = open('foo.txt', 'r')  # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     pass  # here line is a str object
>>> f = open('foo.txt', 'rb')  # "b" means binary mode (bytes)
>>> for line in f:
...     pass  # here line is a bytes object
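The two modes can be verified with a small, self-contained round trip (the file name and contents here are made up for illustration):

```python
import os
import tempfile

# Write some text, then read it back in both text and binary mode.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('héllo\n')
with open(path, 'r', encoding='utf-8') as f:  # text mode -> str lines
    line_text = f.readline()
with open(path, 'rb') as f:                   # binary mode -> bytes lines
    line_bytes = f.readline()
```

Text mode decodes for you; binary mode hands back the raw encoded bytes, so `line_bytes` equals `line_text.encode('utf-8')`.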
Historical answer: Python 2
In Python 2, the `str` type was a collection of 8-bit characters (like Python 3's `bytes` type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.
Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.
To allow working with Unicode characters, Python 2 has a `unicode` type which is a collection of Unicode code points (like Python 3's `str` type). The line `ustring = u'A unicode \u018e string \xf1'` creates a Unicode string with 20 characters.

When the Python interpreter displays the value of `ustring`, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.
The line `s = ustring.encode('utf-8')` encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a `str`. The size of `s` is 22 bytes, because two of the characters have high code points and are encoded as two-byte sequences rather than single bytes.
When the Python interpreter displays the value of `s`, it escapes four bytes that are not in the printable range (`\xc6`, `\x8e`, `\xc3`, and `\xb1`). The two pairs of bytes are not treated as single characters as before, because `s` is of type `str`, not `unicode`.
The line `t = unicode(s, 'utf-8')` does the opposite of `encode()`. It reconstructs the original code points by looking at the bytes of `s` and parsing byte sequences. The result is a Unicode string.
The call to `codecs.open()` specifies `utf-8` as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
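The character and byte counts described above can be checked directly in Python 3, where `unicode(s, 'utf-8')` becomes `s.decode('utf-8')`. A sketch reusing the example's names:

```python
ustring = 'A unicode \u018e string \xf1'  # 20 code points
s = ustring.encode('utf-8')    # 22 bytes: Ǝ and ñ each take two bytes
t = s.decode('utf-8')          # the Python 3 spelling of unicode(s, 'utf-8')
assert len(ustring) == 20
assert len(s) == 22
assert t == ustring            # the round trip is lossless
```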
byte string vs. unicode string. Python
No, Python does not use its own encoding; it will use any encoding that it has access to and that you specify.

A character in a `str` represents one Unicode character. However, since Unicode defines far more than 256 characters, encodings must use more than one byte per character to represent many of them.

`bytes` objects give you access to the underlying bytes. `str` objects have the `encode` method, which takes a string naming an encoding and returns the `bytes` object that represents the string in that encoding. `bytes` objects have the `decode` method, which takes a string naming an encoding and returns the `str` that results from interpreting the bytes as text in the given encoding.
For example:
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 uses four bytes, `\xce`, `\xb1`, `\xce`, and `\xac`, to represent two characters.
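Decoding with the wrong codec is a closely related pitfall: a lenient single-byte codec such as latin-1 never fails, so it silently produces mojibake instead of an error. A small sketch:

```python
data = 'αά'.encode('utf-8')        # b'\xce\xb1\xce\xac'
mojibake = data.decode('latin-1')  # wrong codec: every byte "succeeds"
assert mojibake == 'Î±Î¬'          # four junk characters, one per byte
assert data.decode('utf-8') == 'αά'
```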
Related reading:
Python Unicode Howto (from the official documentation)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Unicode vs UTF-8 confusion in Python / Django?
what is a "Unicode string" in Python? Does that mean UCS-2?
Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16, whilst many Linux distributions set UTF-32 ("wide mode") for their versions of Python. (Note: since Python 3.3, PEP 393 replaced this compile-time choice with a flexible per-string representation.)

You are generally not supposed to care: you will see Unicode code points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane, you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
plain wrong, or is it?
Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).
There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
How can I compare a unicode type to a string in python?
You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary, there is no need to call `.keys()` first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use `u"number1"` to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, `us` is not a UTF-8 string; it is unicode data, which the `json` library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
On Python 2, your expectation that your test returns `True` would be correct; you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]