What's the U Prefix in a Python String

What's the u prefix in a Python string?

You're right, see 3.1.3. Unicode Strings.

It's been the syntax since Python 2.0.

Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2, to aid the 2-to-3 transition.
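A quick way to confirm this on Python 3.3 or later is to compare a prefixed and an unprefixed literal. This is only a minimal sketch; the variable names are made up for illustration.

```python
# Python 3.3+: the u prefix is accepted but changes nothing.
plain = 'foo'
prefixed = u'foo'

same_value = plain == prefixed               # both hold the same text
same_type = type(plain) is type(prefixed)    # both are str

print(same_value, same_type, type(prefixed).__name__)
```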

What does the 'u' symbol mean in front of string values?

The 'u' in front of the string values means the string is a Unicode string. Unicode is a way to represent more characters than normal ASCII can manage. The fact that you're seeing the u means you're on Python 2 - strings are Unicode by default on Python 3, but on Python 2, the u in front distinguishes Unicode strings. The rest of this answer will focus on Python 2.

You can create a Unicode string multiple ways:

>>> u'foo'
u'foo'
>>> unicode('foo') # Python 2 only
u'foo'

But the real reason is to represent something like this (translation here):

>>> val = u'Ознакомьтесь с документацией'
>>> val
u'\u041e\u0437\u043d\u0430\u043a\u043e\u043c\u044c\u0442\u0435\u0441\u044c \u0441 \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0446\u0438\u0435\u0439'
>>> print val
Ознакомьтесь с документацией
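For comparison, here is a rough Python 3 counterpart of the session above. On Python 3 the interactive prompt shows the characters themselves, and the built-in ascii() gives the escaped form that Python 2 displayed. Treat it as a sketch; the names escaped and code_points are only for illustration.

```python
# Python 3: str is Unicode, so no prefix is needed.
val = 'Ознакомьтесь с документацией'

# ascii() escapes non-ASCII characters, similar to what Python 2's repr() showed.
escaped = ascii(val)

# Each character is a code point; these are the first three.
code_points = [hex(ord(ch)) for ch in val[:3]]

print(escaped)
print(code_points)
```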

For the most part, Unicode and non-Unicode strings are interoperable on Python 2.

There are other symbols you will see, such as the "raw" symbol r for telling a string not to interpret backslashes. This is extremely useful for writing regular expressions.

>>> 'foo\"'
'foo"'
>>> r'foo\"'
'foo\\"'

Unicode and non-Unicode strings can be equal on Python 2:

>>> bird1 = unicode('unladen swallow')
>>> bird2 = 'unladen swallow'
>>> bird1 == bird2
True

but not on Python 3:

>>> x = u'asdf' # Python 3
>>> y = b'asdf' # b indicates bytestring
>>> x == y
False
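On Python 3 you have to convert one side explicitly before the comparison can succeed. A minimal sketch, assuming the bytes are UTF-8 (the codec choice is an assumption for the example):

```python
x = 'asdf'     # str (text)
y = b'asdf'    # bytes

direct = x == y                       # str and bytes never compare equal
as_text = x == y.decode('utf-8')      # compare text with text
as_bytes = x.encode('utf-8') == y     # compare bytes with bytes

print(direct, as_text, as_bytes)
```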

What exactly do u and r string prefixes do, and what are raw string literals?

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
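To make the difference concrete, here is a small sketch in Python 3 (the behavior of the r prefix is the same there). The pattern and the sample text are invented for the example.

```python
import re

newline = '\n'       # normal literal: one character, a line feed
raw_pair = r'\n'     # raw literal: two characters, a backslash and an n

# The same regular expression written both ways.
doubled = '\\d+\\.\\d+'
raw = r'\d+\.\d+'

match = re.search(raw, 'version 3.12 released')

# The "except" case: a backslash before a quote keeps the literal open,
# and the backslash itself stays in the string.
quoted = r'foo\"'

print(len(newline), len(raw_pair), doubled == raw, match.group(), len(quoted))
```

Both spellings of the pattern produce the same string object contents; the raw form is just easier to read.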

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back" - there are no intrinsic backward and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
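On Python 3 the closest comparison is between bytes and str. The sketch below assumes CPython; the exact sizes depend on the interpreter version and platform, so it only compares the two rather than claiming specific numbers.

```python
import sys

byte_size = sys.getsizeof(b'ciao')   # bytes object
text_size = sys.getsizeof('ciao')    # str object (Unicode)

# Exact numbers vary between CPython versions and platforms,
# so only the relative order is of interest here.
print(byte_size, text_size, text_size > byte_size)
```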

rstring bstring ustring Python 2 / 3 comparison

From the python docs for literals: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.

Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.

and

A string literal with 'f' or 'F' in its prefix is a formatted string
literal; see Formatted string literals. The 'f' may be combined with
'r', but not with 'b' or 'u', therefore raw formatted strings are
possible, but formatted bytes literals are not.

So:

  • r means raw
  • b means bytes
  • u means unicode
  • f means format
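The four prefixes side by side, as a minimal Python 3.6+ sketch (the variable names and sample values are only for illustration):

```python
name = 'world'

raw = r'a\tb'             # r: the backslash is kept as a literal character
data = b'abc'             # b: produces bytes instead of str
text = u'abc'             # u: same as 'abc' on Python 3.3+
greeting = f'hi {name}'   # f: expressions in braces are evaluated
raw_fmt = rf'{name}\n'    # r and f combined: formatted, backslash kept

print(type(raw).__name__, type(data).__name__, type(text).__name__)
print(greeting, len(raw), len(raw_fmt))
```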

The r and b prefixes were already available in Python 2, as they are in many other languages (they are very handy sometimes).

Since string literals were not Unicode in Python 2, u-strings were created to offer support for internationalization. As of Python 3, u-strings are the default strings, so "..." is semantically the same as u"...".

Finally, of these, the f-string is the only one that isn't supported in Python 2.

What is the difference between u' ' prefix and unicode() in python?

  • u'..' is a string literal, and decodes the characters according to the source encoding declaration.

  • unicode() is a function that converts another type to a unicode object; you've given it a byte string literal, which it decodes according to the default ASCII codec.

So you created a byte string object using a different type of literal notation, then tried to convert it to a unicode() object, which fails because the default codec for str -> unicode conversions is ASCII.

The two are quite different beasts. If you want to use the latter, you need to give it an explicit codec:

print unicode('上午', 'utf8')

The two are related in the same way that using 0xFF and int('0xFF', 0) are related; the former defines an integer of value 255 using hex notation, the latter uses the int() function to extract an integer from a string.

An alternative method would be to use the str.decode() method:

print '上午'.decode('utf8')
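The same distinction carries over to Python 3, where the conversion is done with bytes.decode() or str(data, codec). A minimal sketch, assuming the bytes are UTF-8 encoded; the variable names are made up for the example.

```python
text = '上午'                        # literal: already a str
data = text.encode('utf-8')         # bytes, as they might arrive from a file

decoded = data.decode('utf-8')      # explicit codec: works
also_decoded = str(data, 'utf-8')   # same conversion via the str() constructor

try:
    data.decode('ascii')            # wrong codec: these bytes are not ASCII
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Literal versus conversion function, as with integers.
same_int = 0xFF == int('0xFF', 0) == 255

print(decoded == text, also_decoded == text, ascii_failed, same_int)
```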

Don't be tempted to use an error handler (such as 'ignore' or 'replace') unless you know what you are doing. 'ignore' especially can mask underlying issues with having picked the wrong codec, for example.
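Here is what that masking looks like in practice, as a Python 3 sketch. The ASCII codec is deliberately the wrong choice for this data; the names are only for illustration.

```python
data = '上午'.encode('utf-8')    # six bytes, none of them valid ASCII

# Wrong codec plus a lenient handler: no exception, but the text is lost.
ignored = data.decode('ascii', errors='ignore')
replaced = data.decode('ascii', errors='replace')

# Right codec: nothing to handle.
correct = data.decode('utf-8')

print(repr(ignored), ascii(replaced), len(correct))
```

With 'ignore' every byte is silently dropped, and with 'replace' you get replacement characters instead of the original text, so neither tells you that the codec was wrong.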

You may want to read up on Python and Unicode:

  • Pragmatic Unicode by Ned Batchelder

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

What does 'u' before a string in python mean?

The letter in front of a Python string literal is called a string prefix. It is not the same thing as a source encoding declaration, which you can read about here: https://docs.python.org/3.3/reference/lexical_analysis.html#encoding-declarations

The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.

Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html

Then to explore the nature of this, run this command:

type(u'abc')

On Python 2 this returns:

<type 'unicode'>

If you use Unicode when you should be using ascii, or ascii when you should be using unicode you are very likely going to encounter bubbleup implementationitis when users start "exploring the unicode space" at runtime.

For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request, and then later on return undefined results as if the function succeeded as it pukes all over the carpet.

As defined in this post:

FQL multiquery from python fails with unicode query

Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.

Python string prints as [u'String']

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don't know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.

soup[0].encode("latin-1")

soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

soup[0].encode(soup.originalEncoding)
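The brackets and quotes in the output come from printing the list rather than its element. A minimal Python 3 sketch, with a plain list standing in for whatever Beautiful Soup returned (the items name is hypothetical and no Beautiful Soup call is involved):

```python
items = ['ABC']                      # stand-in for a one-element result list

shown_as_list = str(items)           # list display uses repr() of each element
element = items[0]                   # the string itself

# encode('ascii') would raise UnicodeEncodeError if the text were not pure ASCII.
as_ascii = element.encode('ascii')
as_utf8 = element.encode('utf-8')    # works for any text

print(shown_as_list, element, as_ascii, as_utf8)
```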

Suppress the u'prefix indicating unicode' in python strings

You could use Python 3.0. The default string type is unicode, so the u'' prefix is no longer required.

In short, no. You cannot turn this off.

The u comes from the unicode.__repr__ method, which is used to display stuff in REPL:

>>> print repr(unicode('a'))
u'a'
>>> unicode('a')
u'a'

If I'm not mistaken, you cannot override this without recompiling Python.

The simplest way around this is to print the string:

>>> print unicode('a')
a
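The same split between repr() and print exists on Python 3, minus the u. A short sketch (the names are only for illustration) that also shows how to avoid the quoted form when the strings live in a list:

```python
s = 'a'

shown_by_repl = repr(s)      # what the interactive prompt echoes, quotes included
shown_by_print = str(s)      # what print() writes

names = ['a', 'b']
joined = ', '.join(names)    # format the elements yourself instead of printing the list

print(shown_by_repl, shown_by_print, joined)
```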

If you use the unicode() builtin to construct all your strings, you could do something like..

>>> class unicode(unicode):
...     def __repr__(self):
...         return __builtins__.unicode.__repr__(self).lstrip("u")
...
>>> unicode('a')
'a'

..but don't do that, it's horrible


