What Does the 'U' Symbol Mean in Front of String Values

What does the 'u' symbol mean in front of string values?

The 'u' in front of the string values means the string is a Unicode string. Unicode is a way to represent more characters than normal ASCII can manage. The fact that you're seeing the u means you're on Python 2 - strings are Unicode by default on Python 3, but on Python 2, the u in front distinguishes Unicode strings. The rest of this answer will focus on Python 2.

You can create a Unicode string multiple ways:

>>> u'foo'
u'foo'
>>> unicode('foo') # Python 2 only
u'foo'

But the real reason is to represent something like this (translation here):

>>> val = u'Ознакомьтесь с документацией'
>>> val
u'\u041e\u0437\u043d\u0430\u043a\u043e\u043c\u044c\u0442\u0435\u0441\u044c \u0441 \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0446\u0438\u0435\u0439'
>>> print val
Ознакомьтесь с документацией

For the most part, Unicode and non-Unicode strings are interoperable on Python 2.

There are other symbols you will see, such as the "raw" symbol r for telling a string not to interpret backslashes. This is extremely useful for writing regular expressions.

>>> 'foo\"'
'foo"'
>>> r'foo\"'
'foo\\"'

Unicode and non-Unicode strings can be equal on Python 2:

>>> bird1 = unicode('unladen swallow')
>>> bird2 = 'unladen swallow'
>>> bird1 == bird2
True

but not on Python 3:

>>> x = u'asdf' # Python 3
>>> y = b'asdf' # b indicates bytestring
>>> x == y
False

What's the u prefix in a Python string?

You're right, see 3.1.3. Unicode Strings.

It's been the syntax since Python 2.0.

Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2 to aide the 2 to 3 transition.

What does 'u' before a string in python mean?

What does 'u' before a string in python mean?

The character prefixing a python string is called a python String Encoding declaration. You can read all about them here: https://docs.python.org/3.3/reference/lexical_analysis.html#encoding-declarations

The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.

Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html

Then to explore the nature of this, run this command:

type(u'abc')

returns:

<type 'unicode'>

If you use Unicode when you should be using ascii, or ascii when you should be using unicode you are very likely going to encounter bubbleup implementationitis when users start "exploring the unicode space" at runtime.

For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request, and then later on return undefined results as if the function succeeded as it pukes all over the carpet.

As defined in this post:

FQL multiquery from python fails with unicode query

Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.

Python string prints as [u'String']

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don't know exaxtly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your test is really only ASCII you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.

 soup[0].encode("latin-1")

soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

 soup[0].encode(soup.originalEncoding)

What does a beginning 'u do when defining a string in python?

isdecimal is a method on Unicode strings, but not byte strings. u'xxx' defines a Unicode string.

What exactly do u and r string prefixes do, and what are raw string literals?

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

Removing u in list

That 'u' is part of the external representation of the string, meaning it's a Unicode string as opposed to a byte string. It's not in the string, it's part of the type.

As an example, you can create a new Unicode string literal by using the same synax. For instance:

>>> sandwich = u"smörgås"
>>> sandwich
u'sm\xf6rg\xe5s'

This creates a new Unicode string whose value is the Swedish word for sandwich. You can see that the non-English characters are represented by their Unicode code points, ö is \xf6 and å is \xe5. The 'u' prefix appears just like in your example to signify that this string holds Unicode text.

To get rid of those, you need to encode the Unicode string into some byte-oriented representation, such as UTF-8. You can do that with e.g.:

>>> sandwich.encode("utf-8")
'sm\xc3\xb6rg\xc3\xa5s'

Here, we get a new string without the prefix 'u', since this is a byte string. It contains the bytes representing the characters of the Unicode string, with the Swedish characters resulting in multiple bytes due to the wonders of the UTF-8 encoding.

What are those little u in my tuple? (python 2.7)

The u indicates that the string is a unicode object. See here for details

Converting list of strings with u'...' to a list of normal strings

Try proper encoding- But care this u does not have any effect on data- it is just an explicit representation of unicode object (not byte array), if your code needs back unicode then better to feed it unicode.

>>>d =  [u'homo', u'man', u'human being', u'human']
>>>print [i.encode('utf-8') for i in d]
>>>['homo', 'man', 'human being', 'human']


Related Topics



Leave a reply



Submit