What does the 'u' symbol mean in front of string values?
The 'u' in front of the string values means the string is a Unicode string. Unicode is a way to represent more characters than normal ASCII can manage. The fact that you're seeing the u means you're on Python 2 - strings are Unicode by default on Python 3, but on Python 2, the u in front distinguishes Unicode strings. The rest of this answer will focus on Python 2.
You can create a Unicode string in multiple ways:
>>> u'foo'
u'foo'
>>> unicode('foo') # Python 2 only
u'foo'
But the real reason Unicode strings exist is to represent something like this (Russian for "Read the documentation"):
>>> val = u'Ознакомьтесь с документацией'
>>> val
u'\u041e\u0437\u043d\u0430\u043a\u043e\u043c\u044c\u0442\u0435\u0441\u044c \u0441 \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0446\u0438\u0435\u0439'
>>> print val
Ознакомьтесь с документацией
For the most part, Unicode and non-Unicode strings are interoperable on Python 2.
There are other symbols you will see, such as the "raw" symbol r, which tells Python not to interpret backslashes in a string literal. This is extremely useful for writing regular expressions.
>>> 'foo\"'
'foo"'
>>> r'foo\"'
'foo\\"'
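To make the regular-expression point concrete, here is a minimal sketch (the pattern and the sample text are made up for illustration). It shows that a raw literal and a literal with doubled backslashes produce the same string, so the raw form is purely a convenience:

```python
import re

# Both literals describe the same pattern: a backslash, 'd', and '+'.
escaped = '\\d+'
raw = r'\d+'
same_pattern = (escaped == raw)

# The sample text below is invented for this example.
numbers = re.findall(raw, 'order 66, aisle 7')
print(same_pattern, numbers)
```

The raw form gets more attractive the more backslashes the pattern contains.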
Unicode and non-Unicode strings can be equal on Python 2:
>>> bird1 = unicode('unladen swallow')
>>> bird2 = 'unladen swallow'
>>> bird1 == bird2
True
but not on Python 3:
>>> x = u'asdf' # Python 3
>>> y = b'asdf' # b indicates bytestring
>>> x == y
False
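If you do need to compare text with bytes on Python 3, one option is to convert one side explicitly so both have the same type. A small sketch, assuming the bytes are ASCII-encoded (in real code you have to know the actual encoding):

```python
text = u'asdf'       # str on Python 3; the u prefix is accepted but redundant
raw_bytes = b'asdf'  # bytes

direct = (text == raw_bytes)                   # str vs bytes: never equal
decoded = (text == raw_bytes.decode('ascii'))  # str vs str
encoded = (text.encode('ascii') == raw_bytes)  # bytes vs bytes
print(direct, decoded, encoded)
```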
What's the u prefix in a Python string?
You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made the u prefix redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed it, but it was re-added in 3.3+ for compatibility with Python 2, to aid the 2 to 3 transition.
What does 'u' before a string in python mean?
The character prefixing a Python string literal is called a string prefix. You can read all about them here: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.
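As a quick check of what each prefix produces, here is a sketch that assumes Python 3.3 or later (on Python 2 the type names would come out differently, e.g. unicode for the u literal). The variable names are just for illustration:

```python
unicode_text = u'foobar'  # u prefix: ordinary text string
raw_text = r'foo\bar'     # r prefix: the backslash is kept literally
byte_data = b'foobar'     # b prefix: a sequence of bytes

type_names = [type(v).__name__ for v in (unicode_text, raw_text, byte_data)]
print(type_names)
print(len(raw_text))  # 7, because the backslash and the 'b' stay separate characters
```

Note that r does not create a different type; it only changes how the literal is read.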
Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html
Then to explore the nature of this, run this command:
type(u'abc')
returns:
<type 'unicode'>
If you use Unicode when you should be using ASCII, or ASCII when you should be using Unicode, you are very likely to run into bugs that only bubble up at runtime, once users start "exploring the unicode space".
For example, if you pass a unicode string to Facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request and then later return undefined results, as if the function had succeeded, while it pukes all over the carpet.
As defined in this post:
FQL multiquery from python fails with unicode query
Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.
Python string prints as [u'String']
[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.
I don't know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:
soup[0].encode("ascii")
However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.
soup[0].encode("latin-1")
soup[0].encode("utf-8")
Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:
soup[0].encode(soup.originalEncoding)
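To see why the choice of encoding matters, here is a small sketch. It uses a plain Unicode string as a stand-in for whatever Beautiful Soup returned, and try_encode is a hypothetical helper written only for this example. It is written for Python 3, where the results are bytes objects:

```python
text = u'caf\xe9'  # "café", standing in for the parsed text


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the text does not fit the encoding."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(text, 'ascii')     # fails: é is not an ASCII character
latin1_result = try_encode(text, 'latin-1')  # one byte for é
utf8_result = try_encode(text, 'utf-8')      # two bytes for é
print(ascii_result, latin1_result, utf8_result)
```

Encoding to ASCII only works when every character really is ASCII, which is why it pays to check the data first.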
What does a beginning 'u do when defining a string in python?
isdecimal is a method on Unicode strings, but not byte strings. u'xxx' defines a Unicode string.
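A quick way to check this yourself is sketched below. It assumes Python 3, where the byte string type is bytes; on Python 2 the plain str type plays that role:

```python
text = u'123'
data = b'123'

text_is_decimal = text.isdecimal()                 # available on Unicode strings
bytes_has_isdecimal = hasattr(data, 'isdecimal')   # not available on byte strings
bytes_is_digit = data.isdigit()                    # isdigit exists on both types
print(text_is_decimal, bytes_has_isdecimal, bytes_is_digit)
```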
What exactly do u and r string prefixes do, and what are raw string literals?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
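The backslash rules above can be checked with a short sketch. The literals are made up for illustration, and compile() is used only so that the syntax error can be caught instead of stopping the script:

```python
normal = 'a\tb'    # 3 characters: a, TAB, b
raw = r'a\tb'      # 4 characters: a, backslash, t, b
quoted = r'foo\"'  # the backslash stops the quote from ending the literal,
                   # but it still stays in the string: 5 characters

# A raw literal cannot end in a single backslash: the text r'foo\' is rejected.
try:
    compile("r'foo\\'", '<demo>', 'eval')
    trailing_backslash_ok = True
except SyntaxError:
    trailing_backslash_ok = False

print(len(normal), len(raw), len(quoted), trailing_backslash_ok)
```

That last case is the "imperfect" part for Windows paths: a directory path ending in a backslash cannot be written as a single raw literal.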
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
Removing u in list
That 'u' is part of the external representation of the string, meaning it's a Unicode string as opposed to a byte string. It's not in the string, it's part of the type.
As an example, you can create a new Unicode string literal by using the same syntax. For instance:
>>> sandwich = u"smörgås"
>>> sandwich
u'sm\xf6rg\xe5s'
This creates a new Unicode string whose value is the Swedish word for sandwich. You can see that the non-English characters are represented by their Unicode code points, ö is \xf6 and å is \xe5. The 'u' prefix appears just like in your example to signify that this string holds Unicode text.
To get rid of those, you need to encode the Unicode string into some byte-oriented representation, such as UTF-8. You can do that with e.g.:
>>> sandwich.encode("utf-8")
'sm\xc3\xb6rg\xc3\xa5s'
Here, we get a new string without the prefix 'u', since this is a byte string. It contains the bytes representing the characters of the Unicode string, with the Swedish characters resulting in multiple bytes due to the wonders of the UTF-8 encoding.
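The round trip can be sketched as follows. This version is written for Python 3, where the encoded result is shown with a b prefix instead of no prefix, but the bytes are the same:

```python
sandwich = u'sm\xf6rg\xe5s'         # "smörgås"

encoded = sandwich.encode('utf-8')  # text -> bytes
decoded = encoded.decode('utf-8')   # bytes -> text

# 7 characters, but 9 bytes: ö and å each take two bytes in UTF-8.
print(len(sandwich), len(encoded), decoded == sandwich)
```

Decoding with the same encoding gives back the original text, so nothing is lost by encoding.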
What are those little u in my tuple? (python 2.7)
The u indicates that the string is a unicode object.
Converting list of strings with u'...' to a list of normal strings
Try encoding the strings properly. But note that the u does not have any effect on the data: it is just part of the representation of a unicode object (as opposed to a byte string). If your code needs unicode back, then it is better to feed it unicode.
>>> d = [u'homo', u'man', u'human being', u'human']
>>> print [i.encode('utf-8') for i in d]
['homo', 'man', 'human being', 'human']
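Since the u only shows up when the list itself is printed, another option is to print the strings rather than the list. A minimal sketch, written for Python 3 (where the list's representation no longer shows the prefix at all):

```python
d = [u'homo', u'man', u'human being', u'human']

# Joining produces one plain string, so no quotes or prefixes are displayed.
joined = ', '.join(d)
print(joined)

# On Python 3 the list's own representation has no u prefix either.
list_repr = repr(d)
print(list_repr)
```

On Python 2 the join gives the same text, while printing the list would still show the u prefixes.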