Python String Prints as [U'String']

Python string prints as [u'String']

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don't know exaxtly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your test is really only ASCII you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.

 soup[0].encode("latin-1")

soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

 soup[0].encode(soup.originalEncoding)

What's the u prefix in a Python string?

You're right, see 3.1.3. Unicode Strings.

It's been the syntax since Python 2.0.

Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2 to aide the 2 to 3 transition.

What does the 'u' symbol mean in front of string values?

The 'u' in front of the string values means the string is a Unicode string. Unicode is a way to represent more characters than normal ASCII can manage. The fact that you're seeing the u means you're on Python 2 - strings are Unicode by default on Python 3, but on Python 2, the u in front distinguishes Unicode strings. The rest of this answer will focus on Python 2.

You can create a Unicode string multiple ways:

>>> u'foo'
u'foo'
>>> unicode('foo') # Python 2 only
u'foo'

But the real reason is to represent something like this (translation here):

>>> val = u'Ознакомьтесь с документацией'
>>> val
u'\u041e\u0437\u043d\u0430\u043a\u043e\u043c\u044c\u0442\u0435\u0441\u044c \u0441 \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0446\u0438\u0435\u0439'
>>> print val
Ознакомьтесь с документацией

For the most part, Unicode and non-Unicode strings are interoperable on Python 2.

There are other symbols you will see, such as the "raw" symbol r for telling a string not to interpret backslashes. This is extremely useful for writing regular expressions.

>>> 'foo\"'
'foo"'
>>> r'foo\"'
'foo\\"'

Unicode and non-Unicode strings can be equal on Python 2:

>>> bird1 = unicode('unladen swallow')
>>> bird2 = 'unladen swallow'
>>> bird1 == bird2
True

but not on Python 3:

>>> x = u'asdf' # Python 3
>>> y = b'asdf' # b indicates bytestring
>>> x == y
False

Remove \u from string?

I was making an error in assuming that the .encode method of strings modifies the string inplace similar to the .sort() method of a list. But according to the documentation

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

def remove_u(word):
word_u = (word.encode('unicode-escape')).decode("utf-8", "strict")
if r'\u' in word_u:
# print(True)
return word_u.split('\\u')[1]
return word

vocabulary_ = [remove_u(each_word) for each_word in vocabulary_]

Use u'string' on string stored as variable in Python

The u sigil is not part of the value, it's just a type indicator. To convert a string into a Unicode string, you need to know the encoding.

unicodestring = mystring.decode('utf-8')  # or 'latin-1' or ... whatever

and to print it you typically (in Python 2) need to convert back to whatever the system accepts on the output filehandle:

print(unicodestring.encode('utf-8'))  # or 'latin-1' or ... whatever

Python 3 clarifies (though not directly simplifies) the situation by keeping Unicode strings and (what is now called) bytes objects separate.

Suppress the u'prefix indicating unicode' in python strings

You could use Python 3.0.. The default string type is unicode, so the u'' prefix is no longer required..

In short, no. You cannot turn this off.

The u comes from the unicode.__repr__ method, which is used to display stuff in REPL:

>>> print repr(unicode('a'))
u'a'
>>> unicode('a')
u'a'

If I'm not mistaken, you cannot override this without recompiling Python.

The simplest way around this is to simply print the string..

>>> print unicode('a')
a

If you use the unicode() builtin to construct all your strings, you could do something like..

>>> class unicode(unicode):
... def __repr__(self):
... return __builtins__.unicode.__repr__(self).lstrip("u")
...
>>> unicode('a')
a

..but don't do that, it's horrible

Convert a Unicode string to a string in Python (containing extra symbols)

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'

How do I get rid of the b-prefix in a string in python?

decode the bytes to produce a str:

b = b'1234'
print(b.decode('utf-8')) # '1234'

Removing u in list

That 'u' is part of the external representation of the string, meaning it's a Unicode string as opposed to a byte string. It's not in the string, it's part of the type.

As an example, you can create a new Unicode string literal by using the same synax. For instance:

>>> sandwich = u"smörgås"
>>> sandwich
u'sm\xf6rg\xe5s'

This creates a new Unicode string whose value is the Swedish word for sandwich. You can see that the non-English characters are represented by their Unicode code points, ö is \xf6 and å is \xe5. The 'u' prefix appears just like in your example to signify that this string holds Unicode text.

To get rid of those, you need to encode the Unicode string into some byte-oriented representation, such as UTF-8. You can do that with e.g.:

>>> sandwich.encode("utf-8")
'sm\xc3\xb6rg\xc3\xa5s'

Here, we get a new string without the prefix 'u', since this is a byte string. It contains the bytes representing the characters of the Unicode string, with the Swedish characters resulting in multiple bytes due to the wonders of the UTF-8 encoding.

Converting list of strings with u'...' to a list of normal strings

Try proper encoding- But care this u does not have any effect on data- it is just an explicit representation of unicode object (not byte array), if your code needs back unicode then better to feed it unicode.

>>>d =  [u'homo', u'man', u'human being', u'human']
>>>print [i.encode('utf-8') for i in d]
>>>['homo', 'man', 'human being', 'human']


Related Topics



Leave a reply



Submit