Python string prints as [u'String']
[u'ABC'] would be a one-element list of Unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single Unicode string, and then convert that to ASCII.
I don't know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:
soup[0].encode("ascii")
However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.
soup[0].encode("latin-1")
soup[0].encode("utf-8")
Or you can ask Beautiful Soup what the original encoding was and encode back to it (the attribute is originalEncoding in Beautiful Soup 3; Beautiful Soup 4 renamed it to original_encoding):
soup[0].encode(soup.originalEncoding)
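As a minimal sketch of how the three encodings differ, written for Python 3 where str is the Unicode type: the sample value and the helper name encode_or_none are assumptions made for illustration and are not part of Beautiful Soup.

```python
# Illustrative sketch (Python 3). `text` stands in for one element of the list;
# `encode_or_none` is a hypothetical helper, not a Beautiful Soup API.
text = "Grüße"


def encode_or_none(value, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_bytes = encode_or_none(text, "ascii")     # fails: 'ü' and 'ß' are not ASCII
latin1_bytes = encode_or_none(text, "latin-1")  # one byte per character
utf8_bytes = encode_or_none(text, "utf-8")      # two bytes each for 'ü' and 'ß'

print(ascii_bytes, latin1_bytes, utf8_bytes)
```

If the ASCII attempt fails like this, the data is not ASCII and one of the other encodings is the one to use.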
What's the u prefix in a Python string?
You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made the prefix redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed it, but it was re-added in 3.3+ for compatibility with Python 2, to aid the 2-to-3 transition.
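A short check of that behaviour, assuming Python 3.3 or later (the variable names are illustrative):

```python
# Python 3.3+ sketch: the u prefix is accepted but changes nothing.
plain = 'abc'
prefixed = u'abc'

print(type(prefixed) is str)  # both literals produce the same type
print(plain == prefixed)      # and the same value
print(repr(prefixed))         # the repr carries no u prefix on Python 3
```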
What does the 'u' symbol mean in front of string values?
The 'u' in front of the string values means the string is a Unicode string. Unicode is a way to represent more characters than normal ASCII can manage. The fact that you're seeing the u means you're on Python 2 - strings are Unicode by default on Python 3, but on Python 2, the u in front distinguishes Unicode strings. The rest of this answer will focus on Python 2.
You can create a Unicode string in multiple ways:
>>> u'foo'
u'foo'
>>> unicode('foo') # Python 2 only
u'foo'
But the real reason is to represent something like this (Russian for roughly "Read the documentation"):
>>> val = u'Ознакомьтесь с документацией'
>>> val
u'\u041e\u0437\u043d\u0430\u043a\u043e\u043c\u044c\u0442\u0435\u0441\u044c \u0441 \u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0446\u0438\u0435\u0439'
>>> print val
Ознакомьтесь с документацией
For the most part, Unicode and non-Unicode strings are interoperable on Python 2.
There are other symbols you will see, such as the "raw" symbol r for telling a string not to interpret backslashes. This is extremely useful for writing regular expressions.
>>> 'foo\"'
'foo"'
>>> r'foo\"'
'foo\\"'
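To show why this matters for regular expressions, here is a small sketch (the pattern and the sample sentence are made up for illustration). Without the r prefix, "\b" is a backspace character rather than the regex word-boundary escape:

```python
import re

# Without the r prefix, "\b" becomes a single backspace character.
plain_pattern = "\bcat\b"
# With the r prefix, the backslashes reach the regex engine untouched.
raw_pattern = r"\bcat\b"

print(len(plain_pattern), len(raw_pattern))

sentence = "the cat sat"
raw_matches = re.findall(raw_pattern, sentence)      # matches the word "cat"
plain_matches = re.findall(plain_pattern, sentence)  # looks for literal backspaces

print(raw_matches, plain_matches)
```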
Unicode and non-Unicode strings can be equal on Python 2:
>>> bird1 = unicode('unladen swallow')
>>> bird2 = 'unladen swallow'
>>> bird1 == bird2
True
but not on Python 3:
>>> x = u'asdf' # Python 3
>>> y = b'asdf' # b indicates bytestring
>>> x == y
False
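On Python 3 the two values only compare equal once they are the same type. A minimal sketch, assuming the bytes are ASCII-encoded:

```python
x = 'asdf'
y = b'asdf'

same_as_is = (x == y)                       # str and bytes never compare equal
same_after_decode = (x == y.decode('ascii'))  # decode the bytes to str first
same_after_encode = (x.encode('ascii') == y)  # or encode the str to bytes

print(same_as_is, same_after_decode, same_after_encode)
```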
Remove \u from string?
I was making an error in assuming that the .encode method of strings modifies the string in place, similar to the .sort() method of a list. But according to the documentation:
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.
def remove_u(word):
    # encode() returns a new bytes object, so the result must be assigned
    word_u = word.encode('unicode-escape').decode("utf-8", "strict")
    if r'\u' in word_u:
        # keep only the text after the first \u escape
        return word_u.split('\\u')[1]
    return word

vocabulary_ = [remove_u(each_word) for each_word in vocabulary_]
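The point from the documentation can be checked with a minimal Python 3 sketch (the sample word is an assumption chosen for illustration): .encode returns a new bytes object and leaves the original string untouched.

```python
word = "\u041e\u043a"  # two Cyrillic letters, both above U+00FF

escaped = word.encode("unicode-escape")       # new bytes object with \uXXXX escapes
round_trip = escaped.decode("unicode-escape")  # decoding restores the original text

print(word)     # the original str is unchanged
print(escaped)
print(round_trip == word)
```

Only code points above U+00FF are written as \uXXXX by this codec; characters in the Latin-1 range come out as \xXX instead, so a check for r'\u' will not catch them.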
Use u'string' on string stored as variable in Python
The u sigil is not part of the value, it's just a type indicator. To convert a string into a Unicode string, you need to know the encoding.
unicodestring = mystring.decode('utf-8') # or 'latin-1' or ... whatever
and to print it you typically (in Python 2) need to convert back to whatever the system accepts on the output filehandle:
print(unicodestring.encode('utf-8')) # or 'latin-1' or ... whatever
Python 3 clarifies (though not directly simplifies) the situation by keeping Unicode strings and (what is now called) bytes objects separate.
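A small Python 3 sketch of why the encoding has to be known (the sample text is an assumption): the same bytes decode without error under the wrong encoding, but produce different text.

```python
# Bytes as they might arrive from a file or socket, encoded as UTF-8.
raw = "café".encode("utf-8")

right = raw.decode("utf-8")    # correct encoding recovers the text
wrong = raw.decode("latin-1")  # wrong encoding silently yields garbled text

print(right)
print(wrong)
```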
Suppress the u'prefix indicating unicode' in python strings
You could use Python 3. The default string type is Unicode, so the u'' prefix is no longer required.
In short, no. You cannot turn this off.
The u comes from the unicode.__repr__ method, which is used to display values in the REPL:
>>> print repr(unicode('a'))
u'a'
>>> unicode('a')
u'a'
If I'm not mistaken, you cannot override this without recompiling Python.
The simplest way around this is to print the string:
>>> print unicode('a')
a
If you use the unicode() builtin to construct all your strings, you could do something like:
>>> class unicode(unicode):
...     def __repr__(self):
...         return __builtins__.unicode.__repr__(self).lstrip("u")
...
>>> unicode('a')
a
...but don't do that, it's horrible.
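The split between what the prompt echoes and what print shows applies to any object, not only to unicode. A minimal sketch using a hypothetical Label class (not from the original answer), runnable on Python 3:

```python
class Label:
    """Hypothetical class used only to show which method each display path calls."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # Used by the interactive prompt and by repr()
        return "Label(%r)" % self.text

    def __str__(self):
        # Used by print() and str()
        return self.text


label = Label("a")
print(repr(label))  # what the interactive prompt would echo
print(label)        # print() goes through __str__
print([label])      # containers use __repr__ for their elements
```

The last line is why a list of Python 2 Unicode strings shows the u prefix even when it is printed.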
Convert a Unicode string to a string in Python (containing extra symbols)
See unicodedata.normalize
title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
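The same approach on Python 3, as a sketch: there the encode step returns bytes, so one extra decode is needed to get a str back (the variable names are illustrative).

```python
import unicodedata

title = "Klüft skräms inför på fédéral électoral große"

# NFKD splits accented letters into a base letter plus combining marks;
# encoding to ASCII with errors="ignore" then drops everything non-ASCII.
ascii_bytes = unicodedata.normalize("NFKD", title).encode("ascii", "ignore")
ascii_text = ascii_bytes.decode("ascii")

print(ascii_text)
```

Note that characters with no decomposition, such as ß, are dropped entirely rather than transliterated, which is why "große" comes out as "groe".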
How do I get rid of the b-prefix in a string in python?
decode the bytes to produce a str:
b = b'1234'
print(b.decode('utf-8')) # prints 1234
Removing u in list
That 'u' is part of the external representation of the string, meaning it's a Unicode string as opposed to a byte string. It's not in the string, it's part of the type.
As an example, you can create a new Unicode string literal by using the same syntax. For instance:
>>> sandwich = u"smörgås"
>>> sandwich
u'sm\xf6rg\xe5s'
This creates a new Unicode string whose value is the Swedish word for sandwich. You can see that the non-English characters are represented by their Unicode code points: ö is \xf6 and å is \xe5. The 'u' prefix appears just like in your example to signify that this string holds Unicode text.
To get rid of those, you need to encode the Unicode string into some byte-oriented representation, such as UTF-8. You can do that with e.g.:
>>> sandwich.encode("utf-8")
'sm\xc3\xb6rg\xc3\xa5s'
Here, we get a new string without the prefix 'u', since this is a byte string. It contains the bytes representing the characters of the Unicode string, with the Swedish characters resulting in multiple bytes due to the wonders of the UTF-8 encoding.
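The multi-byte effect is easy to check. A sketch for Python 3, where the encoded result is a bytes object:

```python
sandwich = "smörgås"
encoded = sandwich.encode("utf-8")

print(len(sandwich))  # number of characters
print(len(encoded))   # number of bytes: ö and å take two bytes each
print(encoded)
```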
Converting list of strings with u'...' to a list of normal strings
Try encoding the strings properly. But note that the u prefix has no effect on the data; it is just the explicit representation of a unicode object (as opposed to a byte string). If your code needs unicode back, it is better to feed it unicode.
>>> d = [u'homo', u'man', u'human being', u'human']
>>> print [i.encode('utf-8') for i in d]
['homo', 'man', 'human being', 'human']
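On Python 3 the situation is reversed, which the following sketch illustrates (same sample list, names otherwise illustrative): the strings already display without a prefix, and encoding them is what introduces one.

```python
d = ['homo', 'man', 'human being', 'human']

as_list = str(d)                           # plain str elements, no prefix
as_bytes = [i.encode('utf-8') for i in d]  # bytes elements display with a b prefix
as_text = ", ".join(d)                     # join for display without quotes or brackets

print(as_list)
print(as_bytes)
print(as_text)
```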