Convert XML/HTML Entities into Unicode String in Python
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
Python 2 (in Python 3.0-3.3 the same method lives in html.parser):
import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'
Python 3.4+:
import html
html.unescape('&copy; 2010') # '© 2010'
html.unescape('&#169; 2010') # '© 2010'
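As a small illustration of the Python 3.4+ call, the sketch below feeds html.unescape() the named, decimal, and hexadecimal forms of the same character. The sample strings are made up for the demonstration.

```python
import html

# Named, decimal, and hexadecimal references for the same character (U+00A9).
samples = ["&copy; 2010", "&#169; 2010", "&#xA9; 2010"]

decoded = [html.unescape(s) for s in samples]
for raw, text in zip(samples, decoded):
    print(raw, "->", text)

# Text without any entities passes through unchanged.
plain = html.unescape("no entities here")
```

All three spellings should decode to the same string, so the caller does not need to know which form the markup used.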
Python, convert HTML entities to Unicode
If you want Unicode handling, use unicode strings (u'...' literals in Python 2). With them, everything in your example works as expected.
# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup
astring = u"P&O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"
pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2
This gives the following output.
a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: &
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC
BeautifulSoup encodes them, HTMLParser decodes them.
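The same encode/decode contrast can be sketched in Python 3 with only the standard library. This is an approximation, not BeautifulSoup itself: the hypothetical helper reserialize() decodes entities and then escapes &, < and > again, which is roughly what parsing a fragment and printing it back does.

```python
import html

def reserialize(fragment):
    """Decode entities, then re-escape &, < and > (roughly parse-and-print)."""
    return html.escape(html.unescape(fragment), quote=False)

samples = ["P&O.", "&amp; ", "&gt;", "&gt; 150ÎC"]

for s in samples:
    # Left: markup-safe form; right: plain text with entities decoded.
    print(repr(reserialize(s)), "|", repr(html.unescape(s)))
```

The left column stays safe to embed in markup, while the right column is the text a reader would see.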
Convert HTML entities to Unicode and vice versa
You need BeautifulSoup 3 (the BeautifulSoup package, Python 2 only) for this. Note that cgi.escape() was removed in Python 3.8.
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
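For Python 3, where BeautifulSoup 3 and cgi.escape() are not available, a standard-library-only sketch of the same round trip might look like this. The function names are hypothetical snake_case renamings of the ones above; html.escape(..., quote=False) is used on the assumption that it matches the default behavior of cgi.escape().

```python
import html

def html_entities_to_unicode(text):
    """Decode named and numeric HTML entities, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Escape &, < and >, then turn non-ASCII characters into numeric references."""
    escaped = html.escape(text, quote=False)
    # encode() yields ASCII bytes; decode back so the caller gets a str.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)
print(uni)
print(htmlent)
```

As in the original, named entities such as &reg; do not survive the round trip: they come back as numeric references.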
Decode HTML entities in Python string?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI html.parser.HTMLParser.unescape() was deprecated in Python 3.4 and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in HTMLParser
- For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
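If one code base has to run across all of these versions, the import fallbacks can be chained so that the rest of the program only sees a single unescape name. This is a sketch rather than a tested compatibility layer; the version ranges in the comments follow the ones given above.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the method of a throwaway parser instance to the same name.
    unescape = HTMLParser().unescape

decoded = unescape("&pound;682m")
print(decoded)
```

Trying html.unescape first means the deprecated method is only touched on interpreters that have nothing better.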
Unicode encoding in python
You are looking for HTML entity decoding, not Unicode (or codec) decoding.
See Decode HTML entities in Python string? for ways to do this.
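The difference between the two kinds of decoding can be illustrated with a made-up byte string that needs both: codec decoding turns bytes into text, and entity decoding then turns entities into characters. The sample data is an assumption, and the code is Python 3.

```python
import html

# UTF-8 bytes that also contain HTML entities (hypothetical input).
raw = b"caf\xc3\xa9 &amp; cr&egrave;me"

# Codec decoding: bytes -> str. The entities are left untouched.
text = raw.decode("utf-8")

# Entity decoding: str -> str. The entities become characters.
clean = html.unescape(text)

print(text)
print(clean)
```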
Python: Unicode to html entities
\xc3\xa1 is the UTF-8 encoding of á, not its Unicode code point. (áááá in Unicode would be u'\xe1\xe1\xe1\xe1'.)
You therefore need to use a byte string literal to define it, not a unicode literal ('' vs u''). Once you have the UTF-8 bytes, you need to decode them to Unicode in order to encode the result again as ASCII with XML character references:
>>> name = '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'.decode('utf-8')
>>> name.encode('ascii', 'xmlcharrefreplace')
'&#225;&#225;&#225;&#225;'
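In Python 3 the same two steps apply, but the distinction is explicit in the types: the UTF-8 data is a bytes literal, decoding yields str, and encoding yields bytes again. A minimal sketch:

```python
# UTF-8 encoded bytes for 'áááá' (each á is the two bytes C3 A1).
utf8_bytes = b"\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1"

# bytes -> str
name = utf8_bytes.decode("utf-8")

# str -> ASCII bytes, with non-ASCII characters as numeric references
entities = name.encode("ascii", "xmlcharrefreplace")

print(name)
print(entities)
```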
Python, XML, é type encodings
If you just want to parse the HTML entity to its Unicode equivalent:
>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&#233;')
u'\xe9'
>>> print parser.unescape('&#233;')
é
This is for Python 2.x. In Python 3.0-3.3 the import is import html.parser; from Python 3.4 on, use html.unescape() instead.
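If the text actually comes out of an XML document, which is an assumption here because the original question is not shown, an XML parser resolves numeric character references on its own and no separate unescape step is needed. The sketch below uses the standard library's xml.etree.ElementTree with a made-up element. Named HTML entities such as &eacute; are not defined in plain XML and are rejected.

```python
import xml.etree.ElementTree as ET

# Numeric character references are resolved by the XML parser itself.
element = ET.fromstring("<name>caf&#233;</name>")
text = element.text
print(text)

# HTML-only named entities are undefined in XML and fail to parse.
try:
    ET.fromstring("<name>caf&eacute;</name>")
    named_ok = True
except ET.ParseError:
    named_ok = False
print(named_ok)
```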