Convert XML/HTML Entities into Unicode String in Python

The standard library's HTMLParser has an undocumented method, unescape(), which decodes named and numeric character references:

Python 2.x (on Python 3.0-3.3, import html.parser instead):

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4+:

import html
html.unescape('&copy; 2010') # '\xa9 2010'
html.unescape('&#169; 2010') # '\xa9 2010'
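
As a quick check on Python 3.4+, here is a minimal sketch showing that named, decimal, and hexadecimal references all decode to the same character. The sample strings are illustrative and not taken from the original answer.

```python
import html

# Named, decimal, and hexadecimal references for the copyright sign.
samples = ["&copy; 2010", "&#169; 2010", "&#xA9; 2010"]
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))
```

All three inputs should come back as the same string, with the reference replaced by U+00A9.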

Python, convert HTML entities to Unicode

If you want unicode handling, use unicode strings. Everything works as expected in your example then.

# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup

astring = u"P&O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output.

a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: &
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC

BeautifulSoup encodes them, HTMLParser decodes them.

Convert HTML entities to Unicode and vice versa

This requires BeautifulSoup 3 (the old BeautifulSoup package) and Python 2.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
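
The code above is Python 2 only, and cgi.escape was removed in Python 3.8. A rough Python 3 equivalent can be sketched with the standard library alone, assuming html.unescape and html.escape are acceptable stand-ins for BeautifulStoneSoup and cgi.escape. The snake_case function names are my own, not part of the original answer.

```python
import html


def html_entities_to_unicode(text):
    """Decode HTML character references, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, <, > and replace non-ASCII characters with numeric references."""
    # quote=False leaves quotation marks alone, like the old cgi.escape default.
    escaped = html.escape(text, quote=False)
    # encode() returns bytes; decode back to str so the result stays text.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```

Note that the round trip is not symmetric: named entities such as &reg; come back as numeric references such as &#174;, because xmlcharrefreplace only knows code points.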

Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape was deprecated in 3.4 and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
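
If one code base has to run on both old and new interpreters, the two approaches can be combined into a single import shim. This is only a sketch: it prefers html.unescape where it exists and falls back to the parser method otherwise. The module-level name unescape is my choice.

```python
try:
    # Python 3.4+: the documented module-level function.
    from html import unescape
except ImportError:
    # Older interpreters: fall back to the undocumented parser method.
    try:
        from html.parser import HTMLParser  # Python 3.0-3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2.6-2.7
    unescape = HTMLParser().unescape

result = unescape("&pound;682m")
print(result)
```

Trying the newer API first means the fallback branch is never touched on Python 3.9+, where HTMLParser.unescape no longer exists.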

Unicode encoding in python

You are looking for HTML entity decoding, not Unicode (or codec) decoding.

See Decode HTML entities in Python string? for ways to do this.
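
To make the distinction concrete, here is a small Python 3 sketch contrasting the two operations. The sample values are made up for illustration.

```python
import html

# Codec decoding: bytes -> str, driven by a character encoding such as UTF-8.
raw_bytes = b"caf\xc3\xa9"
from_bytes = raw_bytes.decode("utf-8")

# Entity decoding: str -> str, replacing character references with characters.
markup = "caf&eacute;"
from_entities = html.unescape(markup)

print(from_bytes, from_entities)
```

Both end up as the same text, but only the second one involves HTML entities. Calling .decode() on a string that contains &eacute; would leave the entity untouched.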

Python: Unicode to html entities

\xc3\xa1 is the UTF-8 byte sequence for á, not its Unicode code point.

(áááá as a unicode string would be u'\xe1\xe1\xe1\xe1')

You therefore need to define it with a byte string literal, not a unicode literal ('' vs u''). Once you have the UTF-8 bytes, decode them to Unicode so that you can encode the result to ASCII with XML character references:

>>> name = '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'.decode('utf-8')
>>> name.encode('ascii', 'xmlcharrefreplace')
'&#225;&#225;&#225;&#225;'
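
The session above is Python 2, where '' is a byte string. On Python 3 the same two steps might look like the sketch below, with the bytes written as an explicit b'' literal. The variable names are illustrative.

```python
# UTF-8 bytes for four copies of U+00E1 (a with acute accent).
utf8_bytes = b"\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1"

# Step 1: decode the UTF-8 bytes into a str of four code points.
name = utf8_bytes.decode("utf-8")

# Step 2: encode to ASCII, replacing unencodable characters
# with numeric character references.
ascii_bytes = name.encode("ascii", "xmlcharrefreplace")

print(name)
print(ascii_bytes)
```

The result of encode() is a bytes object. Call .decode('ascii') on it if a str is needed.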

Python, XML, é type encodings

If you just want to parse the HTML entity to its unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&eacute;')
u'\xe9'
>>> print parser.unescape('&eacute;')
é

This is for Python 2.x. In Python 3.0-3.3 the import is import html.parser; from 3.4 on, use html.unescape() instead.

