Convert XML/HTML Entities into Unicode String in Python

The standard library's HTMLParser class has an undocumented method, unescape(), which does exactly what its name suggests:

Python 2 (on Python 3.0-3.3, import HTMLParser from html.parser instead):

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4+:

import html
html.unescape('&copy; 2010') # '\xa9 2010'
html.unescape('&#169; 2010') # '\xa9 2010'
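
As a quick check of what html.unescape() accepts, here is a minimal sketch for Python 3.4+. It suggests that named, decimal, and hexadecimal character references all decode to the same character. The sample strings and variable names are made up for illustration.

```python
import html

# Named, decimal, and hexadecimal references for the copyright sign.
samples = ['&copy; 2010', '&#169; 2010', '&#xA9; 2010']

decoded = [html.unescape(s) for s in samples]
for raw, text in zip(samples, decoded):
    print(raw, '->', text)

# Text without any character references comes back unchanged.
plain = html.unescape('no entities here')
```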

Python, convert HTML entities to Unicode

If you want unicode handling, use unicode strings. Everything works as expected in your example then.

# -*- coding: utf-8 -*-
# Python 2 code (HTMLParser module, print statement).
import HTMLParser
from bs4 import BeautifulSoup

astring = u"P&amp;O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
# Name the parser explicitly so the output does not depend on which parsers are installed.
a1 = BeautifulSoup('<span>%s</span>' % astring, 'html.parser')
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring, 'html.parser')
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring, 'html.parser')
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring, 'html.parser')
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output.

a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: &
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC

BeautifulSoup encodes the special characters as entities when it serializes the markup; HTMLParser.unescape() decodes them.
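
The same encode/decode asymmetry can be seen with the standard library alone. The sketch below assumes Python 3.4+ and uses html.escape() as a stand-in for BeautifulSoup's serializer, so it is an analogy rather than a reproduction of the output above. The sample string is invented.

```python
import html

raw = 'P&O. > 150'

# Escaping replaces markup-significant characters (&, <, >) with entities.
encoded = html.escape(raw, quote=False)
print(encoded)

# Unescaping turns the entities back into the original characters.
decoded = html.unescape(encoded)
print(decoded)
```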

Convert HTML entities to Unicode and vice versa

This requires BeautifulSoup 3 (the legacy BeautifulSoup package, which runs on Python 2 only).

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
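
On Python 3 neither BeautifulSoup 3 nor cgi.escape() is available (cgi.escape() was removed in Python 3.8). A rough standard-library equivalent, assuming Python 3.4+, might look like the sketch below. The snake_case function names are hypothetical renamings, not part of any library.

```python
import html

def html_entities_to_unicode(text):
    """Decode HTML character references, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Escape &, <, > and turn non-ASCII characters into numeric references."""
    escaped = html.escape(text, quote=False)
    # xmlcharrefreplace rewrites each non-ASCII character as &#NNN;
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

text = '&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;'

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```

Note that the round trip is not exact: named entities such as &reg; come back as numeric references such as &#174;.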

Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

Note that html.parser.HTMLParser.unescape() was deprecated in Python 3.4 and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9, so use html.unescape() on any current Python 3.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3.0-3.3 it's in html.parser

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
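
If the code has to run across all of these versions without six, one option is to prefer html.unescape() and fall back to the parser method only when it is missing. This is a sketch of that idea, not an established recipe; the module-level name unescape is chosen here for illustration.

```python
try:
    # Python 3.4+: the documented function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Older versions only offer the undocumented bound method.
    unescape = HTMLParser().unescape

result = unescape('&pound;682m')
print(result)
```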

Unicode encoding in python

You are looking for HTML entity decoding, not Unicode (or codec) decoding.

See Decode HTML entities in Python string? for ways to do this.
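
To make the distinction concrete, here is a small sketch assuming Python 3.4+. It borrows the '£682m' example string from the section referenced above. Codec decoding turns bytes into text, while entity decoding turns text containing character references into other text.

```python
import html

# Codec decoding: bytes -> str. These bytes are the UTF-8 encoding of '£682m'.
from_bytes = b'\xc2\xa3682m'.decode('utf-8')

# Entity decoding: str -> str. The input is already text, but it contains
# a character reference that still has to be resolved.
from_entity = html.unescape('&pound;682m')

print(from_bytes == from_entity)
```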

Python: Unicode to html entities

\xc3\xa1 is the UTF-8 byte encoding of á, not its Unicode code point.

(áááá in Unicode would be u'\xe1\xe1\xe1\xe1')

You therefore need to use a byte string literal to define it, not a unicode literal ('' vs u''). Once you have UTF-8 bytes, you need to decode them to Unicode in order to encode the result again as ASCII with XML character references:

>>> name = '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'.decode('utf-8')
>>> name.encode('ascii', 'xmlcharrefreplace')
'&#225;&#225;&#225;&#225;'
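
The session above is Python 2. On Python 3 the same two steps work with an explicit bytes literal, as in this minimal sketch (the variable names raw and as_entities are illustrative); the result of encode() is a bytes object rather than a str.

```python
# UTF-8 bytes for 'áááá' (each á is the two bytes C3 A1).
raw = b'\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'

# Step 1: decode the UTF-8 bytes into a Unicode string.
name = raw.decode('utf-8')

# Step 2: encode to ASCII, replacing non-ASCII characters with &#NNN; references.
as_entities = name.encode('ascii', 'xmlcharrefreplace')

print(name)
print(as_entities)
```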

Python, XML, &eacute; type encodings

If you just want to parse the HTML entity to its unicode equivalent:

>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&eacute;')
u'\xe9'
>>> print parser.unescape('&eacute;')
é

This is for Python 2.x. On Python 3.0-3.3 the import is import html.parser (and the class is html.parser.HTMLParser); on Python 3.4+ use html.unescape() instead.
