How to Perform HTML Decoding/Encoding Using Python/Django

How do I perform HTML decoding/encoding using Python/Django?

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
"""Returns the given HTML with ampersands, quotes and carets encoded."""
return mark_safe(force_unicode(html).replace('&', '&').replace('<', '&l
t;').replace('>', '>').replace('"', '"').replace("'", '''))

To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:

def html_decode(s):
"""
Returns the ASCII decoded version of the given HTML string. This does
NOT remove normal HTML tags like <p>.
"""
htmlCodes = (
("'", '''),
('"', '"'),
('>', '>'),
('<', '<'),
('&', '&')
)
for code in htmlCodes:
s = s.replace(code[1], code[0])
return s

unescaped = html_decode(my_string)

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.x:
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# >= Python 3.5:
from html import unescape
unescaped = unescape(my_string)

As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}

Decode / encode html escaped special characters in Python

You have a Mojibake, double-encoded data. You not only have HTML entities, your data was incorrectly decoded from bytes to text before the HTML entities were applied.

For your example, the two Ã, entities decode to the Unicode characters à and . Those two characters are also known (from the Unicode standard), as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being mis-interpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant.

If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That à (U+00C3!) shows up here is not a coincidence, it is typical of UTF-8 -> Latin variant Mojibakes to end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.

You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly, and sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:

>>> import html
>>> broken = ""Coup d'État""
>>> html.unescape(broken)
'"Coup d\'État"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'

A better option is to use the special ftfy library (the name is an acronym for Fixed That For You), which uses detailed knowledge about how to recognize such mistakes and undo the damage.

ftfy also handles the HTML-entity decoding, all in one step:

>>> import ftfy
>>> ftfy.fix_text(""Coup d'État"")
'"Coup d\'État"'

The library includes sloppy variants of text codes often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces so it knows what to do to reverse the damage.

encoding text to html entity (not the tags)

Encode named HTML entities with Python

http://beckism.com/2009/03/named_entities_python/

There is also a django app for both decoding and encoding:

https://github.com/cobrateam/python-htmlentities

For Python 2.x (Change to html.entities.codepoint2name in Python 3.x):

'''
Registers a special handler for named HTML entities

Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''

import codecs
from htmlentitydefs import codepoint2name

def named_entities(text):
if isinstance(text, (UnicodeEncodeError, UnicodeTranslateError)):
s = []
for c in text.object[text.start:text.end]:
if ord(c) in codepoint2name:
s.append(u'&%s;' % codepoint2name[ord(c)])
else:
s.append(u'&#%s;' % ord(c))
return ''.join(s), text.end
else:
raise TypeError("Can't handle %s" % text.__name__)

codecs.register_error('named_entities', named_entities)

python django decode array html codes to readable text

You need to convert the codes into string characters and then join the characters together:

myString = ''.join(map(chr, eventData))

If you have a hard time understanding what the code above does look at the code below - it's quite similar. Both versions use chr() to convert every numerical ASCI code to one-character string and then join the strings together. The only difference is, in the former version I replaced map() with a simple for loop.

characters = []
for code in eventData:
characters.append(chr(code))
myString = ''.join(characters)

Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('£682m'))

FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3 it's in html.parser
>>> try:
... # Python 2.6-2.7
... from HTMLParser import HTMLParser
... except ImportError:
... # Python 3
... from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m

Decoding html encoded strings in python

What's you're trying to do is called "HTML entity decoding" and it's covered in a number of past Stack Overflow questions, for example:

  • How to unescape apostrophes and such in Python?
  • Decoding HTML Entities With Python

Here's a code snippet using the Beautiful Soup HTML parsing library to decode your example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup

string = "Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
s = BeautifulSoup(string,convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s

Here's the output:

Scam, hoax, or the real deal, he’s
gonna work his way to the bottom of
the sordid tale, and hopefully end up
with an arcade game in the process.



Related Topics



Leave a reply



Submit