How do I perform HTML decoding/encoding using Python/Django?
Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;')
    )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s
unescaped = html_decode(my_string)
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.0-3.8 (deprecated since 3.4, removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
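As a minimal sketch of the round trip on Python 3.4+, the standard library's html.escape and html.unescape can be paired as below. The sample strings and the two helper names (decode_amp_first, decode_amp_last) are made up for illustration; the helpers only show why a hand-rolled decoder should replace &amp; last.

```python
import html

# Round trip with the standard library (the sample markup is illustrative).
original = '<a href="/x?a=1&b=2">Tom\'s</a>'
escaped = html.escape(original, quote=True)
restored = html.unescape(escaped)

# Hypothetical helpers: the same replacements, differing only in
# when the ampersand entity is decoded.
PAIRS = (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&#39;", "'"))

def decode_amp_last(s):
    """Decode the other entities first, then &amp; (the safe order)."""
    for entity, char in PAIRS:
        s = s.replace(entity, char)
    return s.replace("&amp;", "&")

def decode_amp_first(s):
    """Decode &amp; first; this can decode the same text twice."""
    s = s.replace("&amp;", "&")
    for entity, char in PAIRS:
        s = s.replace(entity, char)
    return s

# The literal text "&lt;" is escaped to "&amp;lt;".
literal = html.escape("&lt;")
print(decode_amp_last(literal))   # the literal text is recovered
print(decode_amp_first(literal))  # over-decoded to a bare angle bracket
```

Decoding &amp; first turns "&amp;lt;" into "&lt;", which the next replacement then decodes a second time.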
As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}
Decode / encode html escaped special characters in Python
You have a Mojibake, double-encoded data. You not only have HTML entities, your data was incorrectly decoded from bytes to text before the HTML entities were applied.
For your example, the two entities &Atilde; and &permil; decode to the Unicode characters Ã and ‰. Those two characters are also known (from the Unicode standard) as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being misinterpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).
If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence; it is typical of UTF-8 -> Latin variant Mojibakes to end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
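The reasoning above can be reproduced with the standard library alone. This is a small sketch that assumes, as the answer does, that the intended character was É and that the wrong codec was CP 1252; the variable names are illustrative.

```python
import unicodedata

intended = "\u00c9"                      # É, the character assumed to be intended
utf8_bytes = intended.encode("utf-8")    # the correct encoding step
mojibake = utf8_bytes.decode("cp1252")   # the incorrect decoding step

# Look up the Unicode names of the two characters the mistake produces.
names = [unicodedata.name(ch) for ch in mojibake]

print(utf8_bytes.hex())
print(names)
```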
You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly, and sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:
>>> import html
>>> broken = "&quot;Coup d'&Atilde;&permil;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'Ã‰tat"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
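To make the data-loss caveat concrete, the sketch below lists the byte values that Python's own cp1252 codec leaves unmapped and wraps the manual repair in a hypothetical helper, undo_cp1252_mojibake, that returns the text unchanged when the repair cannot be applied. It assumes the wrong codec really was CP 1252.

```python
def undo_cp1252_mojibake(text):
    """Try to reverse UTF-8 text that was wrongly decoded as CP 1252.

    Hypothetical helper: returns the input unchanged if the text cannot be
    encoded as CP 1252 or the resulting bytes are not valid UTF-8.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Collect the byte values that Python's cp1252 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
print(undo_cp1252_mojibake("\u00c3\u2030tat"))
```

A character such as Á (UTF-8 bytes C3 81) lands on one of those unmapped values, which is why the manual repair cannot be relied on for arbitrary text.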
A better option is to use the special ftfy library (the name is an acronym for Fixed That For You), which uses detailed knowledge about how to recognize such mistakes and undo the damage. ftfy also handles the HTML-entity decoding, all in one step:
>>> import ftfy
>>> ftfy.fix_text("&quot;Coup d'&Atilde;&permil;tat&quot;")
'"Coup d\'État"'
The library includes "sloppy" variants of the text codecs often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces, so it knows what to do to reverse the damage.
encoding text to html entity (not the tags)
Encode named HTML entities with Python
http://beckism.com/2009/03/named_entities_python/
There is also a django app for both decoding and encoding:
https://github.com/cobrateam/python-htmlentities
For Python 2.x (change the import to html.entities.codepoint2name in Python 3.x):
'''
Registers a special handler for named HTML entities
Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''
import codecs
from htmlentitydefs import codepoint2name
def named_entities(text):
    if isinstance(text, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in text.object[text.start:text.end]:
            if ord(c) in codepoint2name:
                # Known code point: emit the named entity, e.g. &eacute;
                s.append(u'&%s;' % codepoint2name[ord(c)])
            else:
                # No named entity: fall back to a numeric character reference
                s.append(u'&#%s;' % ord(c))
        return ''.join(s), text.end
    else:
        raise TypeError("Can't handle %s" % type(text).__name__)
codecs.register_error('named_entities', named_entities)
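For Python 3, a sketch of the same idea might look like the following. It keeps the handler name 'named_entities' from the snippet above; the sample text is made up, and the built-in 'xmlcharrefreplace' handler is shown alongside for comparison because it produces numeric references only.

```python
import codecs
from html.entities import codepoint2name

def named_entities(error):
    """Codec error handler that replaces unencodable characters with entities."""
    if isinstance(error, (UnicodeEncodeError, UnicodeTranslateError)):
        parts = []
        for ch in error.object[error.start:error.end]:
            code_point = ord(ch)
            if code_point in codepoint2name:
                # A named entity exists for this code point.
                parts.append("&%s;" % codepoint2name[code_point])
            else:
                # Otherwise fall back to a numeric character reference.
                parts.append("&#%d;" % code_point)
        return "".join(parts), error.end
    raise TypeError("Can't handle %s" % type(error).__name__)

codecs.register_error("named_entities", named_entities)

sample = "caf\u00e9 \u00a35 \u2603"  # illustrative text: café £5 ☃
encoded = sample.encode("ascii", "named_entities")
numeric_only = sample.encode("ascii", "xmlcharrefreplace")
print(encoded)
print(numeric_only)
```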
python django decode array html codes to readable text
You need to convert the codes into string characters and then join the characters together:
myString = ''.join(map(chr, eventData))
If you have a hard time understanding what the code above does, look at the code below; it is quite similar. Both versions use chr() to convert every numerical ASCII code to a one-character string and then join the strings together. The only difference is that in the latter version I replaced map() with a simple for loop.
characters = []
for code in eventData:
    characters.append(chr(code))
myString = ''.join(characters)
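A runnable version of the same conversion, with made-up sample data standing in for eventData, could look like this. The bytes-based route is an alternative that assumes every value is in the ASCII range (0-127).

```python
# Hypothetical sample input; the real eventData comes from the question.
event_data = [72, 101, 108, 108, 111]

# Convert each code to a one-character string and join the results.
my_string = "".join(chr(code) for code in event_data)

# Alternative for pure ASCII values: build a bytes object and decode it.
via_bytes = bytes(event_data).decode("ascii")

print(my_string)
```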
Decode HTML entities in Python string?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.
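As a quick illustration, html.unescape() accepts named, decimal, and hexadecimal character references alike. The sample strings below are variations on the one in the question.

```python
import html

# The same pound sign written as a named, decimal, and hexadecimal reference.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m"]
decoded = [html.unescape(s) for s in samples]

# Text without any references passes through unchanged.
plain = html.unescape("682m")

print(decoded)
```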
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in HTMLParser
- For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
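If the same code has to run across these versions without six, one possible arrangement is to prefer html.unescape and fall back to the parser method only where it is missing. This is a sketch, not the only way to do it; the module-level name unescape is chosen here for illustration.

```python
try:
    # Python 3.4+: the supported function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the parser method so callers can use one name everywhere.
    unescape = HTMLParser().unescape

result = unescape("&pound;682m")
print(result)
```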
Decoding html encoded strings in python
What you're trying to do is called "HTML entity decoding", and it's covered in a number of past Stack Overflow questions, for example:
- How to unescape apostrophes and such in Python?
- Decoding HTML Entities With Python
Here's a code snippet using the Beautiful Soup HTML parsing library to decode your example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
s = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s
Here's the output:
Scam, hoax, or the real deal, he’s
gonna work his way to the bottom of
the sordid tale, and hopefully end up
with an arcade game in the process.
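The snippet above targets Python 2. On Python 3, a rough standard-library stand-in for "parse the markup and return its text with entities decoded" can be sketched with html.parser. The class name TextExtractor and the function extract_text are hypothetical, and this is not a replacement for a full HTML parsing library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content; character references are decoded by the parser."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

text = extract_text("<p>Scam, hoax, or the real deal, he&#8217;s gonna work</p>")
print(text)
```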