Decode HTML entities in Python string?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.
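As a quick illustration of what html.unescape() accepts, here is a minimal sketch that decodes a named, a decimal, and a hexadecimal reference to the same character. The sample strings are made up for the example.

```python
import html

# Three spellings of the pound sign: named, decimal and hexadecimal references.
samples = ['&pound;682m', '&#163;682m', '&#xA3;682m']
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Text without any entities passes through unchanged.
plain = html.unescape('no entities here')
print(plain)
```

All three inputs should come out as the same string, so the caller does not need to know which spelling the source used.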
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in HTMLParser
- For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
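If the same code has to run on every version mentioned above, one possible approach is to resolve a single unescape callable at import time. This is only a sketch; the name unescape at module level and the nested fallback order are choices made for this example, not something the standard library prescribes.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bound method of a throwaway parser instance
    unescape = HTMLParser().unescape

result = unescape('&pound;682m')
print(result)
```

Trying html.unescape first means the deprecated method is never touched on versions where the function exists.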
Convert HTML entities in plain text to characters
To decode HTML entities like the one in your example, you could use the following code.
import html
html_encoded = 'Motorists could be charged for every mile they drive to raise &euro;35bn'
html_decoded = html.unescape(html_encoded)
print(html_decoded)
How do I unescape HTML entities in a string in Python 3.1?
You could use the function html.unescape:
In Python 3.4+ (thanks to J.F. Sebastian for the update):
import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'
html.unescape('&quot;')
# '"'
In Python 3.3 or older:
import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')
In Python 2:
import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
Decoding HTML entities with Python
Try this:
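import re
def _callback(matches):
    code_point = matches.group(1)
    try:
        return unichr(int(code_point))
    except (ValueError, OverflowError):
        # Not a valid code point: leave the reference as it was.
        return matches.group(0)
def decode_unicode_references(data):
    # Python 2 code (unichr, print statement).
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)
data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
print decode_unicode_references(data)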
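The snippet above targets Python 2. A rough Python 3 equivalent might look like the sketch below; it swaps unichr for chr and, as a deliberate choice for this example, returns the whole original match when the number is not a valid code point. The function name decode_numeric_references is hypothetical. It only handles decimal references, so html.unescape() remains the more general tool.

```python
import re

def _callback(match):
    try:
        return chr(int(match.group(1)))
    except (ValueError, OverflowError):
        # Code point out of range: keep the original text untouched.
        return match.group(0)

def decode_numeric_references(data):
    # Matches "&#" + digits, terminated by ";" or followed by whitespace.
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
decoded = decode_numeric_references(data)
print(decoded)
```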
Decode / encode html escaped special characters in Python
You have a Mojibake, double-encoded data. You not only have HTML entities, your data was incorrectly decoded from bytes to text before the HTML entities were applied.
For your example, the two entities &Atilde; and &permil; decode to the Unicode characters Ã and ‰. Those two characters are also known (from the Unicode standard) as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being misinterpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).
If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence; UTF-8 -> Latin variant Mojibake typically ends up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly, and sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:
>>> import html
>>> broken = "&quot;Coup d'&Atilde;&permil;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'Ã‰tat"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
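The manual steps can be wrapped in a small helper that gives up gracefully when the text does not round-trip. This is a sketch under the assumption that the wrong codec was CP-1252; the function name repair_mojibake and its wrong_encoding parameter are hypothetical, and the fallback behaviour is a choice made for this example.

```python
import html

def repair_mojibake(text, wrong_encoding="cp1252"):
    """Decode HTML entities, then undo a UTF-8-read-as-wrong_encoding mistake."""
    unescaped = html.unescape(text)
    try:
        return unescaped.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not round-trip, so it was probably not mojibake
        # (or bytes were lost); return the entity-decoded text as-is.
        return unescaped

fixed = repair_mojibake("&quot;Coup d'&Atilde;&permil;tat&quot;")
untouched = repair_mojibake("caf&eacute;")
print(fixed)
print(untouched)
```

Correctly decoded text such as "café" fails the UTF-8 decode step and is returned unchanged, which keeps the helper from damaging input that was never broken.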
A better option is to use the special ftfy library (the name is an acronym for Fixed That For You), which uses detailed knowledge about how to recognize such mistakes and undo the damage. ftfy also handles the HTML-entity decoding, all in one step:
>>> import ftfy
>>> ftfy.fix_text("&quot;Coup d'&Atilde;&permil;tat&quot;")
'"Coup d\'État"'
The library includes sloppy variants of text codes often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces so it knows what to do to reverse the damage.
Decoding html entities in python2
Use HTMLParser.HTMLParser.unescape:
>>> import HTMLParser
>>> parser = HTMLParser.HTMLParser()
>>> parser.unescape('&iacute;')
u'\xed'
>>> print parser.unescape('&iacute;')
í
In Python 3.x (before 3.9, where this method was removed):
>>> import html.parser
>>> parser = html.parser.HTMLParser()
>>> parser.unescape('&iacute;')
'í'
How do I perform HTML decoding/encoding using Python/Django?
Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;')
    )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s
unescaped = html_decode(my_string)
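To see why the ampersand has to be replaced last, consider text that contains a literal "&lt;" and has been escaped once. The sketch below runs the same replacement loop in both orders; the helper decode_with and the constant HTML_CODES are hypothetical names used only for this illustration.

```python
HTML_CODES = (
    ("'", '&#39;'),
    ('"', '&quot;'),
    ('>', '&gt;'),
    ('<', '&lt;'),
    ('&', '&amp;'),
)

def decode_with(codes, s):
    # Apply each (character, entity) replacement in the given order.
    for char, entity in codes:
        s = s.replace(entity, char)
    return s

# The literal text "&lt;" after one round of escaping.
escaped = '&amp;lt;'

# Ampersand handled last: one level of escaping is removed.
correct = decode_with(HTML_CODES, escaped)
# Ampersand handled first: the result is decoded a second time by accident.
wrong = decode_with(tuple(reversed(HTML_CODES)), escaped)

print(correct)
print(wrong)
```

With the ampersand first, the freshly produced "&lt;" is picked up by the later replacement and collapses to "<", which is not what the original text said.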
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.0-3.8 (HTMLParser.unescape was removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}
How do I encode specific characters to HTML in python
I believe the solution to this post would give you what you need:
Convert HTML entities to Unicode and vice versa
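For the encoding direction, a minimal standard-library sketch (Python 3) could combine html.escape() with the xmlcharrefreplace error handler. The sample text is made up, and whether non-ASCII characters need to become numeric references at all depends on the encoding your page is served in.

```python
import html

text = 'Caf\u00e9 <b>"R&D"</b> \u00a3682m'

# Escape only the characters that are special in HTML markup.
markup_safe = html.escape(text)

# Additionally turn every non-ASCII character into a numeric reference.
ascii_only = markup_safe.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# Decoding brings back the original text.
round_trip = html.unescape(ascii_only)

print(ascii_only)
print(round_trip)
```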