What's the easiest way to escape HTML in Python?
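html.escape is the correct answer now; it used to be cgi.escape in Python before 3.2. It escapes:
< to &lt;
> to &gt;
& to &amp;
That is enough for HTML text content. For attribute values, quote characters must be escaped as well, which html.escape does by default.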
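A quick check of the replacements listed above, as a minimal sketch using only the standard library (the sample string is made up for illustration):

```python
import html

# Illustrative input, not taken from the original answer.
sample = "<b>Tom & Jerry</b>"

# html.escape replaces &, < and > with their named entities.
escaped = html.escape(sample)
print(escaped)  # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```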
EDIT: If you have non-ASCII characters that you also want to escape, for inclusion in another document that uses a different encoding, as Craig says, just use:
data.encode('ascii', 'xmlcharrefreplace')
Don't forget to decode data to unicode first, using whatever encoding it was encoded in.
However, in my experience that kind of encoding is unnecessary if you work with unicode all the time from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).
Example:
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
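The example above is Python 2 code. A rough Python 3 equivalent, assuming html.escape stands in for cgi.escape, might look like this (note that encode returns bytes in Python 3):

```python
import html

text = "<a>bá</a>"

# Escape the markup characters first, then encode to ASCII bytes,
# replacing each non-ASCII character with a numeric character reference.
ascii_bytes = html.escape(text).encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'&lt;a&gt;b&#225;&lt;/a&gt;'
```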
Also worth noting (thanks Greg) is the extra quote parameter that cgi.escape takes. With it set to True, cgi.escape also escapes double quote characters (") so you can use the resulting value in an XML/HTML attribute.
EDIT: Note that cgi.escape has been deprecated since Python 3.2 in favor of html.escape, which does the same except that quote defaults to True (and, with quote=True, single quotes are escaped too).
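To see the effect of the quote parameter, here is a small sketch with html.escape (the sample value is invented for illustration):

```python
import html

# Illustrative value containing both kinds of quotes.
value = 'say "hi" & \'bye\''

# quote=False leaves quote characters alone: fine for text nodes only.
text_safe = html.escape(value, quote=False)

# quote=True is the default: both " and ' are escaped as well,
# so the result can be placed inside a quoted attribute value.
attr_safe = html.escape(value)

print(text_safe)  # say "hi" &amp; 'bye'
print(attr_safe)  # say &quot;hi&quot; &amp; &#x27;bye&#x27;
```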
Escape html in python?
If the value being escaped might contain quotes, the best thing is to use the quoteattr function from xml.sax.saxutils: http://docs.python.org/library/xml.sax.utils.html#module-xml.sax.saxutils
This is referenced right beneath the docs on the cgi.escape() method.
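A minimal sketch of how quoteattr behaves (the attribute value and the span tag are made up for illustration). It escapes the value and also adds the surrounding quotes, choosing a quote style that avoids clashes where it can:

```python
from xml.sax.saxutils import quoteattr

# Illustrative attribute value containing a double quote and an ampersand.
title = 'He said "hello" & left'

# quoteattr returns the value already wrapped in quotes, so the
# template must not add its own quotes around it.
attr = quoteattr(title)
tag = "<span title={}>x</span>".format(attr)

print(attr)  # 'He said "hello" &amp; left'
print(tag)
```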
Escape special HTML characters in Python
In Python 3.2 and later, you can use the html.escape function, e.g.
>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:
The cgi module that comes with Python has an escape() function:
import cgi
s = cgi.escape( """& < >""" ) # s = "&amp; &lt; &gt;"
However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".
Here's a small snippet that will let you escape quotes and apostrophes as well:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}
def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c,c) for c in text)
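The same table can also drive str.translate in Python 3, which avoids the per-character generator. This is only a sketch of an alternative, not part of the wiki snippet, and the name html_escape_translate is made up here:

```python
# Same mapping as the wiki snippet: each special character to its entity.
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

# str.maketrans accepts a dict whose keys are single characters.
_translation = str.maketrans(html_escape_table)

def html_escape_translate(text):
    """Replace each special character with its entity in one pass."""
    return text.translate(_translation)

print(html_escape_translate("<a href='x'>\"&\"</a>"))
# &lt;a href=&apos;x&apos;&gt;&quot;&amp;&quot;&lt;/a&gt;
```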
You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module takes the same kind of arguments (with the table inverted) to decode a string.
from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}
def html_escape(text):
    return escape(text, html_escape_table)
def html_unescape(text):
    return unescape(text, html_unescape_table)
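A round trip through the saxutils functions, as a self-contained sketch (the names extra_escapes and extra_unescapes and the sample string are illustrative):

```python
from xml.sax.saxutils import escape, unescape

# Extra entities beyond the &, < and > that saxutils handles itself.
extra_escapes = {'"': "&quot;", "'": "&apos;"}
extra_unescapes = {v: k for k, v in extra_escapes.items()}

original = 'x < y & "z" isn\'t > w'
encoded = escape(original, extra_escapes)
decoded = unescape(encoded, extra_unescapes)

print(encoded)  # x &lt; y &amp; &quot;z&quot; isn&apos;t &gt; w
print(decoded == original)  # True
```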
HTML Escaping in Python
The Python standard library has the cgi module, which provides an escape function (deprecated since Python 3.2 in favor of html.escape).
See: http://docs.python.org/library/cgi.html#functions
How can I escape *all* characters into their corresponding html entity names and numbers in Python?
You don't really need a special function for what you are doing because the numbers you want are just the Unicode code points of the characters in question.
ord does pretty much what you want:
def encode(s):
    return ''.join('&#{:07d};'.format(ord(c)) for c in s)
Aesthetically, I prefer hex encoding:
def encode(s):
    return ''.join('&#x{:06x};'.format(ord(c)) for c in s)
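Numeric character references have the form &#N; (decimal) or &#xH; (hex). A small sketch showing both variants and checking that html.unescape reads them back (the function names encode_decimal and encode_hex are made up here to keep the two apart):

```python
import html

def encode_decimal(s):
    # Zero-padded decimal code point, e.g. 'A' -> '&#0000065;'
    return ''.join('&#{:07d};'.format(ord(c)) for c in s)

def encode_hex(s):
    # Zero-padded hexadecimal code point, e.g. 'A' -> '&#x000041;'
    return ''.join('&#x{:06x};'.format(ord(c)) for c in s)

print(encode_decimal('A'))  # &#0000065;
print(encode_hex('A'))      # &#x000041;

# Both forms decode back to the original text.
print(html.unescape(encode_decimal('héllo')))  # héllo
print(html.unescape(encode_hex('héllo')))      # héllo
```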
What is special about html.escape and html.unescape is that they support named entities in addition to the numerical ones. The goal of escaping is normally to turn your string into something that doesn't have characters special to the HTML parser, so escape only replaces a handful of characters. What you are doing additionally ensures that all characters in the string are ASCII.
If you want to force the use of named entities wherever possible, you can check the html.entities.codepoint2name mapping after applying ord to the characters:
from html.entities import codepoint2name
def encode(s):
    return ''.join('&{};'.format(codepoint2name.get(i, '#{}'.format(i))) for i in map(ord, s))
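For illustration, here is what that approach produces on a short made-up input, with a round trip through html.unescape as a sanity check:

```python
import html
from html.entities import codepoint2name

def encode(s):
    # Use the entity name when one exists, else a decimal reference.
    return ''.join(
        '&{};'.format(codepoint2name.get(i, '#{}'.format(i)))
        for i in map(ord, s)
    )

encoded = encode('é<a')
print(encoded)                 # &eacute;&lt;&#97;
print(html.unescape(encoded))  # é<a
```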
Decode HTML entities in Python string?
Python 3.4+
Use html.unescape():
import html
print(html.unescape('&pound;682m'))
FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.
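html.unescape handles named, decimal and hexadecimal references alike. A short sketch (the three input strings are illustrative spellings of the same text):

```python
import html

named = html.unescape('&pound;682m')    # named entity
decimal = html.unescape('&#163;682m')   # decimal reference
hexadecimal = html.unescape('&#xa3;682m')  # hexadecimal reference

print(named, decimal, hexadecimal)  # £682m £682m £682m
```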
Python 2.6-3.3
You can use HTMLParser.unescape() from the standard library:
- For Python 2.6-2.7 it's in HTMLParser
- For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
You can also use the six compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
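If one code base has to cover all of the versions above, the two approaches can be combined into a single fallback import. This is a sketch, not something from the original answer; it binds whichever implementation is available to the name unescape:

```python
try:
    # Python 3.4+: module-level function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Older versions only offer the method on a parser instance.
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```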