What's the Easiest Way to Escape HTML in Python

What's the easiest way to escape HTML in Python?

html.escape is the correct answer now; before Python 3.2 it was cgi.escape. Both escape:

  • < to &lt;
  • > to &gt;
  • & to &amp;

That is enough for text placed between tags. Values placed inside attributes also need their quotes escaped.
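As a quick illustration of the replacements listed above, here is a minimal sketch using html.escape on Python 3.2 or later. The sample string and variable names are made up for the example. Note that html.escape also escapes both quote characters unless quote=False is passed.

```python
import html

# Hypothetical sample input containing all five special characters.
text = '<a href="x">Tom & Jerry\'s</a>'

# Default (quote=True): &, <, >, " and ' are all replaced.
escaped_all = html.escape(text)

# quote=False: only &, < and > are replaced.
escaped_text_only = html.escape(text, quote=False)

print(escaped_all)
print(escaped_text_only)
```

The first form is the safer default when the result may end up inside an attribute value.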

EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don't forget to decode data to unicode first, using whatever encoding it was encoded in.

However, in my experience that kind of encoding is unnecessary if you work with unicode throughout, from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
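The example above is Python 2. A rough Python 3 equivalent, assuming html.escape as the replacement for cgi.escape, might look like this. In Python 3 a str is already unicode, so no decode step is needed, and encode returns bytes.

```python
import html

text = '<a>bá</a>'

# Escape the markup characters first, then turn every non-ASCII
# character into a numeric character reference.
encoded = html.escape(text).encode('ascii', 'xmlcharrefreplace')

print(encoded)  # b'&lt;a&gt;b&#225;&lt;/a&gt;'

# html.unescape reverses both steps.
round_trip = html.unescape(encoded.decode('ascii'))
```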

Also worth noting (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in an XML/HTML attribute.

EDIT: Note that cgi.escape was deprecated in Python 3.2 in favor of html.escape (and removed in Python 3.8). html.escape does the same, except that quote defaults to True and it then escapes single quotes (') as well.

Escape html in python?

If the value being escaped might contain quotes, the best option is the quoteattr function: http://docs.python.org/library/xml.sax.utils.html#module-xml.sax.saxutils

This is referenced right beneath the docs on the cgi.escape() method.
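To show what quoteattr does, here is a small sketch with made-up sample values. It escapes &, < and >, then wraps the value in whichever quote character it can use without further escaping, and falls back to &quot; only when the value contains both kinds of quote.

```python
from xml.sax.saxutils import quoteattr

# No quotes in the value: wrapped in double quotes.
plain = quoteattr('Tom & Jerry')

# Double quotes only: wrapped in single quotes, left as they are.
with_double = quoteattr('say "hi"')

# Both kinds of quote: wrapped in double quotes, " becomes &quot;.
with_both = quoteattr('say "hi" to \'Bob\'')

# The result already includes its surrounding quotes.
tag = '<input value={}>'.format(with_double)
print(tag)
```

Because the returned string carries its own quotes, it is dropped into the tag without adding any around it.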

Escape special HTML characters in Python

In Python 3.2 and later, you can use the html.escape function, e.g.

>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function:

import cgi

s = cgi.escape( """& < >""" ) # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".



Here's a small snippet that will let you escape quotes and apostrophes as well:

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)



You can also use escape() from xml.sax.saxutils to escape html. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.

from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)

HTML Escaping in Python

The Python standard library has a cgi module, which provides an escape function.

See: http://docs.python.org/library/cgi.html#functions

How can I escape *all* characters into their corresponding html entity names and numbers in Python?

You don't really need a special function for what you are doing because the numbers you want are just the Unicode code points of the characters in question.

ord does pretty much what you want:

def encode(s):
    return ''.join('&#{:07d};'.format(ord(c)) for c in s)

Aesthetically, I prefer hex encoding:

def encode(s):
    return ''.join('&#x{:06x};'.format(ord(c)) for c in s)

What is special about html.escape and html.unescape is that they support named entities in addition to the numerical ones. The goal of escaping is normally to turn your string into something that doesn't have characters special to the HTML parser, so escape only replaces a handful of characters. What you are doing ensures that all characters in the string are ASCII in addition to that.

If you want to force the use of named entities wherever possible, you can check the html.entities.codepoint2name mapping after applying ord to the characters:

from html.entities import codepoint2name

def encode(s):
    return ''.join('&{};'.format(codepoint2name.get(i, '#{}'.format(i))) for i in map(ord, s))
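A short usage sketch of the named-entity approach, with a made-up sample string. Characters that have an entry in codepoint2name come out as named entities, everything else falls back to a decimal reference, and html.unescape recovers the original text.

```python
from html import unescape
from html.entities import codepoint2name

def encode(s):
    # Use the entity name when one exists, else a decimal reference.
    return ''.join(
        '&{};'.format(codepoint2name.get(i, '#{}'.format(i)))
        for i in map(ord, s)
    )

sample = 'é<a'          # hypothetical input
encoded = encode(sample)
print(encoded)          # &eacute;&lt;&#97;

decoded = unescape(encoded)
```

Plain letters such as "a" have no named entity, which is why they end up as numeric references here.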

Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
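For reference, a minimal sketch of html.unescape handling the three reference forms (named, decimal and hexadecimal). The sample strings are made up for the example.

```python
import html

# All three spellings of the pound sign decode to the same character.
named = html.unescape('&pound;682m')
decimal = html.unescape('&#163;682m')
hexadecimal = html.unescape('&#xa3;682m')

print(named, decimal, hexadecimal)

# html.unescape also reverses html.escape.
original = '<b>"x" & \'y\'</b>'
round_trip = html.unescape(html.escape(original))
```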


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3 it's in html.parser
>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

