Which Characters Need to Be Escaped in HTML

Which characters need to be escaped in HTML?

If you're inserting text content in your document in a location where text content is expected¹, you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand & and the element delimiter less-than and greater-than signs < >:

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#39;

In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.
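As a minimal sketch of the "escape all five" advice, here is how it might look in Python using the standard library's html.escape. The wrapper name escape_html and the sample input are made up for illustration; note that html.escape writes the apostrophe as &#x27; rather than &#39;, which is an equivalent character reference.

```python
import html

def escape_html(text: str) -> str:
    # With quote=True, html.escape replaces &, <, > and both quote characters.
    return html.escape(text, quote=True)

# Hypothetical untrusted input used in both text and attribute position.
user_input = 'Tom & "Jerry" <b>it\'s</b>'
safe = escape_html(user_input)
print(f'<p title="{safe}">{safe}</p>')
```

Using the same function for element content and quoted attribute values keeps the rule simple, at the cost of a few bytes of escaping that is not strictly required.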

If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
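If you really are stuck with a limited encoding, one possible approach (sketched here in Python; the function name to_ascii_html is an assumption for this example) is to let the encoder emit numeric character references for anything it cannot represent. This step only handles unrepresentable characters, so the five special characters still need to be escaped separately beforehand.

```python
def to_ascii_html(text: str) -> bytes:
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;.
    return text.encode("ascii", errors="xmlcharrefreplace")

print(to_ascii_html("café 😀"))
```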

In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert extra space without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.

¹ By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.

If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.

What characters must be escaped in HTML 5?

The specification defines the syntax for normal elements as:

Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.

So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)

These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \u003c (an escape that is valid in both JSON and JavaScript string literals), the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
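The JSON-in-script replacement described above could be sketched like this in Python. The helper name json_for_script is hypothetical, and the sketch uses \u003c for the less-than sign because that escape is accepted by both JSON parsers and JavaScript. It assumes the result is placed inside a <script> element and nowhere else.

```python
import json

def json_for_script(value) -> str:
    # Serialize first, then neutralize the characters that are risky in <script>.
    text = json.dumps(value, ensure_ascii=False)
    return (
        text.replace("<", "\\u003c")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )

payload = {"comment": "</script><script>alert(1)</script>"}
print(f"<script>var data = {json_for_script(payload)};</script>")
```

Because the replacements only touch characters inside JSON strings, the output still parses back to the original value.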

What characters must be escaped in HTML5 attributes?

Characters that need escaping are:

Whatever character you use to delimit the attribute value (either " or ')
Ampersands unless the characters that follow them do not form a character reference
Characters which can't be represented by the current character encoding

I want to do the minimum amount of escaping necessary for the values to be correct and safe.

I recommend aiming to be simple over minimum. You are less likely to make a mistake that way.

Always (except inside <script> and <style> elements which are special cased) escape the five characters which can have special meaning in HTML: <, >, &, ", and '.
Use UTF-8 everywhere

These guidelines work inside attribute values, inside text nodes, for HTML 4, for HTML 5 and for XML (including XHTML).

What characters do I need to escape in XML documents?

If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
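For instance, a small sketch with Python's xml.etree.ElementTree (the element and attribute names are just placeholders borrowed from the examples in this answer) shows the library doing the escaping on output and undoing it on input, with no string concatenation involved:

```python
import xml.etree.ElementTree as ET

# Build the element from raw, unescaped strings.
element = ET.Element("valid", {"attribute": 'a "quoted" & <tagged> value'})
element.text = "1 < 2 & 3 > 2"

# The serializer inserts the character references where they are needed.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)

# Parsing the result gives back the original strings.
parsed = ET.fromstring(serialized)
print(parsed.text, parsed.get("attribute"))
```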

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
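The text and attribute rules above are roughly what Python's xml.sax.saxutils helpers implement, so they make a convenient sketch: escape handles &, < and > (it escapes > even though that is optional), and quoteattr picks the quote style so that as few quote characters as possible need escaping. The sample strings are invented for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Text content: quotes are left alone, > is escaped to be safe.
print(escape('"\'>'))

# Attribute values: quoteattr adds the delimiters itself.
print("<valid attribute=" + quoteattr("it's") + "/>")
print("<valid attribute=" + quoteattr('say "hi"') + "/>")
print("<valid attribute=" + quoteattr('both \' and "') + "/>")
```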

Comments

None of the five special characters may be escaped in comments; they are written literally:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters may be escaped in CDATA sections; they are written literally:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
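To see why escaping inside CDATA is a mistake, here is a quick check with Python's xml.etree.ElementTree (chosen only as a convenient parser for this sketch): the characters come through literally, and an entity written inside CDATA is not decoded.

```python
import xml.etree.ElementTree as ET

# The five special characters, unescaped, inside a CDATA section.
literal = ET.fromstring("""<valid><![CDATA["'<>&]]></valid>""")
print(literal.text)

# An "escaped" ampersand inside CDATA stays exactly as typed.
escaped = ET.fromstring("<valid><![CDATA[&amp;]]></valid>")
print(escaped.text)
```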

Processing instructions

None of the five special characters may be escaped in XML processing instructions; they are written literally:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

Should src be HTML-escaped in script tags in HTML?

Yes. When you have doubts you can use the W3C validator, which says & must be escaped into &amp; in this case.

Double-quoted attributes are parsed according to these rules. When a & is found,

Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").

And the Character reference in attribute value state consists of

Attempt to consume a character reference.

If nothing is returned, append a U+0026 AMPERSAND character (&) to the
current attribute's value.

Otherwise, append the returned character tokens to the current
attribute's value.

Finally, switch back to the attribute value state that switched into
this state.

Therefore, it would (probably) work too if you didn't escape &. However, it will produce a parse error during the consumption of the character reference:

If no match can be made, then no characters are consumed, and nothing
is returned. In this case, if the characters after the U+0026
AMPERSAND character (&) consist of a sequence of one or more
alphanumeric ASCII characters followed by a U+003B SEMICOLON character
(;), then this is a parse error.

Note that you should escape it if you want to be safe:

Certain points in the parsing algorithm are said to be parse errors.
The error handling for parse errors is well-defined (that's the
processing rules described throughout this specification), but user
agents, while parsing an HTML document, may abort the parser at the
first parse error that they encounter for which they do not wish to
apply the rules described in this specification.
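In practice this just means running the URL through the same HTML escaping as any other attribute value. A minimal Python sketch, where the function name script_tag and the URL are placeholders:

```python
import html

def script_tag(src: str) -> str:
    # Escape the URL for use inside a double-quoted attribute value.
    return f'<script src="{html.escape(src, quote=True)}"></script>'

print(script_tag("/js/app.js?v=2&lang=en"))
```

The browser decodes &amp; back to & before requesting the URL, so the server still sees the original query string.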

In what scopes do special HTML characters need to be escaped?

The rules vary depending on the version of HTML you are dealing with but are always more complex than is worth trying to remember.

The safe approach is "Use character references to represent the 5 HTML special characters everywhere except inside script and style elements", which makes you safe for everything except XHTML.

For XHTML the rule is the same with the additional proviso of "and use explicit CDATA sections in script and style elements".

The bigger question, which would explain this, is, when does the HTML parser look for &XXX; tokens and replace them?

As it parses the HTML (depending on what the current state of the tokeniser is ("inside start tag" and "inside attribute value" are examples of different states)).

Is it done once on the whole document

Unless you trigger additional HTML parsing (e.g. by setting innerHTML on an element).

or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB

Different rules apply in different places. The complete, current rules are (as I suggested in a comment) rather complex and would require a lot of work to extract from the HTML 5 parsing rules. This is why I suggest, if you are an HTML author and not a browser author, using the simpler rules of "Use character references unless you are in a script or style element".

-- different parsing rules seem to apply within <script>, so I may write && (for AND) and < for (LESS-THAN). So, what rules apply in which scopes?

In HTML 4 terms, script and style elements are defined as containing CDATA (where the only sequence of characters with special meaning in HTML is </, which terminates the CDATA section). Everywhere else in the document (including, counter-intuitively, attribute values that are defined as containing CDATA) & indicates the start of a character reference (although there might be a few exceptions based on what the character following the & is).

The HTML 5 rules are more complicated, but the basic principle of "It is safe and sane to use character references for &, <, >, " and ' everywhere except inside script and style elements" holds.

When do I need to escape an HTML string?

I can think of several possibilities to explain why sometimes a string is not escaped:

perhaps the original programmer was confident that at certain places the string had no special characters (however, in my opinion this would be bad programming practice; it costs very little to escape a string as protection against future changes)
the string was already escaped at that point in the code. You definitely don't want to escape a string twice; the user will end up seeing the escape sequence instead of the intended text.
The string was the actual html itself. You don't want to escape the html; you want the browser to process it!

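The double-escaping hazard from the second point is easy to reproduce. This short Python sketch (the sample string is arbitrary) escapes the same text twice and shows what the user would end up seeing:

```python
import html

once = html.escape("bread & butter")
twice = html.escape(once)  # the mistake: escaping an already-escaped string

print(once)
print(twice)
# What a browser would display for each after decoding once:
print(html.unescape(once))
print(html.unescape(twice))
```

A common way to avoid this is to keep strings unescaped internally and escape exactly once, at the point where they are written into the HTML.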
EDIT -
The reason for escaping is that special characters like & and < can end up causing the browser to display something other than what you intended. A bare & is technically an error in the html. Most browsers try to deal intelligently with such errors and will display them correctly in most cases. (This will almost certainly happen in your example text if the string were text in a <div>, for instance.) However, because it is bad markup, some browsers will not work well; assistive technologies (e.g., text-to-speech) may fail; and there may be other problems.

There are several cases that will fail despite the best efforts of the browser to recover from bad markup. If your sample string were an attribute value, escaping the quote marks would be absolutely required. There's no way that a browser is going to correctly handle something like:

<img alt="Sample Image"bread" & "butter"" ... >

The general rule is that any character that is not markup but might be confused as markup needs to be escaped.

Note that there are several contexts in which text can appear within an html document, and they have separate requirements for escaping. The following should be escaped:

all characters that have no representation in the character set of the document (unlikely if you are using UTF-8, but that's not always the case)
Within attribute values, quote marks (' or ", whichever one matches the delimiters used for the attribute value itself) and the ampersand (&), but not <
Within text nodes, only & and <
Within href values, characters that need escaping in a url (and sometimes these need to be doubly escaped so they are still escaped after the browser unescapes them once)
Within a CDATA block, generally nothing (at the HTML level).
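The href case is the one that most often trips people up, because two layers of escaping apply in a fixed order: first URL-encode the value, then HTML-escape the whole URL for the attribute. A rough Python sketch, in which search_link and the /search URL are invented for the example:

```python
import html
from urllib.parse import quote

def search_link(term: str) -> str:
    # Layer 1: percent-encode the value so & and spaces survive inside the URL.
    url = "/search?q=" + quote(term, safe="") + "&lang=en"
    # Layer 2: HTML-escape the finished URL for the attribute, and the label for text.
    return f'<a href="{html.escape(url, quote=True)}">{html.escape(term)}</a>'

print(search_link("bread & butter"))
```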

Finally, aside from the hazard of double-escaping, the cost of escaping all text is minimal: a tiny bit of extra processing and a few extra bytes on the network.

Special characters must be escaped: [>]

They're in your last two items:

    <li>>Over 3000 Oscars have been awarded</li>
    <li>>The statues are 8.5 pounds and 13.5 inches tall</li>

Strictly speaking though, you don't need to encode them in this context; doing so is always good advice, but what you have here isn't invalid.

But in any case, judging by the content they probably weren't supposed to be there in the first place and so your first instinct is probably to remove them; go ahead and do so.
