Escape Text for HTML

Which characters need to be escaped in HTML?

If you're inserting text content in your document in a location where text content is expected [1], you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand & and the element delimiter less-than and greater-than signs < >:

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#39;

In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.

If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
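As a rough sketch of that last case, the function below replaces every character outside the ASCII range with a hexadecimal numeric character reference. The name escapeNonAscii is made up for this illustration, and it assumes the five special characters are escaped separately.

```javascript
// Hypothetical helper: replace every non-ASCII character with a
// hexadecimal numeric character reference such as &#x1F600;.
// It does not touch &, <, >, " or '; escape those separately.
function escapeNonAscii(text) {
  // The "u" flag makes the regex match whole code points, so an emoji
  // stored as a surrogate pair becomes one reference rather than two.
  return text.replace(/[^\x00-\x7F]/gu, (ch) =>
    "&#x" + ch.codePointAt(0).toString(16).toUpperCase() + ";"
  );
}

console.log(escapeNonAscii("caf\u00E9 \u{1F600}"));
// caf&#xE9; &#x1F600;
```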

In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert          extra        space       without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.


[1] By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.
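One way to follow that advice, sketched here under assumptions: serialize the dynamic value as JSON, escape it like any other attribute value, and store it in a data-* attribute. The names escapeHtml, renderWidget and data-config are invented for this example.

```javascript
// Escapes the five characters recommended above.
function escapeHtml(unsafe) {
  return String(unsafe)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical renderer: the value travels as an escaped attribute,
// never as text inside a <script> element.
function renderWidget(config) {
  const json = JSON.stringify(config);
  return '<div id="widget" data-config="' + escapeHtml(json) + '"></div>';
}

console.log(renderWidget({ title: '</script><b>"hi"</b>' }));

// In the browser, a static script could then read the value back with:
//   const config = JSON.parse(document.getElementById("widget").dataset.config);
```

The browser decodes the character references when it parses the attribute, so the script receives the original JSON text without any manual unescaping.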

If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.

Can I escape HTML special chars in JavaScript?

Here's a solution that will work in practically every web browser:

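function escapeHtml(unsafe)
{
    // Replace & first so the ampersands added by later steps are not escaped again.
    return unsafe
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#039;");
}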

If you only support modern web browsers (2020+), then you can use the new replaceAll function:

const escapeHtml = (unsafe) => {
    return unsafe.replaceAll('&', '&amp;').replaceAll('<', '&lt;').replaceAll('>', '&gt;').replaceAll('"', '&quot;').replaceAll("'", '&#039;');
}
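For illustration, here is how either version might be used when building markup by hand. The escapeHtml below repeats the regex-based version so the snippet runs on its own; the comment and link values are made-up sample data.

```javascript
function escapeHtml(unsafe) {
  return unsafe
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#039;");
}

// Made-up sample data containing all five special characters.
const comment = 'I <3 "HTML" & friends';
const link = "/search?q=tom's&lang=en";

// Escape every dynamic value, both in text content and in attribute values.
const markup =
  '<p title="' + escapeHtml(comment) + '">' + escapeHtml(comment) + "</p>" +
  '<a href="' + escapeHtml(link) + '">search</a>';

console.log(markup);
```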

What does escaping HTML mean?

Escaping in HTML means replacing certain special characters with character references. Usually these are <, >, " and &, which all have special meanings in HTML.

Imagine you write

<b>hello, world</b>

The text will appear in bold: hello, world. But sometimes you don't want this behaviour, so you replace the < and > with &lt; and &gt;:

&lt;b&gt;hello world&lt;/b&gt;

This will be displayed as the literal text <b>hello world</b>.

How to properly escape text inside option tag?

If I understand your question properly, then as the snippet below demonstrates, you can escape HTML entities inside the <option> tag perfectly fine (tested on Firefox, Chrome & Safari).

So you are able to have the text </option> inside an option tag. This is done using the HTML entities that represent the < and > characters (&lt; and &gt;).

Essentially it is the same as having:

<option></option></option>

The difference is that the middle </option> is not treated as a closing tag; the browser displays it as literal text.

As a side note, the only limitation when escaping characters in <option> tags lies in the :before and :after CSS pseudo-elements. The W3C suggests that you cannot prepend or append content to the <option> tag using these pseudo-elements, although for some reason this does in fact work in Firefox 48.

#css-option:before {
  content: "any text";
}

<select>
  <option>&lt;/option&gt;</option>
  <option>&lt;/option&gt;</option>
  <option>&lt;/option&gt;</option>
  <option id="css-option"></option>
</select>
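If the option labels come from data rather than being typed by hand, the same escaping can be applied programmatically. This is only a sketch: renderOptions and ENTITIES are hypothetical names, and the labels are sample values.

```javascript
// Lookup table for the five characters that need escaping.
const ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

// Hypothetical helper: build <option> elements from arbitrary labels.
function renderOptions(labels) {
  return labels
    .map((label) => "<option>" + label.replace(/[&<>"']/g, (ch) => ENTITIES[ch]) + "</option>")
    .join("\n");
}

console.log(renderOptions(["</option>", "Tom & Jerry"]));
// <option>&lt;/option&gt;</option>
// <option>Tom &amp; Jerry</option>
```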

Escaping HTML strings with jQuery

Since you're using jQuery, you can just set the element's text with the .text() method:

// before:
// <div class="someClass">text</div>
var someHtmlString = "<script>alert('hi!');</script>";

// set a DIV's text:
$("div.someClass").text(someHtmlString);
// after:
// <div class="someClass">&lt;script&gt;alert('hi!');&lt;/script&gt;</div>

// get the text in a string:
var escaped = $("<div>").text(someHtmlString).html();
// value:
// &lt;script&gt;alert('hi!');&lt;/script&gt;

How to convert escape characters in HTML tags?

You can use strconv.Unquote() to do the conversion.

One thing you should be aware of is that strconv.Unquote() can only unquote strings that are in quotes (e.g. start and end with a quote char " or a back quote char `), so we have to add the quotes manually.

Example:

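package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Important to use backtick ` (raw string literal)
	// else the compiler will unquote it (interpreted string literal)!
	s := `\u003chtml\u003e`
	fmt.Println(s)
	// Unquote only accepts quoted input, so wrap s in double quotes.
	s2, err := strconv.Unquote(`"` + s + `"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(s2)
}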

Output:

\u003chtml\u003e
<html>

Note: To do HTML text escaping and unescaping, you can use the html package. Quoting its doc:

Package html provides functions for escaping and unescaping HTML text.

But the html package (specifically html.UnescapeString()) does not decode Unicode escape sequences of the form \uxxxx, only HTML character references such as &#decimal; or &#xHH;.

Example:

fmt.Println(html.UnescapeString(`\u003chtml\u003e`)) // wrong
fmt.Println(html.UnescapeString(`&#60;html&#62;`))   // good
fmt.Println(html.UnescapeString(`&#x3c;html&#x3e;`)) // good

Output:

\u003chtml\u003e
<html>
<html>

Note #2:

You should also note that if you write code like this:

s := "\u003chtml\u003e"

The compiler itself unquotes this string because it is an interpreted string literal, so you can't really test unquoting with it. To put the escaped form in the source, either use backticks to write a raw string literal, or double the backslashes in an interpreted string literal:

s := "\u003chtml\u003e" // Interpreted string literal (unquoted by the compiler!)
fmt.Println(s)

s2 := `\u003chtml\u003e` // Raw string literal (no unquoting will take place)
fmt.Println(s2)

s3 := "\\u003chtml\\u003e" // Double quoted interpreted string literal
// (unquoted by the compiler to be "single" quoted)
fmt.Println(s3)

Output:

<html>
\u003chtml\u003e
\u003chtml\u003e

Escaped characters appearing when appending to HTML element

The response is JSON, so you need to parse it before inserting it into the page.

let responseText = `"<li class='listitems'><a href='https:\\/\\/website.com\\/contact' class='listlinks'>Contact Us ▸<\\/a><\\/li>\\n"`;
let r = JSON.parse(responseText);
document.getElementById('mylist').insertAdjacentHTML("beforeend", r);

<ul id="mylist">
</ul>
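To see what the parse step changes, this sketch logs a shortened, made-up response before and after JSON.parse. It needs no DOM, so it can be tried in Node.js as well as in a browser console.

```javascript
// String.raw keeps the backslashes exactly as a server would send them.
const raw = String.raw`"<li><a href='https:\/\/website.com\/contact'>Contact Us<\/a><\/li>\n"`;

// JSON.parse removes the surrounding quotes and resolves \/ and \n.
const parsed = JSON.parse(raw);

console.log(raw);    // still contains the quotes and backslashes
console.log(parsed); // plain HTML, ready for insertAdjacentHTML
```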

