How to HTML Encode/Escape a String? Is There a Built-In

The h helper method:

<%=h "<p> will be preserved" %>

Escaping HTML strings with jQuery

Since you're using jQuery, you can just set the element's text with .text():

// before:
// <div class="someClass">text</div>
var someHtmlString = "<script>alert('hi!');</script>";

// set a DIV's text:
$("div.someClass").text(someHtmlString);
// after:
// <div class="someClass">&lt;script&gt;alert('hi!');&lt;/script&gt;</div>

// get the text in a string:
var escaped = $("<div>").text(someHtmlString).html();
// value:
// &lt;script&gt;alert('hi!');&lt;/script&gt;

PHP Escape a string if it hasn't already been escaped with entities

No one seems to be answering your actual question, so I will.

How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?

It's impossible. What if I'm making an educational post about HTML entities and I want to actually print this on the screen:

The Lion&#8217;s Pride

... it would need to be encoded as...

The Lion&amp;#8217;s Pride 

But what if that was the actual string we wanted to print on the screen? ... and so on.
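
The ambiguity can be made concrete with a small sketch. It is written in JavaScript rather than PHP purely for illustration, and escapeForDemo is a hypothetical helper, not a library function. Each pass produces a perfectly valid string, so nothing in the output tells you how many times it has already been escaped:

```javascript
// Hypothetical helper: escapes the five HTML-significant characters.
function escapeForDemo(text) {
  return text
    .replace(/&/g, "&amp;") // ampersand first, so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const rawLion = "The Lion&#8217;s Pride";
const escapedOnce = escapeForDemo(rawLion);      // "The Lion&amp;#8217;s Pride"
const escapedTwice = escapeForDemo(escapedOnce); // "The Lion&amp;amp;#8217;s Pride"

console.log(escapedOnce);
console.log(escapedTwice);
```

All three strings are legitimate things to want on screen, which is why only the contract with whoever produced the string can tell you which one you are holding.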


Bottom line is, you have to know what you've been given and work from there – which is where the advice from the other answers comes in – which is still just a workaround.

What if they give you double-encoded strings? What if they start wrapping the HTML-encoded strings in XML? And then wrap that in JSON? And then the JSON is converted to binary strings? The possibilities are endless.

It's not impossible for the API you depend on to suddenly switch the output type, but it's also a pretty big violation of the original contract with your users. To some extent, you have to put some trust in the API to do what it says it's going to do. Unit/Integration tests make up the rest of the trust.

And because you could never write a program that works for any possible change they could make, it's senseless to try to anticipate any change at all.

What is the best way to escape HTML-specific characters in a string (PowerShell)?

There's a class that will do this in System.Web.

Add-Type -AssemblyName System.Web
[System.Web.HttpUtility]::HtmlEncode('something <something else>')

You can even go the other way:

[System.Web.HttpUtility]::HtmlDecode('something &lt;something else&gt;')
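
For readers without .NET at hand, the decoding direction can be sketched in JavaScript. This is an illustration only: unescapeBasicHtml is a hypothetical helper that understands just the five basic entities, whereas HtmlDecode handles many more.

```javascript
// Hypothetical helper; decodes only &lt; &gt; &quot; &#39;/&#039; and &amp;.
function unescapeBasicHtml(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#0*39;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
}

console.log(unescapeBasicHtml("something &lt;something else&gt;"));
// something <something else>
```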

Short way to escape HTML in Bash?

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
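
The order of the substitutions matters: the ampersand has to be replaced first, because every other replacement introduces new ampersands. A small JavaScript sketch (illustrative only, with made-up variable names) shows what goes wrong otherwise:

```javascript
const sample = "a < b & c";

// Ampersand first, the same order the sed script uses.
const ampFirst = sample.replace(/&/g, "&amp;").replace(/</g, "&lt;");

// Ampersand last: the "&" introduced by "&lt;" gets escaped a second time.
const ampLast = sample.replace(/</g, "&lt;").replace(/&/g, "&amp;");

console.log(ampFirst); // a &lt; b &amp; c
console.log(ampLast);  // a &amp;lt; b &amp; c
```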

How to escape HTML

(See further down for an answer to the question as updated by comments from the OP below)

Can this be handled with HTML DOM and javascript?

No, once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.

This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):

<div>This &amp; That</div>

Once that's parsed into the DOM by the browser, the text within the div is This & That, because the &amp; has been interpreted at that point.

So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact; it's too late.

Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &amp;, you leave it alone? Or turn it into &amp;amp;? (Which is perfectly valid; it's how you show the actual string &amp; in an HTML page.)

What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.


Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:

This works for &. But the same test page do not work for entities like &nbsp;, &reg;, &laquo; and many more.

That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.

We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.

The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16: one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for Far Eastern languages.

So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
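
As a quick sanity check of that arithmetic, here is a minimal sketch. The helper name codePointFromSurrogates is made up for this example, and the comparison against String.prototype.codePointAt assumes an ES2015 or later engine:

```javascript
// Hypothetical helper; the arithmetic is the standard UTF-16 decoding formula.
function codePointFromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const pair = "\uD83D\uDE00"; // U+1F600, stored as two 16-bit units
const fromFormula = codePointFromSurrogates(pair.charCodeAt(0), pair.charCodeAt(1));

console.log(pair.length);                        // 2 ("characters" in the JavaScript sense)
console.log(fromFormula.toString(16));           // 1f600
console.log(pair.codePointAt(0) === fromFormula); // true
```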

Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:

// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
    "160": "&nbsp;",
    "161": "&iexcl;",
    "162": "&cent;",
    "163": "&pound;",
    "164": "&curren;",
    "165": "&yen;",
    "166": "&brvbar;",
    "167": "&sect;",
    "168": "&uml;",
    "169": "&copy;",
    // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
    "8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
};

// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
    // The regular expression below uses an alternation to look for a surrogate pair _or_
    // a single character that we might want to make an entity out of. The first part of the
    // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
    // alone, it searches for the surrogates. The second part of the alternation you can
    // adjust as you see fit, depending on how conservative you want to be. The example
    // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
    // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
    // it's not "printable ASCII" (in the old parlance), convert it. That's probably
    // overkill, but you said you wanted to make entities out of things, so... :-)
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
        var high, low, charValue, rep;

        // Get the character value, handling surrogate pairs
        if (match.length == 2) {
            // It's a surrogate pair, calculate the Unicode code point
            high = match.charCodeAt(0) - 0xD800;
            low = match.charCodeAt(1) - 0xDC00;
            charValue = (high * 0x400) + low + 0x10000;
        }
        else {
            // Not a surrogate pair, the value *is* the Unicode code point
            charValue = match.charCodeAt(0);
        }

        // See if we have a mapping for it
        rep = entityMap[charValue];
        if (!rep) {
            // No mapping, so fall back to a decimal numeric entity
            rep = "&#" + charValue + ";";
        }

        // Return replacement
        return rep;
    });
}

You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.

I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is totally supplied without warranty of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for Far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.
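
In engines that support ES2015 regular expressions, the surrogate handling can be delegated to the "u" flag and codePointAt. The following is only a sketch of that idea, not a drop-in replacement: prepEntitiesModern and its tiny namedEntities map are names invented for this example, and lone (unpaired) surrogates are not treated specially.

```javascript
// Hypothetical, deliberately tiny map; extend it as needed.
const namedEntities = { 160: "&nbsp;", 165: "&yen;", 169: "&copy;", 8364: "&euro;" };

function prepEntitiesModern(str) {
  // With the "u" flag the regex matches whole code points, so a surrogate
  // pair arrives in the callback as a single match.
  return str.replace(/[\u0000-\u001F\u0080-\u{10FFFF}]/gu, (match) => {
    const codePoint = match.codePointAt(0);
    // Prefer a named entity, otherwise fall back to a decimal numeric entity.
    return namedEntities[codePoint] || "&#" + codePoint + ";";
  });
}

console.log(prepEntitiesModern("\u00A9 \u00B8 \u{10000}"));
// &copy; &#184; &#65536;
```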

Complete example page:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
#log p {
    margin: 0;
    padding: 0;
}
</style>
<script type='text/javascript'>

// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {

    // A map of the entities we want to handle.
    // The numbers on the left are the Unicode code point values; their
    // matching named entity strings are on the right.
    var entityMap = {
        "160": "&nbsp;",
        "161": "&iexcl;",
        "162": "&cent;",
        "163": "&pound;",
        "164": "&curren;",
        "165": "&yen;",
        "166": "&brvbar;",
        "167": "&sect;",
        "168": "&uml;",
        "169": "&copy;",
        // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
        "8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
    };

    // The function to do the work.
    // Accepts a string, returns a string with replacements made.
    function prepEntities(str) {
        // The regular expression below uses an alternation to look for a surrogate pair _or_
        // a single character that we might want to make an entity out of. The first part of the
        // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
        // alone, it searches for the surrogates. The second part of the alternation you can
        // adjust as you see fit, depending on how conservative you want to be. The example
        // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
        // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
        // it's not "printable ASCII" (in the old parlance), convert it. That's probably
        // overkill, but you said you wanted to make entities out of things, so... :-)
        return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
            var high, low, charValue, rep;

            // Get the character value, handling surrogate pairs
            if (match.length == 2) {
                // It's a surrogate pair, calculate the Unicode code point
                high = match.charCodeAt(0) - 0xD800;
                low = match.charCodeAt(1) - 0xDC00;
                charValue = (high * 0x400) + low + 0x10000;
            }
            else {
                // Not a surrogate pair, the value *is* the Unicode code point
                charValue = match.charCodeAt(0);
            }

            // See if we have a mapping for it
            rep = entityMap[charValue];
            if (!rep) {
                // No mapping, so fall back to a decimal numeric entity
                rep = "&#" + charValue + ";";
            }

            // Return replacement
            return rep;
        });
    }

    // Return the function reference out of the scoping function to publish it
    return prepEntities;
})();

function go() {
    var d = document.getElementById('d1');
    var s = d.innerHTML;
    alert("Before: " + s);
    s = prepEntities(s);
    alert("After: " + s);
}

</script>
</head>
<body>
<div id='d1'>Copyright: &copy; Yen: &yen; Cedilla: &cedil; Surrogate pair: &#65536;</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>

There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left cedil out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.

Is there any built-in function in .NET Framework which encodes a string to a valid XML unicode?

XmlTextWriter is what you're looking for. You should avoid using any of the HtmlEncode methods (there are several) unless you're actually encoding your text for use in an HTML document. If you're encoding text for use in an XML document (including XHTML), you should use XmlTextWriter.

Something like this should do the trick:

using System;
using System.IO;
using System.Xml;

StringWriter strWriter = new StringWriter();
XmlTextWriter xmlWriter = new XmlTextWriter(strWriter);
// C# string literals need double quotes; single quotes are for char literals
xmlWriter.WriteString("Your String Goes here, < and >, as well as other special chars will be properly encoded");
xmlWriter.Flush();

Console.WriteLine("XML Text: {0}", strWriter.ToString());

See also this other Stack Overflow discussion.

Can I escape HTML special chars in JavaScript?

Here's a solution that will work in practically every web browser:

function escapeHtml(unsafe)
{
    return unsafe
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#039;");
}

If you only support modern web browsers (2020+), then you can use the new replaceAll function:

const escapeHtml = (unsafe) => {
    return unsafe.replaceAll('&', '&amp;').replaceAll('<', '&lt;').replaceAll('>', '&gt;').replaceAll('"', '&quot;').replaceAll("'", '&#039;');
}
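
If the dependence on replacement order is a concern, one alternative is a single pass with a lookup table. This is a sketch only; escapeHtmlSinglePass and HTML_ESCAPES are names invented here, and the String() coercion is an added assumption so that non-string input does not throw:

```javascript
// Lookup table for the five characters handled by the functions above.
const HTML_ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#039;" };

function escapeHtmlSinglePass(unsafe) {
  // Each character is visited once, so output is never re-escaped.
  return String(unsafe).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtmlSinglePass('<a href="x">Tom & Jerry\'s</a>'));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#039;s&lt;/a&gt;
```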

