How to Get &curren to Display Literally, Not as an HTML Entity

How to get &curren to display literally, not as an HTML entity

Use the PHP function urlencode:

urlencode("https://site.com/bacon_report?Id=1&report=1&currentDimension=2&param=1")

will output

https%3A%2F%2Fsite.com%2Fbacon_report%3FId%3D1%26report%3D1%26currentDimension%3D2%26param%3D1 
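For comparison, the same result can be sketched in JavaScript, where encodeURIComponent plays a role similar to PHP's urlencode. The two differ in some details (for example, how a space is encoded) but agree for this URL. The variable names are illustrative only.

```javascript
// Illustrative only: the URL is the one from the PHP example above.
const rawUrl = "https://site.com/bacon_report?Id=1&report=1&currentDimension=2&param=1";

// encodeURIComponent percent-encodes ':', '/', '?', '=' and '&', so no
// "&curren" or "&para" sequence is left for an HTML parser to misread.
const encodedUrl = encodeURIComponent(rawUrl);

console.log(encodedUrl);
// https%3A%2F%2Fsite.com%2Fbacon_report%3FId%3D1%26report%3D1%26currentDimension%3D2%26param%3D1
```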

PHP &curren string turns into weird symbol

That's the entity code for the currency symbol being interpreted. If you're building your GET URL, you can solve it in various ways:

  • Use urlencode() on your query values:

    $s = 'page.com?' . urlencode("a=1&currentPage=2");

  • Use the entity for & itself:

    'page.com?a=1&amp;currentPage=2'

  • Or use your variable at the beginning so no & is required:

    'page.com?currentPage=2&a=1'
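The second option can be sketched in JavaScript as follows. This is a minimal illustration, not the original answer's code: escapeHtml is a hypothetical helper, and the parameter names are taken from the example above.

```javascript
// Hypothetical helper: escape the characters that are special in HTML
// text and in double-quoted attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the query string from key/value pairs; URLSearchParams
// percent-encodes the values and joins the pairs with '&'.
const query = new URLSearchParams({ a: "1", currentPage: "2" }).toString();

const url = "page.com?" + query; // the real URL, for redirects or requests
const href = escapeHtml(url);    // the form to embed in HTML markup

console.log(url);  // page.com?a=1&currentPage=2
console.log(href); // page.com?a=1&amp;currentPage=2
```

The unescaped form is what gets sent over the wire; the escaped form is only for writing the URL into HTML source, where the browser turns &amp; back into & when parsing.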

php prevent & creating codes in a string

You should urlencode the content of the variables:

$apistr     = 'https://remitradar.com/JsonRequests.aspx?action=getOnlineQuotes&companyKey='.urlencode($companyKey).'&countryFrom='.urlencode($countryFrom).'&countryTo='.urlencode($countryTo).'&currencyFrom='.urlencode($currencyFrom).'&currencyTo='.urlencode($currencyTo).'&amount='.urlencode($amount);

Then it may also become clearer whether they contain something you did not expect.

And no, you should not use &amp; to code an & in the URL.

To check the contents of the variables you may do:

var_dump($companyKey);
var_dump($countryFrom);
var_dump($countryTo);
var_dump($currencyFrom);
var_dump($currencyTo);
var_dump($amount);
var_dump($apistr);

If you echo the content of $apistr to your web browser, &curren will be displayed as the currency glyph ¤, because &curren; is a reserved HTML entity.

Try echoing it to your browser this way instead (but don't use this as the URL! The variable $apistr contains what you expect; only the debug echo output looked wrong because of how your browser rendered it):

echo htmlspecialchars($apistr);

Whenever you simply output a string of characters, the application rendering it is your web browser. You may also look at the source code of the web page containing the presumably wrong URL; you should see the correct characters there. The output of htmlspecialchars($apistr), however, would look wrong in the source code but correct in the rendered web page.

http_build_query encode `currency` key to `¤cy=USD`

You get 83 bytes:

string(83) "merchant_id=2005197514857165061&merchant_site_id=144033¤cy=USD&total_amount=1"

However, the string shown has only 77 characters, most of which can be safely assumed to be single-byte. That means that you are actually getting &currency rather than ¤cy. Thus the extraneous ¤ symbol must be the result of some further post-processing.
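The arithmetic can be checked with a short JavaScript sketch. It assumes the generated string contained &currency=USD and that the page is UTF-8; the variable names are illustrative.

```javascript
const prefix = "merchant_id=2005197514857165061&merchant_site_id=144033";
const suffix = "&total_amount=1";

// Assumed real output of http_build_query versus what the rendered page showed.
const generated = prefix + "&currency=USD" + suffix;
const displayed = prefix + "\u00A4cy=USD" + suffix; // \u00A4 is the ¤ glyph

// Number of bytes the string occupies when encoded as UTF-8.
const utf8Bytes = (s) => new TextEncoder().encode(s).length;

console.log(generated.length, utf8Bytes(generated)); // 83 83
console.log(displayed.length, utf8Bytes(displayed)); // 77 78
```

Only the &currency variant accounts for the 83 bytes that var_dump() reported.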

var_dump() output contains line feeds and you've shared it on a single line, which suggests you aren't looking at the generated HTML code but at the rendered view. In HTML, ¤ can be encoded as the &curren; entity.

For some reason, this entity appears to be treated differently than others:

<p>&curren;cy / &currency</p><p>&euro;pe / &europe</p>
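One likely explanation, offered here as an assumption rather than something the original answer states: HTML parsers accept a fixed set of legacy entity names (curren and para among them) without the trailing semicolon, while newer names such as euro require it. The sketch below imitates that rule for text content with a deliberately tiny, hypothetical table. A real parser uses the full list from the HTML standard and applies extra rules inside attribute values.

```javascript
// Tiny hypothetical tables; the real lists in the HTML standard are much longer.
const legacyNames = new Map([
  ["curren", "\u00A4"],
  ["para", "\u00B6"],
  ["copy", "\u00A9"],
  ["reg", "\u00AE"],
]);
const strictNames = new Map([["euro", "\u20AC"]]);

function decodeTextLikeBrowser(text) {
  return text.replace(/&([A-Za-z]+)(;?)/g, (match, run, semicolon) => {
    // With a semicolon, the whole name is looked up exactly.
    if (semicolon) {
      const exact = legacyNames.get(run) || strictNames.get(run);
      if (exact) return exact;
    }
    // Otherwise only a legacy name at the start of the run is decoded;
    // the remaining letters are kept as plain text.
    for (const [name, char] of legacyNames) {
      if (run.startsWith(name)) {
        return char + run.slice(name.length) + semicolon;
      }
    }
    return match; // not a recognised reference, leave untouched
  });
}

console.log(decodeTextLikeBrowser("&curren;cy / &currency")); // ¤cy / ¤cy
console.log(decodeTextLikeBrowser("&euro;pe / &europe"));     // €pe / &europe
```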

How to escape HTML

(See further down for an answer to the question as updated by comments from the OP below)

Can this be handled with HTML DOM and javascript?

No, once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.

This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):

<div>This &amp; That</div>

Once that's parsed into the DOM by the browser, the text within the div is This & That, because the &amp; has been interpreted at that point.

So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact, it's too late.

Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &amp;, you leave it alone? Or turn it into &amp;amp;? (Which is perfectly valid; it's how you show the actual string &amp; in an HTML page.)

What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.


Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:

This works for &amp;. But the same test page do not work for entities like &nbsp;, &reg;, &laquo; and many more.

That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.

We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.

The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16: one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for Far Eastern languages.

So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
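As a worked example of that conversion (a sketch only; codePointFromSurrogates is a hypothetical name), the pair 0xD83D, 0xDE00 combines into the single code point U+1F600, which matches what the built-in codePointAt reports:

```javascript
// Combine a high (first) and low (second) surrogate into one Unicode code point.
function codePointFromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const face = "\uD83D\uDE00"; // one character (U+1F600) stored as two 16-bit values
const manual = codePointFromSurrogates(face.charCodeAt(0), face.charCodeAt(1));

console.log(face.length);                    // 2
console.log(manual.toString(16));            // 1f600
console.log(manual === face.codePointAt(0)); // true
```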

Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:

// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
  "160": "&nbsp;",
  "161": "&iexcl;",
  "162": "&cent;",
  "163": "&pound;",
  "164": "&curren;",
  "165": "&yen;",
  "166": "&brvbar;",
  "167": "&sect;",
  "168": "&uml;",
  "169": "&copy;",
  // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
  "8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
};

// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
  // The regular expression below uses an alternation to look for a surrogate pair _or_
  // a single character that we might want to make an entity out of. The first part of the
  // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
  // alone, it searches for the surrogates. The second part of the alternation you can
  // adjust as you see fit, depending on how conservative you want to be. The example
  // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
  // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
  // it's not "printable ASCII" (in the old parlance), convert it. That's probably
  // overkill, but you said you wanted to make entities out of things, so... :-)
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
    var high, low, charValue, rep;

    // Get the character value, handling surrogate pairs
    if (match.length == 2) {
      // It's a surrogate pair, calculate the Unicode code point
      high = match.charCodeAt(0) - 0xD800;
      low = match.charCodeAt(1) - 0xDC00;
      charValue = (high * 0x400) + low + 0x10000;
    }
    else {
      // Not a surrogate pair, the value *is* the Unicode code point
      charValue = match.charCodeAt(0);
    }

    // See if we have a mapping for it
    rep = entityMap[charValue];
    if (!rep) {
      // No, use a numeric entity built from the code point
      rep = "&#" + charValue + ";";
    }

    // Return replacement
    return rep;
  });
}

You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.

I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is totally supplied without warranty of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for Far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.

Complete example page:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
  font-family: sans-serif;
}
#log p {
  margin: 0;
  padding: 0;
}
</style>
<script type='text/javascript'>

// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {

  // A map of the entities we want to handle.
  // The numbers on the left are the Unicode code point values; their
  // matching named entity strings are on the right.
  var entityMap = {
    "160": "&nbsp;",
    "161": "&iexcl;",
    "162": "&cent;",
    "163": "&pound;",
    "164": "&curren;",
    "165": "&yen;",
    "166": "&brvbar;",
    "167": "&sect;",
    "168": "&uml;",
    "169": "&copy;",
    // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
    "8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
  };

  // The function to do the work.
  // Accepts a string, returns a string with replacements made.
  function prepEntities(str) {
    // The regular expression below uses an alternation to look for a surrogate pair _or_
    // a single character that we might want to make an entity out of. The first part of the
    // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
    // alone, it searches for the surrogates. The second part of the alternation you can
    // adjust as you see fit, depending on how conservative you want to be. The example
    // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
    // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
    // it's not "printable ASCII" (in the old parlance), convert it. That's probably
    // overkill, but you said you wanted to make entities out of things, so... :-)
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
      var high, low, charValue, rep;

      // Get the character value, handling surrogate pairs
      if (match.length == 2) {
        // It's a surrogate pair, calculate the Unicode code point
        high = match.charCodeAt(0) - 0xD800;
        low = match.charCodeAt(1) - 0xDC00;
        charValue = (high * 0x400) + low + 0x10000;
      }
      else {
        // Not a surrogate pair, the value *is* the Unicode code point
        charValue = match.charCodeAt(0);
      }

      // See if we have a mapping for it
      rep = entityMap[charValue];
      if (!rep) {
        // No, use a numeric entity built from the code point
        rep = "&#" + charValue + ";";
      }

      // Return replacement
      return rep;
    });
  }

  // Return the function reference out of the scoping function to publish it
  return prepEntities;
})();

function go() {
  var d = document.getElementById('d1');
  var s = d.innerHTML;
  alert("Before: " + s);
  s = prepEntities(s);
  alert("After: " + s);
}

</script>
</head>
<body>
<div id='d1'>Copyright: © Yen: ¥ Cedilla: ¸ Surrogate pair: 𐀀</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>

There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left cedil out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.
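In engines that support ES2015, the surrogate handling can be left to the language: iterating a string with Array.from yields whole code points, and codePointAt returns their value. The following is a minimal sketch under that assumption, not part of the original answer. toNumericEntities is a hypothetical name, and it emits only numeric entities rather than consulting a named map.

```javascript
function toNumericEntities(str) {
  // Array.from walks the string by code point, so a surrogate pair
  // arrives in the callback as one two-unit string.
  return Array.from(str, function (ch) {
    var codePoint = ch.codePointAt(0);
    // Same policy as the regular expression above: convert control
    // characters (0 to 31) and everything from 128 upwards.
    if (codePoint <= 0x1F || codePoint >= 0x80) {
      return "&#" + codePoint + ";";
    }
    return ch;
  }).join("");
}

console.log(toNumericEntities("Cedilla: \u00B8"));   // Cedilla: &#184;
console.log(toNumericEntities("Pair: \uD800\uDC00")); // Pair: &#65536;
```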

Handling UTF characters in html form submission

Set the character encoding for the form page before you output the HTML.

header('Content-Type: text/html; charset=utf-8');

HTML character decoding in Objective-C / Cocoa Touch

Those are called character entity references. When they take the form &#<number>; they are called numeric entity references. Basically, it's a string representation of the character (code point) that should be substituted. In the case of &#38;, it represents the character with the value 38 in the ISO-8859-1 character encoding scheme, which is &.

The reason the ampersand has to be encoded in RSS is it's a reserved special character.

What you need to do is parse the string and replace the entities with the character matching the value between &# and ;. I don't know of any great ways to do this in Objective-C, but this Stack Overflow question might be of some help.
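The parsing step itself is language-independent. As a rough illustration in JavaScript rather than Objective-C (decodeNumericEntities is a hypothetical name, and only decimal references such as &#38; are handled, not hexadecimal ones or named entities):

```javascript
function decodeNumericEntities(text) {
  return text.replace(/&#(\d+);/g, function (match, digits) {
    var codePoint = Number(digits);
    // Values beyond the Unicode range are left untouched instead of throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities("Fish &#38; Chips")); // Fish & Chips
```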

Edit: Since answering this some two years ago there are some great solutions; see @Michael Waterfall's answer below.

Encode HTML entities in JavaScript

You can use regex to replace any character in a given unicode range with its html entity equivalent. The code would look something like this:

var encodedStr = rawStr.replace(/[\u00A0-\u9999<>\&]/g, function(i) {
  return '&#'+i.charCodeAt(0)+';';
});

This code will replace all characters in the given range (unicode 00A0 - 9999, as well as ampersand, greater & less than) with their html entity equivalents, which is simply &#nnn; where nnn is the unicode value we get from charCodeAt.
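A small usage sketch, wrapping the same regular expression in a function (encodeEntities is a name chosen here for illustration) and showing what falls inside and outside the range:

```javascript
function encodeEntities(rawStr) {
  return rawStr.replace(/[\u00A0-\u9999<>\&]/g, function (ch) {
    return "&#" + ch.charCodeAt(0) + ";";
  });
}

// \u00E9 is é and \u00E8 is è; both lie inside the 00A0-9999 range.
var encoded = encodeEntities("<b>caf\u00E9 & cr\u00E8me</b>");
console.log(encoded); // &#60;b&#62;caf&#233; &#38; cr&#232;me&#60;/b&#62;

// Surrogate halves (D800-DFFF) lie above 9999, so characters outside the
// Basic Multilingual Plane pass through unchanged.
var emoji = "\uD83D\uDE00";
console.log(encodeEntities(emoji) === emoji); // true
```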

See it in action here: http://jsfiddle.net/E3EqX/13/ (this example uses jQuery for element selectors used in the example. The base code itself, above, does not use jQuery)

Making these conversions does not solve all the problems: make sure you're using UTF-8 character encoding, and make sure your database is storing the strings in UTF-8. You may still see instances where the characters do not display correctly, depending on system font configuration and other issues out of your control.

Documentation

  • String.charCodeAt - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt
  • HTML Character entities - http://www.chucke.com/entities.html

