HTML Encoding Issues - "Â" Character Showing Up Instead of "&nbsp;"

HTML encoding issues - Â character showing up instead of &nbsp;

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character.

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2 0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
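The byte arithmetic above can be checked in a few lines. This is only an illustrative sketch in Python (chosen here as a neutral way to inspect bytes; the variable names are made up for the example):

```python
# U+00A0 NO-BREAK SPACE, the character an HTML parser produces from &nbsp;
nbsp = "\u00a0"

# One byte in ISO-8859-1, two bytes in UTF-8.
latin1_bytes = nbsp.encode("iso-8859-1")
utf8_bytes = nbsp.encode("utf-8")

# Misreading the UTF-8 bytes as ISO-8859-1 yields two characters:
# U+00C2 (the visible "Â") followed by U+00A0 (the easy-to-miss nbsp).
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), [hex(ord(c)) for c in misread])
```

If the second character is missing from your output, the bytes were altered somewhere else in the pipeline.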

What's the regexp, and how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.

Â character showing up instead of &nbsp;

Found it!

@$doc->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8'));

This answer explains the issue and gives the workaround above:

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
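Why converting to character references sidesteps the problem can be sketched as follows. This is a Python analogue of the idea, not the PHP implementation: it assumes numeric references only (the exact references `mb_convert_encoding` emits may differ), and the sample string is made up.

```python
import html

text = "caf\u00e9\u00a0\u00bb"  # sample string with non-ASCII characters

# Replace every non-ASCII character with a numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# The result is pure ASCII, so decoding it as ISO-8859-1 or as UTF-8
# gives identical markup; the parser's default charset no longer matters.
as_latin1 = ascii_bytes.decode("iso-8859-1")
as_utf8 = ascii_bytes.decode("utf-8")

# Resolving the references recovers the original characters.
restored = html.unescape(as_latin1)

print(ascii_bytes, restored == text)
```

Because only ASCII bytes reach the parser, it does not matter which single-byte or UTF-8 encoding it assumes for the input.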

’ showing on page instead of '

Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.

Or use the &rsquo; character reference.

What is this character ( Â ) and how do I remove it with PHP?

"Latin 1" is your problem here. A Latin-1 code page can represent only 256 characters, whereas Unicode (which UTF-8 encodes) defines well over 100,000, so most of the characters available to a web page cannot be stored in Latin-1.

For your immediate problem you should be able to do:

$clean = str_replace(chr(194), " ", $dirty);

However, I would switch your database to use UTF-8 ASAP, as the problem will almost certainly recur.
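What the byte replacement above does to a string can be sketched as follows. This is a Python illustration, assuming the dirty value holds UTF-8 bytes; the sample data and the names `blunt` and `targeted` are hypothetical.

```python
# "price:" + nbsp + "£5", stored as UTF-8 bytes: b"price:\xc2\xa0\xc2\xa35"
dirty = "price:\u00a0\u00a35".encode("utf-8")

# Same effect as str_replace(chr(194), " ", $dirty):
# every 0xC2 byte becomes a space, including the one that belongs to "£".
blunt = dirty.replace(b"\xc2", b" ")

# Narrower variant: replace only the two-byte UTF-8 nbsp sequence.
targeted = dirty.replace(b"\xc2\xa0", b" ")

print(blunt)
print(targeted.decode("utf-8"))
```

Under that assumption the blunt replacement also breaks other characters whose UTF-8 form starts with 0xC2 (U+0080 to U+00BF), which is one more reason to treat it as a stopgap until the storage encoding is fixed.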

Why am I printing Â» instead of »?

Why is Â being added to it?

Because your stylesheet is saved as UTF-8, but the browser is decoding it using Windows-1252. This is probably because the page that's referencing the stylesheet has no declared encoding and the browser is arbitrarily guessing Windows-1252, which is typically the default encoding on Western European locales. The byte sequence 0xC2 0xBB represents » in UTF-8 but Â» in Windows-1252.

Adding the <meta charset> declaration in Akjm's answer to the page(s) that reference the stylesheet should make this work. If you can't do this (for example because you are making a stylesheet that might be referenced by other people's pages which could be in any encoding), alternatives are:

  1. encoding the character using CSS backslash-escapes, as in @RobFonseca's answer. (The HTML character reference syntax in @Akjm's answer is not effective here.)

  2. putting the rule @charset "utf-8"; at the top of the stylesheet to tell the browser that the stylesheet has its own encoding, independently of whatever the page uses

  3. setting the web server to serve the stylesheet with an HTTP Content-Type: text/css;charset=utf-8 header

Support for approaches 2 and 3 has traditionally been rocky, though I haven't checked browser support recently.
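Approach 1 keeps the stylesheet pure ASCII, so the encoding the browser guesses stops mattering. A minimal sketch of generating such escapes, in Python: the helper `css_escape` and the sample rule are hypothetical, and the exact escape used in the referenced answer is not shown here.

```python
def css_escape(text: str) -> str:
    """Write every non-ASCII character in text as a CSS backslash escape.

    ASCII characters are passed through unchanged, so quotes and
    backslashes in the input are NOT escaped by this sketch.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # A full six-digit escape needs no terminating space,
            # even when the next character is a hex digit.
            out.append("\\{:06x}".format(ord(ch)))
    return "".join(out)


# Build a rule whose content is a space followed by the » character.
rule = 'a:after { content: "' + css_escape(" \u00bb") + '"; }'
print(rule)
```

The resulting rule contains only ASCII bytes, so it reads the same whether the file is decoded as UTF-8 or Windows-1252.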


