&#146; Is Getting Converted as "\u0092" by Nokogiri in Ruby on Rails


they&#146;re

is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#8217; or &rsquo;. Or, if you're using UTF-8, it can just be included verbatim as ’.

&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be to libxml2's htmlParseCharRef function, to substitute characters 128–159.

In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146; -> &#8217; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
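The substitution above can be sketched in plain Ruby. This is only a rough illustration, not something Nokogiri provides: the helper name fix_cp1252_references is made up here, and it goes a little further than the single &#146; replacement by covering the whole 128 to 159 range, in decimal or hex form. It assumes the input is a UTF-8 string and would be run before handing the markup to the parser.

```ruby
# Hypothetical helper (not part of Nokogiri): rewrite numeric character
# references in the 128..159 range to the cp1252 character browsers show.
def fix_cp1252_references(html)
  html.gsub(/&#([xX][0-9a-fA-F]+|[0-9]+);/) do |ref|
    digits = Regexp.last_match(1)
    code = digits.start_with?("x", "X") ? digits[1..-1].to_i(16) : digits.to_i
    next ref unless (128..159).cover?(code)

    begin
      # Treat the number as a cp1252 byte and convert that byte to UTF-8.
      [code].pack("C").force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
    rescue EncodingError
      ref # a few bytes (for example 0x81) have no cp1252 mapping; keep as-is
    end
  end
end

puts fix_cp1252_references("they&#146;re")
```

References outside the range, such as &#8217;, are left untouched, so running it over markup that is already correct should be harmless.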

Preventing Nokogiri from escaping characters?

You are obliged to escape some characters in text elements like:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
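To make the table concrete, here is a small sketch of the same mapping in plain Ruby. The constant XML_ESCAPES and the method escape_text are names invented for this example; in practice the builder or serializer normally does this escaping for you.

```ruby
# Illustrative names only: the five replacements from the table above.
XML_ESCAPES = {
  '"' => "&quot;",
  "'" => "&apos;",
  "<" => "&lt;",
  ">" => "&gt;",
  "&" => "&amp;"
}.freeze

# Replace each special character with its entity in a single pass, so the
# ampersands produced by earlier replacements are never escaped twice.
def escape_text(text)
  text.gsub(/["'<>&]/, XML_ESCAPES)
end

puts escape_text(%q{<a href="x">Tom & Jerry's</a>})
```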

If you want your text kept verbatim, use a CDATA section, since the parser does not interpret anything inside a CDATA section as markup.

Nokogiri example:

require 'nokogiri'

# Wrap the ERB tag in a CDATA section so the builder does not escape it.
builder = Nokogiri::HTML::Builder.new do |b|
  b.html do
    b.head do
      b.cdata "<%= stylesheet_link_tag 'style'%>"
    end
  end
end
builder.to_html

This should keep your ERB tags intact!

Decoding numeric html entities via PHP

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

(U+0092, not printable)   binary hex: c292

Which is PRIVATE USE TWO (U+0092). As it's private use, your PHP configuration/version/compile might not return it at all.
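If you want to check where the c292 comes from, the UTF-8 bytes of a code point are easy to print. The sketch below is in Ruby rather than PHP, and utf8_hex is just a name picked for the example. It compares the literal reading of &#146; with the character that was probably intended.

```ruby
# Encode a code point as UTF-8 and return its bytes as a hex string.
def utf8_hex(code_point)
  [code_point].pack("U").unpack1("H*")
end

puts utf8_hex(146)     # literal reading of &#146;, i.e. U+0092 -> c292
puts utf8_hex(0x2019)  # RIGHT SINGLE QUOTATION MARK -> e28099
```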

Also there are some more quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

See: &#146; is getting converted as “\u0092” by nokogiri in ruby on rails

Using PowerShell to parse XML and edit data into Excel, strange foreign characters appearing

The two bytes C2 92 right before the single quote (27) are a non-printable control character (U+0092). Not sure what the purpose of this Unicode character is, or how the character got into your XML data (if I had to guess I'd say it was copy/pasted from somewhere, perhaps some HTML text).

If you open the file in Notepad and position the cursor to the right of the single quote in I', you will most likely need to press the left arrow key 3 times to move the cursor from the right side of the ' to the left side of the I.

Simply remove the character from the XML file (delete the faulty character sequence, type I' in its place, then save the file) and you'll be fine.
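If there are too many occurrences to fix by hand, the same cleanup could be scripted. The question is about PowerShell, but as a rough sketch in Ruby, assuming the text is UTF-8 and that no C1 control character (U+0080 to U+009F) in the data is wanted, something like this would do it. The name strip_c1_controls is invented for the example.

```ruby
# Hypothetical helper: drop C1 control characters (U+0080..U+009F),
# which is the block that the stray U+0092 belongs to.
def strip_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/, "")
end

puts strip_c1_controls("I\u0092'm fine")
```

Other non-ASCII characters, such as accented letters, lie outside that range and are kept.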


