&#146; is getting converted as \u0092 by nokogiri in ruby on rails
they&#146;re
is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#8217; or &rsquo;. Or, if you're using UTF-8, just include it verbatim as ’.
&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
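To make the difference concrete, here is a minimal Ruby sketch using only the standard library. It prints the UTF-8 bytes of both characters and shows what byte 0x92 means when read as cp1252. The constant names and the hex_bytes helper are made up for illustration.

```ruby
# U+0092 is a C1 control character; U+2019 is the typographic apostrophe.
CONTROL    = [0x92].pack('U')    # code point 146 encoded as UTF-8
APOSTROPHE = [0x2019].pack('U')  # RIGHT SINGLE QUOTATION MARK as UTF-8

# Illustrative helper: render a string's bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format('%02x', b) }.join(' ')
end

puts hex_bytes(CONTROL)     # c2 92
puts hex_bytes(APOSTROPHE)  # e2 80 99

# Byte 0x92 interpreted as Windows-1252 is the apostrophe, which is what
# browsers assume a reference to 146 was meant to be.
FROM_CP1252 = [0x92].pack('C').force_encoding('Windows-1252').encode('UTF-8')
puts FROM_CP1252 == APOSTROPHE  # true
```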
The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be in libxml2's htmlParseCharRef function, to substitute characters 128–159.
In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146; -> &#8217; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
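A rough sketch of that substitution in plain Ruby might look like the following. The helper name fix_cp1252_references and the constant are hypothetical, and the table follows the cp1252 interpretation that HTML5 documents for references 128 to 159. Values with no cp1252 character (129, 141, 143, 144, 157) are left untouched. The result would then be handed to the parser as usual.

```ruby
# Code points that numeric references 128-159 are conventionally taken to mean
# (the cp1252 interpretation). Unmapped values are deliberately absent.
CP1252_REFERENCE_MAP = {
  0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
  0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
  0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
  0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
  0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
  0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
  0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
}.freeze

# Hypothetical helper: rewrite decimal (&#146;) and hex (&#x92;) references in
# the 128-159 range to the reference for the intended Unicode character.
def fix_cp1252_references(html)
  html.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/) do |match|
    code = $1 ? $1.to_i(16) : $2.to_i
    mapped = CP1252_REFERENCE_MAP[code]
    mapped ? format('&#x%04X;', mapped) : match
  end
end

puts fix_cp1252_references('they&#146;re')  # they&#x2019;re
```

As the answer says, this is crude: it also rewrites matching sequences inside comments, so treat it as a stopgap rather than a proper fix.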
Preventing Nokogiri from escaping characters?
You are obliged to escape some characters in text elements, like:
" &quot;
' &apos;
< &lt;
> &gt;
& &amp;
If you want your text verbatim use a CDATA section since everything inside a CDATA section is ignored by the parser.
Nokogiri example:
require 'nokogiri'

# Emit the ERB tag inside a CDATA section so the builder does not escape it.
builder = Nokogiri::HTML::Builder.new do |b|
  b.html do
    b.head do
      b.cdata "<%= stylesheet_link_tag 'style' %>"
    end
  end
end
builder.to_html
This should keep your ERB tags intact!
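For comparison, here is a small standard-library sketch of the two options without Nokogiri: escaping the text, or wrapping it in a CDATA section by hand. The cdata_wrap helper and the constant names are hypothetical. Note that Ruby's CGI.escapeHTML writes the apostrophe as &#39; rather than &apos;.

```ruby
require 'cgi'

ERB_TAG = "<%= stylesheet_link_tag 'style' %>"

# Option 1: escape the markup characters so the text survives as plain text.
ESCAPED = CGI.escapeHTML(ERB_TAG)
puts ESCAPED  # &lt;%= stylesheet_link_tag &#39;style&#39; %&gt;

# Option 2 (hypothetical helper): wrap the text verbatim in a CDATA section.
# A CDATA section cannot itself contain "]]>", so that case is rejected here.
def cdata_wrap(text)
  raise ArgumentError, 'text must not contain "]]>"' if text.include?(']]>')
  "<![CDATA[#{text}]]>"
end

puts cdata_wrap(ERB_TAG)  # <![CDATA[<%= stylesheet_link_tag 'style' %>]]>
```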
Decoding numeric html entities via PHP
html_entity_decode already does what you're looking for:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ binary hex: c292
Which is PRIVATE USE TWO (U+0092). As it's private use, your PHP configuration/version/compile might not return it at all.
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: &#146; is getting converted as “\u0092” by nokogiri in ruby on rails
Using PowerShell to parse XML and edit data into Excel, strange foreign characters appearing
The two bytes C2 92 right before the single quote (27) are a non-printable control character (U+0092). Not sure what the purpose of this Unicode character is, or how the character got into your XML data (if I had to guess I'd say it was copy/pasted from somewhere, perhaps some HTML text).
If you open the file in Notepad and position the cursor right of the single quote in I' you most likely need to press ← 3 times to move the cursor from the right side of the ' to the left side of the I.
Simply remove the character from the XML file (delete the faulty character sequence, type I' in its place, then save the file) and you'll be fine.
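If there are too many occurrences to fix by hand, the same cleanup could be scripted. The following is a minimal Ruby sketch, assuming the text has already been read as UTF-8. The helper name strip_c1_controls is made up for illustration. It removes the whole C1 control range (U+0080 to U+009F), which also covers U+0085 NEXT LINE, so check that your data does not rely on that character.

```ruby
# Hypothetical helper: drop C1 control characters (U+0080..U+009F) from a
# UTF-8 string, leaving all printable text untouched.
def strip_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/, '')
end

DIRTY = "I\u0092'm"            # contains the stray bytes C2 92 before the quote
puts strip_c1_controls(DIRTY)  # I'm
```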