How to Save Unescaped & in Nokogiri XML

Disable HTML within XML escaping with Nokogiri

Because I don't have the Google Directions API installed I can't access the XML, but I strongly suspect the problem comes from telling Nokogiri you're dealing with XML. As a result, it returns the embedded HTML with its markup entity-encoded, as it should be inside XML.

You can unescape the HTML using something like:

require 'cgi'

CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"

unescape_html is an alias to unescapeHTML:


# Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
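As a quick sanity check of the unescaping described above, here is a minimal sketch that decodes an entity-encoded instruction and then re-encodes it with CGI.escapeHTML to confirm the two calls are inverses for this input. The sample string is shortened from the one above; nothing here touches Nokogiri.

```ruby
require 'cgi'

# Entity-encoded HTML, as it would appear inside an XML text node.
escaped = 'Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt;'

# Decode the entities back into literal markup.
unescaped = CGI.unescapeHTML(escaped)
puts unescaped

# Re-encoding should give back the original string for this input.
round_trip = CGI.escapeHTML(unescaped)
puts round_trip == escaped
```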

I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:

puts h.inner_html

Use:

puts h.text

I proved this using:

require 'httpclient'
require 'nokogiri'

# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new

doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
  puts html.text
end

Which outputs:

Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]

The difference is that inner_html returns the node's content serialized as markup, so the entities stay encoded, while text decodes them for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.

Reading malformed XML with Nokogiri: Unescaped Ampersands in URL field

Had the same issue parsing SVGs with image links containing ampersands.

Parsing the SVG as HTML first seems to handle the links correctly, escaping each & as &amp;.

fixed_svg = Nokogiri::HTML.fragment(raw_svg).to_html
# proceed with XML parsing
svg = Nokogiri::XML(fixed_svg)
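If round-tripping through the HTML parser is not an option, a different technique is to escape the stray ampersands in the raw string before handing it to the XML parser. The sketch below is not part of Nokogiri: escape_bare_ampersands is a hypothetical helper, the URL is made up, and the pattern only recognizes XML's five predefined entities plus numeric references, so treat it as a starting point rather than a complete fix.

```ruby
# Hypothetical helper (not a Nokogiri API): matches an ampersand that does not
# already start a predefined XML entity or a numeric character reference.
BARE_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/

def escape_bare_ampersands(raw_xml)
  raw_xml.gsub(BARE_AMPERSAND, '&amp;')
end

# Made-up SVG with one bare ampersand in a URL and one valid entity.
raw_svg = '<svg><image href="http://example.com/a.png?w=10&h=20"/><title>A &amp; B</title></svg>'

fixed_svg = escape_bare_ampersands(raw_svg)
puts fixed_svg
```

The result can then be passed to Nokogiri::XML as in the snippet above. Because already-escaped entities are skipped, running the helper twice leaves the string unchanged.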

How to save my changes in XML file with Nokogiri

Read the file into an in-memory XML document, modify the document as needed, then serialize the document back into the original file:

filename = 'exam.xml'
xml = File.read(filename)
doc = Nokogiri::XML(xml)
# ... make changes to doc ...
File.write(filename, doc.to_xml)

Preventing Nokogiri from escaping characters?

You are obliged to escape some characters in text elements, like:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

If you want your text verbatim use a CDATA section since everything inside a CDATA section is ignored by the parser.

Nokogiri example:

builder = Nokogiri::HTML::Builder.new do |b|
  b.html do
    b.head do
      b.cdata "<%= stylesheet_link_tag 'style'%>"
    end
  end
end
builder.to_html

This should keep your ERB tags intact!

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

page.at_css("td[custom-attribute='foo']")
    .parent
    .css('td')
    .css('a')
    .text   # since you need the text, not inner_html
    .strip  # removes leading and trailing whitespace

See String#strip.

Sidenote: css('td a') is likely more efficient than css('td').css('a').

How to unescape HTML in Nokogiri Ruby, so & remains & and not &amp;

Use content instead of inner_html to get the content as plain text instead of (X)HTML.

irb(main):011:0> doc.at('head/title').content
=> "Foo & Bar"

