Nokogiri to_xml without carriage returns

Builder#to_xml by default outputs formatted (i.e. indented) XML. You can use the Nokogiri::XML::Node::SaveOptions to get an almost unformatted result.

b = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.foo do
      xml.text("Value")
    end
  end
end

b.to_xml
# => "<?xml version=\"1.0\"?>\n<root>\n  <foo>Value</foo>\n</root>\n"

b.to_xml(:save_with => Nokogiri::XML::Node::SaveOptions::AS_XML)
# => "<?xml version=\"1.0\"?>\n<root><foo>Value</foo></root>\n"

From there you can drop the XML declaration (which is optional anyway) and strip the trailing newline:

b.to_xml(:save_with => Nokogiri::XML::Node::SaveOptions::AS_XML | Nokogiri::XML::Node::SaveOptions::NO_DECLARATION).strip
# => "<root><foo>Value</foo></root>"

Just removing all newlines in the XML is probably a bad idea, as newlines can be significant (e.g. in <pre> blocks of XHTML). If you are really sure your document contains no significant newlines, you could simply delete them from the output string.

Nokogiri builder #to_xml, no carriage return after adding text fragments

XML serialization is handled by the underlying libxml2. "If libxml2 detects that there is already some text nodes as children of a node it will disable automatic indenting for the whole subtree." AFAIK this libxml2 behaviour cannot be changed.

In your example such a text node was produced by the newline between elements, but the same happens for any inter-element text. Since the text node was added to the root element, the whole document was rendered without indentation. Were it added somewhere down the document structure, only the subtree containing it would lack indentation:

xml_text1 = "<text1>text1</text1>a<text2>text2</text2>"
xml = Nokogiri::XML::Builder.new(encoding: "utf-8")
xml.Message do
  xml.Header do
    xml.NumberOne "1"
    xml.NumberTwo "2"
  end
  # wrapper element added
  xml.Wrapper do
    xml << xml_text1
  end
end

puts xml.to_xml

Only the content of <Wrapper> is without indentation:

<?xml version="1.0" encoding="utf-8"?>
<Message>
  <Header>
    <NumberOne>1</NumberOne>
    <NumberTwo>2</NumberTwo>
  </Header>
  <Wrapper><text1>text1</text1>a<text2>text2</text2></Wrapper>
</Message>

A possibly useful hack would be parsing the XML strings yourself and dropping the unwanted whitespace-only text nodes before inserting the rest:

xml_text1 = "<text1>text1</text1>\n<text2>text2</text2>"

xml = Nokogiri::XML::Builder.new(encoding: "utf-8")
xml.Message do
  xml.Header do
    xml.NumberOne "1"
    xml.NumberTwo "2"
  end

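  # `doc` and `insert` are Builder methods; they are reachable here because
  # a block without arguments is evaluated in the builder's own context.
  doc.fragment(xml_text1).children.each do |node|
    # drop all whitespace-only text nodes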
    next if node.text? && node.content =~ /\A\s+\Z/
    insert node
  end
end

Print an XML document without the XML header line at the top

The simplest way to get the XML for a Document without the leading "PI" (processing instruction) is to call to_s on the root element instead of the document itself:

require 'nokogiri'
doc = Nokogiri.XML('<hello world="true" />')

puts doc
#=> <?xml version="1.0"?>
#=> <hello world="true"/>

puts doc.root
#=> <hello world="true"/>

The 'correct' way to do it at the document or builder level, though, is to use SaveOptions:

formatted_no_decl = Nokogiri::XML::Node::SaveOptions::FORMAT +
                    Nokogiri::XML::Node::SaveOptions::NO_DECLARATION

puts doc.to_xml( save_with:formatted_no_decl )
#=> <hello world="true"/>

# Making your code shorter, but horribly confusing for future readers
puts doc.to_xml save_with:3
#=> <hello world="true"/>

Note that DocumentFragments do not automatically include this PI:

frag = Nokogiri::XML::DocumentFragment.parse('<hello world="true" />')
puts frag
#=> <hello world="true"/>

If you are seeing a PI in your fragment output, it means it was there when you parsed it.

xml = '<?xml version="1.0"?><hello world="true" />'
frag = Nokogiri::XML::DocumentFragment.parse(xml)
puts frag
#=> <?xml version="1.0"?><hello world="true"/>

If so, and you want to get rid of any PIs, you should be able to do so with a little XPath:

frag.xpath('//processing-instruction()').remove
puts frag

…except that this does not appear to work due to oddness with XPath in DocumentFragments. To work around these bugs do this instead:

# To remove only PIs at the root level of the fragment
frag.xpath('processing-instruction()').remove
puts frag
#=> <hello world="true"/>

# Alternatively, to remove all PIs everywhere, including inside child nodes
frag.xpath('processing-instruction()|.//processing-instruction()').remove

If you have a Builder object, either of the following works:

builder = Nokogiri::XML::Builder.new{ |xml| xml.hello(world:"true") }

puts builder.to_xml
#=> <?xml version="1.0"?>
#=> <hello world="true"/>

puts builder.doc.root.to_xml
#=> <hello world="true"/>

formatted_no_decl = Nokogiri::XML::Node::SaveOptions::FORMAT +
                    Nokogiri::XML::Node::SaveOptions::NO_DECLARATION

puts builder.to_xml save_with:formatted_no_decl
#=> <hello world="true"/>

Problems inserting elements into XML fragment

Well, the solution was just to update the version of Nokogiri. Presumably, this was a bug that was fixed between versions 1.6.3.1 and 1.6.6.2.

Where can I find some official documentation for text() XPath query syntax?

Where can I find some official
PHP/XPath docs that explain it?

The notation:

text()

is a node test as defined in the W3C XPath 1.0 specification, which is the only official XPath 1.0 definition.

In particular, the spec says:

"The node test text() is true for any text node".

And a "text node" is one of the seven different kinds of nodes in the XPath data model.

How do you use the rspec have_selector method to verify XML?

Capybara doesn't support XML responses. It always uses Nokogiri::HTML to parse the content, which produces unexpected results when given XML.

Adding XML support has been requested but was rejected by Capybara's maintainer.
