How to Get Nokogiri to Parse and Return an XML Document

How can I get Nokogiri to parse and return an XML document?

It has to do with the way Nokogiri's parse method works. Here's the source:

# File lib/nokogiri.rb, line 55
def parse string, url = nil, encoding = nil, options = nil
  doc =
    if string =~ /^\s*<[^Hh>]*html/i # Probably html
      Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
    else
      Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
    end
  yield doc if block_given?
  doc
end

The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you pass the result of open directly, you hand parse an IO-like object rather than a String, so the =~ test never succeeds and the document is always parsed as XML. read, on the other hand, returns a String, so the regex can match and the document may be treated as HTML. In this case it is, because the string matches that regex. Here's the start of that string:

<!DOCTYPE html PUBLIC

The regex matches "!DOCTYPE " against [^Hh>]* and then matches "html", so Nokogiri assumes the document is HTML. Why this regex was chosen to decide whether a file is HTML is beyond me. With it, a file that begins with a tag like <definitely-not-html> is considered HTML, but one that begins with <this-is-still-not-html> is considered XML, because the "h" in "this" stops [^Hh>]* before it can reach "html". You're probably best off staying away from this function and calling Nokogiri::HTML::Document.parse or Nokogiri::XML::Document.parse directly.
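To see the heuristic in isolation, here is a minimal sketch in plain Ruby that applies the regex from the excerpt above to a few sample strings. The constant name PROBABLY_HTML and the helper probably_html? are made up for illustration and are not part of Nokogiri.

```ruby
# The regex is copied from the quoted source; the names around it are hypothetical.
PROBABLY_HTML = /^\s*<[^Hh>]*html/i

# Returns true when the heuristic would pick the HTML parser.
def probably_html?(string)
  !!(string =~ PROBABLY_HTML)
end

samples = [
  '<!DOCTYPE html PUBLIC',
  '<definitely-not-html>',
  '<this-is-still-not-html>',
  '<?xml version="1.0"?><root/>'
]

samples.each do |sample|
  puts "#{sample.inspect} -> #{probably_html?(sample) ? 'HTML' : 'XML'}"
end
```

The first two samples are classified as HTML and the last two as XML, which is why calling the XML or HTML parser explicitly is the more predictable choice.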

How do I use Nokogiri to parse an XML file?

I'll try to address each of your questions and points of confusion in turn:

require 'nokogiri'

doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML

So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?

No. doc.xpath("//Items/Item") returns a Nokogiri::XML::NodeSet. It contains the two Item elements under Items, each of which is a Nokogiri::XML::Element object. Since Element is a subclass of Nokogiri::XML::Node, you can also call them nodes.

doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]

We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen.

I don't fully follow this, so below I try to show what a Node and a NodeSet are in Nokogiri. Remember that a NodeSet is a collection of Nodes, and that map on a NodeSet returns an Array, not a Hash.

@chld_class = @block.map do |node|
node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]

@chld_name = @block.map do |node|
node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]

Rails nokogiri parse XML file

You're on the right track. parts = xml_doc.xpath('/root/rows/row') gives you back a NodeSet i.e. a list of the <row> elements.

You can loop through these using each or use row indexes like parts[0], parts[1] to access specific rows. You can then get the values of child nodes using xpath on the individual rows.

e.g. you could build a list of the AnalogueCode for each part with:

codes = []
parts.each do |row|
codes << row.xpath('AnalogueCode').text
end

Looking at the full example of the XML you're processing there are 2 issues preventing your XPath from matching:

  1. The <root> tag isn't actually the root element of the XML, so /root/.. doesn't match

  2. The XML is using namespaces so you need to include these in your XPaths

so there are a couple of possible solutions:

  1. use CSS selectors rather than XPaths (i.e. use search) as suggested by the Tin Man

  2. after xml_doc = Nokogiri::XML(response.body), call xml_doc.remove_namespaces! and then use parts = xml_doc.xpath('//root/rows/row'), where the leading double slash is XPath syntax for matching the root element at any depth in the document

  3. specify the namespaces:

e.g.

xml_doc  = Nokogiri::XML(response.body)
ns = xml_doc.collect_namespaces
parts = xml_doc.xpath('//xmlns:rows/xmlns:row', ns)

codes = []
parts.each do |row|
  # xpath must be called on the current row, not on self
  codes << row.xpath('xmlns:AnalogueCode', ns).text
end

I would go with 1. or 2. :-)

Convert Nokogiri XML Document into Array of Strings?

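In your case, simply use .text to extract the content of the tags. Note that calling text on a whole NodeSet, as in titles.text, returns the text of all nodes concatenated into one string; to get an array with one string per node, use titles.map(&:text).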

How to parse this returned XML with Nokogiri

To see what's wrong with a document use the errors method. After parsing your XML:

doc.errors
# => [#<Nokogiri::XML::SyntaxError: xmlns: URI www.example.com/ is not absolute>,
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: xmlns: URI www.example.com/SellerMessages is not absolute>]

To extract the data I'd use something like this:

doc = Nokogiri::XML(XML)
doc.remove_namespaces!
dealers = doc.search('Dealer').map{ |dealer|
  {
    buyer_id: dealer.at( 'BUYER_ID' ).text,
    reservation_id: dealer.at( 'Reservation_ID' ).text,
    name: dealer.at( 'Name' ).text
  }
}

dealers
# => [{:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|520a8037-57c8-497e-be4b-f4ea8dfa6c6f|14187-20",
# :name=>"Randy's Rides"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|e42fd5c6-0a36-4552-8b6a-ad2decebd0db|14200-10",
# :name=>"Jarrett's New Car Dealership 01"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|3fecb591-3a81-49f9-82b3-1f0d7fb3f7a6|14160-20",
# :name=>"Campbell's Crazy Cars"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|731b09e9-700b-4f41-8cb0-eaf80e861d76|14158-7",
# :name=>"Demo Dealer 3"}]

Of course you'll want to add/remove/change fields being extracted to fit your use-case.

Using slop mode has its dangers, as stated by the Nokogiri documentation.

  1. Don’t use this.
  2. This may or may not be a backhanded compliment.
  3. No, really, don’t use this. If you use it, don’t report bugs.
  4. You’ve been warned!

I've never used it as a result. Often we don't want to use remove_namespaces! either, but it appears safe in your situation.

Error parsing XML document with Ruby's Nokogiri

So the issue is that your XML contains namespaces.

There are 2 options:

  1. Remove the namespaces
doc.remove_namespaces! 
doc.at_xpath("//tsn")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>

  2. Reference the namespace:
doc.at_xpath("//ax21:tsn", 'ax21' => "http://data.itis_service.itis.usgs.gov/xsd") 
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>

Based on the comments it seems you are really only interested in the text for that node. You can retrieve that in multiple ways:

doc.at_xpath("//tsn").text()
#=> "26339"
doc.at_xpath("//tsn/text()").to_s
#=> "26339"
# If you want tsn and kingdom at the same time
doc.xpath('//tsn/text() | //kingdom/text()').map(&:to_s)
#=> ["26339", "Plantae"]
