How to Use Nokogiri to Parse an XML File

How do I use Nokogiri to parse an XML file?

Here I will try to address each of the questions and points of confusion you raised:

require 'nokogiri'

doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML

So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?

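Not quite. Items is a single element; it is the result of doc.xpath that is a Nokogiri::XML::NodeSet. Here that NodeSet holds the 2 Item children of Items, each of which is a Nokogiri::XML::Element object. Because Element is a subclass of Nokogiri::XML::Node, you can also call them nodes.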

doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]

We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen.

I'm not sure I follow this part, but note that map on a NodeSet returns an Array, not a Hash. The examples below show what a Node is and what a NodeSet is in Nokogiri. Remember that a NodeSet is a collection of Nodes.

@chld_class = @block.map do |node|
node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]

@chld_name = @block.map do |node|
node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]

Rails nokogiri parse XML file

You're on the right track. parts = xml_doc.xpath('/root/rows/row') gives you back a NodeSet i.e. a list of the <row> elements.

You can loop through these using each or use row indexes like parts[0], parts[1] to access specific rows. You can then get the values of child nodes using xpath on the individual rows.

e.g. you could build a list of the AnalogueCode for each part with:

codes = []
parts.each do |row|
codes << row.xpath('AnalogueCode').text
end

Looking at the full example of the XML you're processing there are 2 issues preventing your XPath from matching:

  1. the <root> tag isn't actually the root element of the XML so /root/.. doesn't match

  2. The XML is using namespaces so you need to include these in your XPaths

so there are a few possible solutions:

  1. use CSS selectors rather than XPaths (i.e. use search) as suggested by the Tin Man

  2. after xml_doc = Nokogiri::XML(response.body), call xml_doc.remove_namespaces! and then use parts = xml_doc.xpath('//root/rows/row'), where the double slash is XPath syntax for matching the root element at any depth in the document

  3. specify the namespaces:

e.g.

xml_doc  = Nokogiri::XML(response.body)
ns = xml_doc.collect_namespaces
parts = xml_doc.xpath('//xmlns:rows/xmlns:row', ns)

codes = []
parts.each do |row|
codes << row.xpath('xmlns:AnalogueCode', ns).text
end

I would go with 1. or 2. :-)

Parse xml file with nokogiri

Problem #1

In this line:

       @parentN =parent.xpath('///ancestor::*/@name')

you override the previous value of @parentN.

Problem #2

By running

<% for x in 0...@parentN.count %>

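Check the bounds of this range carefully. .count is equivalent to the last index + 1 (for an array with only [0], .count is 1), so the three-dot range 0...@parentN.count visits each index once, whereas a two-dot range 0..@parentN.count would give you 2 values for a single-valued array. Also keep in mind that @parentN holds the NodeSet object returned by xpath, not a plain array.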

Recommendation (simple)

Use a single array to hold the nested values (as a hash) rather than two variables.

#xmlController.rb
@codes = []
doc.xpath('Report/Node').each do |parent|
@codes << { parent.xpath('@name') => parent.xpath('Node').map { |child| child.text } }
end

#show.html.erb

<% @codes.each do |code| %>
<% code.each do |parent, children| %>
<p> PARENT: <%= parent %> </p>
<p> CHILDREN: <%= children.join(', ') %> </p>
<% end %>
<% end %>

Recommendation based on comments below

The above was shown to demonstrate the simplest way to think about the problem. Now that we are ready to parse all the data in the node, we need to change our XPath and our map. doc.xpath('Report/Node') selects the parent nodes, and that can stay the same. We want the @codes key to be the actual string stored in the attribute, which is not parent.xpath('@name') (that is a NodeSet) but parent.xpath('@name')[0].value. xpath always returns a NodeSet, even when only one attribute matches, so we take the first entry ([0]). The .value method then returns the value of the name attribute as a string.

Make a class so the nodes become objects

Your Parent node has a name and a color and your children have name, color, and rank. It looks like you have a model for Node that looks like:

class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children
end

I'm simplifying things by not using persistence here, but you may want to save your records to disk; if you do, look into the slew of things ActiveRecord does on RailsGuides.

Now when we go through the XML document, we will create an array of objects rather than a hash of strings (both of which happen to be objects, but I'll leave that quandary for you to check out).

Parse the Xpath to get attributes of Node Objects

A quick way to set the name and color attributes of the parent looks like this:

@node = Node.new(doc.xpath('Report/Node').first.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })

OK, so maybe that wasn't all that easy. We take the Enumerable result of the XPath, navigate to the first node's attributes, and build a hash that maps the attribute names (name, color, rank), converted to symbols, to their values. Once we have the hash, we pass it to our Node class's new method to instantiate (create) a node. This gives us an object that we can use:

@node.name
#=> "Example Parent 1"

Extend the Class for children

Once we have the parent node, we can give it children, creating new nodes in an array. To facilitate this, we extend the definition of the model to include an overridden initializer (new()).

class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children

def initialize(*args)
self.children = []
super(*args)
end
end

Adding children

@node.children << Node.new(doc.xpath('Report/Node').first.xpath('Node').first.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })

Now that we know how to build a Node object from the first parent and a child from that parent's first nested Node, we can automate the process by iterating over all of them instead of calling .first.

doc.xpath('Report/Node').each do |parent|
node = Node.new(parent.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
node.children = parent.xpath('Node').map do |child|
Node.new(child.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
end
end

Ugly controller code

Move it to the model

But Wait! That isn't very DRY! Let's move the logic that hurts our eyes to look at into the model to make it easier to work with.

class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children

def initialize(*args)
self.children = []
super(*args)
end

def self.new_from_xpath(xml_node)
self.new(xml_node.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
end
end

Final controller

Now the controller looks like this:

@nodes = []
doc.xpath('Report/Node').each do |parent|
node = Node.new_from_xpath(parent)
node.children = parent.xpath('Node').map do |child|
Node.new_from_xpath(child)
end
@nodes << node
end

Using this in the view

In the view you can use @nodes like this:

<% @nodes.each do |node| %>
Parent: <%= node.name %>
Children: <% node.children.each do |child| %>
<%= child.name %> is <%= child.color %>
<% end %>
<% end %>

Simple XML parsing example for Nokogiri

The solution can be easier to see if you try it in steps.

Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
</xml>

The syntax //foo selects all the foo elements.

> puts doc.xpath("//foo")
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>

Nokogiri returns nodes as a NodeSet like this:

> puts doc.xpath("//foo").class
Nokogiri::XML::NodeSet

A NodeSet is enumerable; you can use methods such as each, map, etc.

> puts doc.xpath("//foo").kind_of?(Enumerable)
true

This NodeSet contains two foo elements:

> doc.xpath("//foo").each{|e| puts e.class }
Nokogiri::XML::Element
Nokogiri::XML::Element

The syntax //foo/* selects the foo elements' child elements:

> puts doc.xpath("//foo/*")
<goo>a</goo>
<hoo>b</hoo>
<goo>c</goo>
<hoo>d</hoo>

To print an element's info, see Nokogiri/XML/Node documentation; the two methods you'll likely want are name and text.

Solution for you:

> doc.xpath("//foo/*").each{|e|
puts "#{e.name}:#{e.text}"
}
goo:a
hoo:b
goo:c
hoo:d

For your second question, you're essentially asking:

  1. for each foo element, get its child elements
  2. for each child element, print the name and text

Solution for you:

> doc.xpath("//foo").each_with_index{|parent_elem, parent_count| 
puts "Parent #{parent_count + 1}"
parent_elem.elements.each{|child_elem|
puts "#{child_elem.name}:#{child_elem.text}"
}
}

Error parsing XML document with Ruby's Nokogiri

So the issue is that your XML contains namespaces.

There are 2 options:

  1. Remove the namespaces
doc.remove_namespaces! 
doc.at_xpath("//tsn")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>

  2. Reference the namespace:
doc.at_xpath("//ax21:tsn", 'ax21' => "http://data.itis_service.itis.usgs.gov/xsd") 
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>

Based on the comments it seems you are really only interested in the text for that node. You can retrieve that in multiple ways:

doc.at_xpath("//tsn").text()
#=> "26339"
doc.at_xpath("//tsn/text()").to_s
#=> "26339"
# If you want tsn and kingdom at the same time
doc.xpath('//tsn/text() | //kingdom/text()').map(&:to_s)
#=> ["26339", "Plantae"]
