How do I use Nokogiri to parse an XML file?
Here I will try to address each of your questions in turn:
require 'nokogiri'
doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML
So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?
Not quite. The result of an XPath query such as //Items/Item is a Nokogiri::XML::NodeSet. Here it holds the 2 Item children of Items, each of which is a Nokogiri::XML::Element object. You can also call them Nokogiri::XML::Node objects, since Element is a subclass of Node.
doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]
We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen.
I'm not sure I follow this part, so below I try to show what a Node is and what a NodeSet is in Nokogiri. Note that map returns an Array, not a hash, and remember that a NodeSet is a collection of Nodes.
@chld_class = @block.map do |node|
node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]
@chld_name = @block.map do |node|
node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]
Rails nokogiri parse XML file
You're on the right track. parts = xml_doc.xpath('/root/rows/row') gives you back a NodeSet, i.e. a list of the <row> elements.
You can loop through these using each, or use row indexes like parts[0], parts[1] to access specific rows. You can then get the values of child nodes using xpath on the individual rows.
e.g. you could build a list of the AnalogueCode for each part with:
codes = []
parts.each do |row|
codes << row.xpath('AnalogueCode').text
end
Looking at the full example of the XML you're processing, there are 2 issues preventing your XPath from matching:
- the <root> tag isn't actually the root element of the XML, so /root/.. doesn't match
- the XML is using namespaces, so you need to include these in your XPaths
So there are a couple of possible solutions:
1. use CSS selectors rather than XPaths (i.e. use search) as suggested by the Tin Man
2. after xml_doc = Nokogiri::XML(response.body), do xml_doc.remove_namespaces! and then use parts = xml_doc.xpath('//root/rows/row'), where the double slash is XPath syntax to locate the root node anywhere in the document
3. specify the namespaces:
e.g.
xml_doc = Nokogiri::XML(response.body)
ns = xml_doc.collect_namespaces
parts = xml_doc.xpath('//xmlns:rows/xmlns:row', ns)
codes = []
parts.each do |row|
  # xpath must be called on the row; a bare xpath call has no receiver here
  codes << row.xpath('xmlns:AnalogueCode', ns).text
end
I would go with 1. or 2. :-)
Parse xml file with nokogiri
Problem #1
In this line:
@parentN =parent.xpath('///ancestor::*/@name')
you overwrite the previous value of @parentN.
Problem #2
By running
<% for x in 0...@parentN.count %>
You will be getting 2 values for a single-valued array. .count is equivalent to the last index + 1 (for an array with only [0], .count is 1). Your @parentN is assigned to an object
Recommendation (simple)
Use a single array to hold the nested values (as a hash) rather than two variables.
#xmlController.rb
@codes = []
doc.xpath('Report/Node').each do |parent|
  # One single-entry hash per parent: name => list of child texts
  @codes << { parent.xpath('@name') => parent.xpath('Node').map { |child| child.text } }
end
#show.html.erb
<% @codes.each do |code| %>
  <% code.each do |parent, children| %>
    <p> PARENT: <%= parent %> </p>
    <p> CHILDREN: <%= children.join(', ') %> </p>
  <% end %>
<% end %>
Recommendation based on comments below
The above was shown to demonstrate the simplest way to think about the problem. Now that we are ready to parse all the data in the node, we need to change our xpath and our map. The doc.xpath('Report/Node') call is used to select the parent node, and that can stay the same. We will want to set the @codes key to the actual string value of the name attribute, which is not parent.xpath('@name') but parent.xpath('@name')[0].value. An XPath query always returns a collection, even when only one attribute matches, so we take the first ([0]) entry. The value of the name attribute is then returned by the .value method.
Make a class so the nodes become objects
Your Parent node has a name and a color and your children have name, color, and rank. It looks like you have a model for Node that looks like:
class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children
end
I'm simplifying things by not using persistence here, but you may want to save your records to disk. If you do, look into the slew of things ActiveRecord does on RailsGuides.
Now when we go through the xml document, we will create an array of objects rather than the hash of strings (which both happen to be objects, but I'll leave that quandary for you to check out).
Parse the Xpath to get attributes of Node Objects
A quick way to set the name and color attributes of the parent looks like this:
@node = Node.new(doc.xpath('Report/Node').first.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
OK, so maybe that wasn't all that easy. What we do is take the Enumerable result of the XPath, navigate to the first element's attributes and make a hash of attribute names (name, color, rank), converted to symbols, and their corresponding values. Once we have the hash we pass it to our Node class' new method to instantiate (create) a node. This gives us an object that we can use:
@node.name
#=> "Example Parent 1"
Extend the Class for children
Once we have the parent node, we can give it children, creating new nodes in an array. To facilitate this, we extend the definition of the model to include an overridden initializer (new()).
class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children
def initialize(*args)
self.children = []
super(*args)
end
end
Adding children
@node.children << Node.new(doc.xpath('Report/Node').first.xpath('Node').first.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
We can automate this process now that we know how to create a Node object using .first and a child of it using .first with the previous enumeration.
doc.xpath('Report/Node').each do |parent|
  # Build the parent from its attributes
  node = Node.new(parent.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
  # Build one Node per child element
  node.children = parent.xpath('Node').map do |child|
    Node.new(child.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
  end
end
Ugly controller code
Move it to the model
But wait! That isn't very DRY! Let's move the logic that hurts our eyes to look at into the model to make it easier to work with.
class Node
include ActiveModel::Model
attr_accessor :name, :color, :rank, :children
def initialize(*args)
self.children = []
super(*args)
end
def self.new_from_xpath(xml_node)
self.new(xml_node.attributes.inject({}) { |attrs, value| attrs[value[0].to_sym] = value[1].value; attrs })
end
end
Final controller
Now the controller looks like this:
@nodes = []
doc.xpath('Report/Node').each do |parent|
node = Node.new_from_xpath(parent)
node.children = parent.xpath('Node').map do |child|
Node.new_from_xpath(child)
end
@nodes << node
end
Using this in the view
In the view you can use the @nodes like this:
<% for @node in @nodes %>
Parent: <%= @node.name %>
Children: <% for @child in @node.children %>
<%= @child.name %> is <%= @child.color %>
<% end %>
<% end %>
Simple XML parsing example for Nokogiri
The solution can be easier to see if you try it in steps.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
</xml>
The syntax //foo selects all the foo elements.
> puts doc.xpath("//foo")
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
Nokogiri returns nodes as a NodeSet like this:
> puts doc.xpath("//foo").class
Nokogiri::XML::NodeSet
A NodeSet is enumerable; you can use methods such as each, map, etc.
> puts doc.xpath("//foo").kind_of?(Enumerable)
true
This NodeSet contains two foo elements:
> doc.xpath("//foo").each{|e| puts e.class }
Nokogiri::XML::Element
Nokogiri::XML::Element
The syntax //foo/* selects the foo elements' child elements:
> puts doc.xpath("//foo/*")
<goo>a</goo>
<hoo>b</hoo>
<goo>c</goo>
<hoo>d</hoo>
To print an element's info, see the Nokogiri::XML::Node documentation; the two methods you'll likely want are name and text.
Solution for you:
> doc.xpath("//foo/*").each{|e|
puts "#{e.name}:#{e.text}"
}
goo:a
hoo:b
goo:c
hoo:d
For your second question, you're essentially asking:
- for each foo element, get its child elements
- for each child element, print the name and text
Solution for you:
> doc.xpath("//foo").each_with_index{|parent_elem, parent_count|
puts "Parent #{parent_count + 1}"
parent_elem.elements.each{|child_elem|
puts "#{child_elem.name}:#{child_elem.text}"
}
}
Error parsing XML document with Ruby's Nokogiri
So the issue is that your XML contains namespaces.
There are 2 options:
- Remove the namespaces
doc.remove_namespaces!
doc.at_xpath("//tsn")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>
- Reference the namespace:
doc.at_xpath("//ax21:tsn", 'ax21' => "http://data.itis_service.itis.usgs.gov/xsd")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>
Based on the comments it seems you are really only interested in the text for that node. You can retrieve that in multiple ways:
doc.at_xpath("//tsn").text()
#=> "26339"
doc.at_xpath("//tsn/text()").to_s
#=> "26339"
# If you want tsn and kingdom at the same time
doc.xpath('//tsn/text() | //kingdom/text()').map(&:to_s)
#=> ["26339", "Plantae"]