How can I get Nokogiri to parse and return an XML document?
It has to do with the way Nokogiri's parse method works. Here's the source:
# File lib/nokogiri.rb, line 55
def parse string, url = nil, encoding = nil, options = nil
  doc =
    if string =~ /^\s*<[^Hh>]*html/i # Probably html
      Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
    else
      Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
    end
  yield doc if block_given?
  doc
end
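The detection pattern in the source above can be tried on its own, without Nokogiri. This is a minimal sketch: sniffed_as_html? is a hypothetical helper written for illustration (it is not part of Nokogiri), and the sample strings are made up apart from the DOCTYPE prefix.

```ruby
# Hypothetical helper, not part of Nokogiri: it applies the same pattern
# that the quoted parse method uses to choose between HTML and XML.
def sniffed_as_html?(string)
  !(string =~ /^\s*<[^Hh>]*html/i).nil?
end

samples = [
  '<!DOCTYPE html PUBLIC',
  '<definitely-not-html>',
  '<this-is-still-not-html>',
  '<?xml version="1.0"?><root/>'
]

# Print which branch each sample would take.
samples.each do |sample|
  kind = sniffed_as_html?(sample) ? 'HTML' : 'XML'
  puts "#{kind.ljust(4)} #{sample}"
end
```

Only the first two samples are treated as HTML, which shows how little the pattern has to do with the actual document type.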
The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you just use open, it returns an object that is not a String, so the regex test never matches and the document is parsed as XML. On the other hand, read returns a string, so it could be regarded as HTML. In this case it is, because it matches that regex. Here's the start of that string: <!DOCTYPE html PUBLIC
The regex matches the "!DOCTYPE " to [^Hh>]* and then matches the "html", thus assuming it's HTML. Why someone selected this regex to determine if the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but <this-is-still-not-html> is considered XML (the "h" in "this" stops [^Hh>]* before it can reach "html"). You're probably best off staying away from this dumb function and invoking Nokogiri::HTML::Document.parse or Nokogiri::XML::Document.parse directly.
How do I use Nokogiri to parse an XML file?
Here I will try to address the questions and points of confusion you raised:
require 'nokogiri'
doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML
You asked: "So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?"
No. The XPath query below returns a Nokogiri::XML::NodeSet. It contains the 2 Item children of Items, each of which is a Nokogiri::XML::Element object. You can also call them Nokogiri::XML::Node objects, since Element is a subclass of Node.
doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]
You also wrote: "We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen."
I don't understand this, but I have tried below to show what a Node is and what a NodeSet is in Nokogiri. Remember that a NodeSet is a collection of Nodes, and that map returns an Array, not a Hash.
@chld_class = @block.map do |node|
node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]
@chld_name = @block.map do |node|
node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]
Rails nokogiri parse XML file
You're on the right track. parts = xml_doc.xpath('/root/rows/row') gives you back a NodeSet, i.e. a list of the <row> elements.
You can loop through these using each or use row indexes like parts[0], parts[1] to access specific rows. You can then get the values of child nodes using xpath on the individual rows.
e.g. you could build a list of the AnalogueCode for each part with:
codes = []
parts.each do |row|
  codes << row.xpath('AnalogueCode').text
end
Looking at the full example of the XML you're processing there are 2 issues preventing your XPath from matching:
1. The <root> tag isn't actually the root element of the XML, so /root/.. doesn't match.
2. The XML is using namespaces, so you need to include these in your XPaths.
There are a few possible solutions:
1. Use CSS selectors rather than XPaths (i.e. use search) as suggested by the Tin Man.
2. After xml_doc = Nokogiri::XML(response.body) do xml_doc.remove_namespaces! and then use parts = xml_doc.xpath('//root/rows/row'), where the double slash is XPath syntax to locate the root node anywhere in the document.
3. Specify the namespaces:
xml_doc = Nokogiri::XML(response.body)
ns = xml_doc.collect_namespaces
parts = xml_doc.xpath('//xmlns:rows/xmlns:row', ns)
codes = []
parts.each do |row|
  # xpath must be called on the row; a bare xpath call has no receiver here
  codes << row.xpath('xmlns:AnalogueCode', ns).text
end
I would go with 1. or 2. :-)
Convert Nokogiri XML Document into Array of Strings?
For your case you should simply use .text to extract the content of tags. Something like titles.text would work.
How to parse this returned XML with Nokogiri
To see what's wrong with a document use the errors method. After parsing your XML:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: xmlns: URI www.example.com/ is not absolute>,
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: xmlns: URI www.example.com/SellerMessages is not absolute>]
To extract the data I'd use something like this:
doc = Nokogiri::XML(XML)
doc.remove_namespaces!
dealers = doc.search('Dealer').map{ |dealer|
  {
    buyer_id: dealer.at( 'BUYER_ID' ).text,
    reservation_id: dealer.at( 'Reservation_ID' ).text,
    name: dealer.at( 'Name' ).text
  }
}
dealers
# => [{:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|520a8037-57c8-497e-be4b-f4ea8dfa6c6f|14187-20",
# :name=>"Randy's Rides"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|e42fd5c6-0a36-4552-8b6a-ad2decebd0db|14200-10",
# :name=>"Jarrett's New Car Dealership 01"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|3fecb591-3a81-49f9-82b3-1f0d7fb3f7a6|14160-20",
# :name=>"Campbell's Crazy Cars"},
# {:buyer_id=>"0000-2127",
# :reservation_id=>
# "1779853194|0000-2067|731b09e9-700b-4f41-8cb0-eaf80e861d76|14158-7",
# :name=>"Demo Dealer 3"}]
Of course you'll want to add/remove/change fields being extracted to fit your use-case.
Using slop mode has its dangers, as stated by the Nokogiri documentation:
- Don’t use this.
- This may or may not be a backhanded compliment.
- No, really, don’t use this. If you use it, don’t report bugs.
- You’ve been warned!
I've never used it as a result. Often we don't want to use remove_namespaces! either, but it appears safe in your situation.
Error parsing XML document with Ruby's Nokogiri
So the issue is that your XML contains namespaces.
There are 2 options:
- Remove the namespaces
doc.remove_namespaces!
doc.at_xpath("//tsn")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>
- Reference the namespace:
doc.at_xpath("//ax21:tsn", 'ax21' => "http://data.itis_service.itis.usgs.gov/xsd")
#=> #<Nokogiri::XML::Element:0x2add795ea3b8 name="tsn" children=[#<Nokogiri::XML::Text:0x2add795e5f70 "26339">]>
Based on the comments it seems you are really only interested in the text for that node. You can retrieve that in multiple ways:
doc.at_xpath("//tsn").text()
#=> "26339"
doc.at_xpath("//tsn/text()").to_s
#=> "26339"
# If you want tsn and kingdom at the same time
doc.xpath('//tsn/text() | //kingdom/text()').map(&:to_s)
#=> ["26339", "Plantae"]