Parsing Large XML files w/ Ruby & Nokogiri
You can dramatically decrease the execution time by changing your code to the following. Just change the "99" to whatever category you want to check:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0

# Parse the file once, then collect all <item> nodes up front.
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")

# Count the items whose text matches the category we care about.
items.each do |item|
  text = item.children.children.first.text
  if text =~ /99/
    icount += 1
  end
end

# Everything else counts against the feed's total.
othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount

puts icount
puts othercount
This took about three seconds on my machine. I think a key error you made was choosing "items" to iterate over instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.
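If the category value appears in the item's text (an assumption, since the feed's exact schema isn't shown here), you can push the counting into XPath itself and skip the Ruby loop entirely. A rough sketch:

require 'nokogiri'

xmlfeed = Nokogiri::XML(File.open("test.xml"))

# Count <item> elements whose text contains "99" anywhere; tighten the
# predicate to a specific child element once you know the real schema.
icount = xmlfeed.xpath("count(//item[contains(., '99')])").to_i

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount
puts icount
puts othercount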
Parsing Large XML with Nokogiri
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as some sort of libxml2 data structure and that will probably be larger than the raw XML file.
Then you do things like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer into the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally iterate over that array:
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
I find that SAX parsers work better with large data sets but they are more cumbersome to work with. You could try creating your own SAX parser something like this:
class D < Nokogiri::XML::SAX::Document
  # Called for every opening tag; set up state for the elements we care about.
  def start_element(name, attrs = [])
    if name == 'DisorderSign'
      @data = {}
    elsif name == 'ClinicalSign'
      @key = :sign
      @data[@key] = ''
    elsif name == 'SignFreq'
      @key = :freq
      @data[@key] = ''
    elsif name == 'Name'
      @in_name = true
    end
  end

  # Text nodes can arrive in several chunks, so append rather than assign.
  def characters(str)
    @data[@key] += str if @key && @in_name
  end

  # Called for every closing tag; tear down state and process finished records.
  def end_element(name)
    if name == 'DisorderSign'
      # Dump @data into the database here.
      @data = nil
    elsif name == 'ClinicalSign'
      @key = nil
    elsif name == 'SignFreq'
      @key = nil
    elsif name == 'Name'
      @in_name = false
    end
  end
end
The structure should be pretty clear: you watch for the opening of the elements you're interested in and do a bit of bookkeeping setup when they open, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
This structure makes it pretty easy to watch for the <Disorder id="17601">
elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
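To actually run a handler like this, you wrap it in a Nokogiri::XML::SAX::Parser and hand it the file. A minimal sketch, assuming the input file is named symptoms.xml (the name is a placeholder) and class D is the one defined above:

require 'nokogiri'

# The parser streams the file through the handler's callbacks,
# so memory use stays roughly flat no matter how large the XML is.
parser = Nokogiri::XML::SAX::Parser.new(D.new)
parser.parse(File.open("symptoms.xml"))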
Reading large XML file with Nokogiri
The XML has a default namespace declared at the root element level:
xmlns="http://www.w3.org/2005/Atom"
In XML, descendant elements without a prefix implicitly inherit the default namespace from their ancestors. That means the entry
element you tried to get is in the root element's default namespace.
In XPath, on the other hand, an element without a prefix is always considered to be in no namespace. To reference an element in the XML's default namespace from XPath, we need to map a prefix to the default namespace URI and use that prefix in our XPath expression, for example:
page.xpath("//d:entry", 'd' => 'http://www.w3.org/2005/Atom')
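Putting that together, a small self-contained sketch (the toy feed below just stands in for your real document):

require 'nokogiri'

xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry><title>First</title></entry>
    <entry><title>Second</title></entry>
  </feed>
XML

page = Nokogiri::XML(xml)

# Map an arbitrary prefix ('d' here) to the default namespace URI
# and use that prefix in the XPath expression.
page.xpath("//d:entry/d:title", 'd' => 'http://www.w3.org/2005/Atom').each do |title|
  puts title.text
end

# Alternatively, Nokogiri can strip namespaces entirely, after which
# unprefixed XPath works -- convenient, but it discards namespace information.
page.remove_namespaces!
puts page.xpath("//entry").size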
How do I use Nokogiri::XML::Reader to parse large XML files?
Each element in the stream comes through as two events: one to open the element and one to close it. The opening event will have
node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
and the closing event will have
node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
The empty strings you're seeing are just the element-closing events. Remember that with a streaming parser like this you're basically walking through a tree, so you need the second event to tell you when you're going back up and closing an element.
You probably want something more like this:
reader.each do |node|
  if node.name == "PMID" && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    p << node.inner_xml
  end
end
Or perhaps:
reader.each do |node|
  next if node.name != 'PMID'
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  p << node.inner_xml
end
Or some other variation on that.
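A fuller, self-contained version of that loop might look like this (the pubmed.xml file name and the pmids array are placeholders for whatever your script already uses):

require 'nokogiri'

pmids = []

# Reader streams the document node by node instead of building
# the whole tree in memory.
reader = Nokogiri::XML::Reader(File.open("pubmed.xml"))

reader.each do |node|
  # Keep only the *opening* event of each <PMID> element; the matching
  # closing event is what was producing the empty strings.
  next unless node.name == "PMID" &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

  pmids << node.inner_xml
end

puts pmids.length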
How to parse large xml file in ruby
You can try using Nokogiri::XML::SAX.
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
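A minimal sketch of that approach, along the lines of the handler shown earlier (the element name record and the file name big.xml are placeholders):

require 'nokogiri'

# The handler receives callbacks as the parser streams through the file.
class RecordCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  # Called once for every opening tag the parser encounters.
  def start_element(name, attrs = [])
    @count += 1 if name == 'record'
  end
end

handler = RecordCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open("big.xml"))
puts handler.count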