What Are Fast XML Parsers for Ruby

What are fast XML parsers for Ruby?

Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM.
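
For instance, a minimal DOM-plus-XPath sketch (the document and element names here are made up for illustration):

require "nokogiri"

# Parse the whole document into an in-memory DOM...
doc = Nokogiri::XML(<<~XML)
  <books>
    <book lang="en"><title>Dune</title></book>
    <book lang="de"><title>Faust</title></book>
  </books>
XML

# ...then query it with XPath, which needs that DOM to exist.
doc.xpath("//book[@lang='en']/title").each do |node|
  puts node.text # => "Dune"
end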

SAX is often what people turn to for speed, or for large documents that don't fit into memory. It is event-driven: the parser notifies you of a start element, an end element, and so on, and you write handlers that react to those events. It can be a bit of a pain because you end up tracking state yourself (e.g. which elements you're currently "inside").
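
A minimal Nokogiri SAX sketch of that hand-rolled state tracking (the element names are made up):

require "nokogiri"

class TitleCollector < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    @titles = []
    @inside_title = false # state we have to track ourselves
  end

  def start_element(name, _attrs = [])
    @inside_title = true if name == "title"
  end

  def characters(text)
    @titles << text if @inside_title
  end

  def end_element(name)
    @inside_title = false if name == "title"
  end
end

handler = TitleCollector.new
Nokogiri::XML::SAX::Parser.new(handler).parse("<books><book><title>Dune</title></book></books>")
p handler.titles # => ["Dune"]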

There is a middle ground: some parsers have a "pull parsing" capability where you have a cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It's got the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.
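
As a sketch of that cursor style with Nokogiri's Reader API (for XML; the element names are made up):

require "nokogiri"

reader = Nokogiri::XML::Reader(<<~XML)
  <catalog>
    <noise>lots of stuff we skip past</noise>
    <book id="42"><title>Dune</title></book>
  </catalog>
XML

reader.each do |node|
  # Visit nodes sequentially; only stop at opening <book> tags.
  next unless node.name == "book" &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

  puts node.attribute("id") # => "42"
  puts node.inner_xml       # => "<title>Dune</title>"
end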

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML) and this alone makes it a very good choice for HTML parsing.

XML Parser for Ruby

I've been successful with the Nokogiri SAX parser.

Rails XML parsing

There are a lot of Ruby XML parsing libraries. However, if your XML is small, you can use the ActiveSupport Hash extension .from_xml:

Hash.from_xml(x)["message"]["param"].inject({}) do |result, elem|
  result[elem["name"]] = elem["value"]
  result
end
# => {"msg"=>"xxxxxxxxxxxxx", "messageType"=>"SMS", "udh"=>nil, "id"=>"xxxxxxxxxxxxxx", "target"=>"xxxxxxxxxxxxx", "source"=>"xxxxxxxxxxx"}

How to parse a large XML file with Oga in Ruby?

You could use the SAX-style parser. Since SAX parsers don't build a document from the XML, they are useful for parsing large documents.
The drawback is that you will need to keep track of the state on your own. I've never used Oga for SAX parsing, but I assume it will be suitable for your 5 GB XML.

Here is a self-contained example. Just paste it into a file and run it (everything after __END__ is available as input via DATA).

require "oga"

class PeopleHandler
PERSON_PATH = ["xml", "people", "person"]
ATTRIBUTE_PATH = ["xml", "people", "person", "attribute"]
attr_reader :people

def initialize
@people = []
@current_person = nil
@current_path = []
end

def on_element(_namespace, name, attrs = {})
current_path.push(name)
if current_path == PERSON_PATH
people.push({id: attrs["id"]})
elsif current_path == ATTRIBUTE_PATH
people.last[attrs["name"]] = attrs["value"]
end
end

def after_element(_namespace, name)
current_path.pop
end

private

attr_reader :current_path, :current_person
end

handler = PeopleHandler.new

Oga.sax_parse_xml(handler, DATA.read)

p handler.people

# [{:id=>"12", "first-name"=>"Pascal", "country"=>"Switzerland"}, {:id=>"13", "first-name"=>"Fred", "country"=>"Sweden"}, {:id=>"45", "first-name"=>"Karl", "country"=>"Hungary"}]

__END__
<xml>
  <people>
    <person id="12">
      <attribute name="first-name" value="Pascal" />
      <attribute name="country" value="Switzerland" />
    </person>
    <person id="13">
      <attribute name="first-name" value="Fred" />
      <attribute name="country" value="Sweden" />
    </person>
    <person id="45">
      <attribute name="first-name" value="Karl" />
      <attribute name="country" value="Hungary" />
    </person>
  </people>
</xml>

SAX parsers work by emitting events to a handler. A list of available events (the methods that get called) is here: https://github.com/YorickPeterse/oga/blob/master/lib/oga/xml/sax_parser.rb

The sample uses an array (current_path) to keep track of the position inside the document. Perhaps this is not required in your case and the element name is enough.

When a <person> element is reached, I push a Hash onto my list of people. Then, for each <attribute> element, I augment that hash (people.last) with a key/value pair. After parsing is complete, I have a list of people in handler.people that I can process further.

This is only to give you an example of how SAX parsers work.

  • Perhaps you do not need to keep track of the path; the element name alone may be good enough (i.e. when your element has a unique name). Then you can avoid tracking the position in an array.
  • Maybe you do not want to build a collection of items to process further at all. It could be that you trade the memory saved by using a SAX parser for the memory your collected items need. Instead, you might want to process an item as soon as you have all the necessary information (probably in after_element) and then throw it away, as in the sketch below.
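
A hypothetical streaming variant of the handler above (assuming the same attrs-as-Hash behavior; the block and file name are placeholders). It also matches on the element name alone rather than the full path:

require "oga"

class StreamingPeopleHandler
  def initialize(&block)
    @callback = block
    @current_person = nil
  end

  def on_element(_namespace, name, attrs = {})
    case name
    when "person"
      @current_person = {id: attrs["id"]}
    when "attribute"
      @current_person[attrs["name"]] = attrs["value"] if @current_person
    end
  end

  def after_element(_namespace, name)
    # The record is complete here: hand it off, then discard it so
    # memory use stays flat no matter how large the input is.
    if name == "person" && @current_person
      @callback.call(@current_person)
      @current_person = nil
    end
  end
end

handler = StreamingPeopleHandler.new { |person| puts person.inspect }
Oga.sax_parse_xml(handler, File.open("people.xml"))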

If you want to time different sections of your code, a simple approach will give you a rough idea:

t1 = Time.now
operation_1
t2 = Time.now
operation_2
t3 = Time.now
puts "Operation 1 took: #{t2 - t1}"
puts "Operation 2 took: #{t3 - t2}"

How to parse a large XML file in Ruby

You can try using Nokogiri::XML::SAX:

The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
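
Those three steps might look like this (a sketch; the handler, element name, and file name are made up):

require "nokogiri"

# 1. Describe the events you care about in a handler...
class LinkCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(name, _attrs = [])
    @count += 1 if name == "link"
  end
end

# 2. ...create a parser around that handler...
handler = LinkCounter.new
parser = Nokogiri::XML::SAX::Parser.new(handler)

# 3. ...and feed it XML; it streams, so no full DOM is built.
parser.parse(File.open("large.xml"))
puts handler.count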

Parsing Large XML files w/ Ruby & Nokogiri

You can dramatically decrease your execution time by changing your code to the following. Just change the "99" to whatever category you want to check:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")
items.each do |item|
  text = item.children.children.first.text
  if text =~ /99/
    icount += 1
  end
end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount

puts icount
puts othercount

This took about three seconds on my machine. I think a key error you made was in what you chose to iterate over: creating a collection of the "item" nodes (as above) keeps the iteration code simple and fast, whereas your version was awkward and slow.

Rails fastest way to parse XML feed?

it parses the XML feeds on page load

Really bad idea, unless you need super-fresh information and are willing to sacrifice some machine resources for it.

Fetch/parse them in a background process. Store results in a db (or file, whatever works) and serve your local content. This will be much faster.

Parse them in the background even if they change very frequently. That way you don't burn CPU and network capacity by having several web workers do exactly the same work.
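
A hypothetical sketch of that setup with ActiveJob (the job name, feed URL, cache key, and interval are all placeholders; assumes a Rails app with Nokogiri available):

require "open-uri"

class RefreshFeedJob < ApplicationJob
  queue_as :default

  FEED_URL = "https://example.com/feed.xml" # placeholder

  def perform
    xml = URI.open(FEED_URL).read
    items = Nokogiri::XML(xml).xpath("//item").map { |item| item.text.strip }

    # Serve this cached copy from controllers instead of re-parsing per request.
    Rails.cache.write("feed_items", items, expires_in: 10.minutes)
  end
end

# In a controller:
# @items = Rails.cache.read("feed_items") || []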

XML parsing in Ruby

I almost immediately found the answer.

The first thing I did was to search in the ruby source code for the error being thrown.
I found that regex.h was responsible for this.

In regex.h, the code flow is something like this:

/* Maximum number of duplicates an interval can allow.  */
#ifndef RE_DUP_MAX
#define RE_DUP_MAX ((1 << 15) - 1)
#endif

Now the problem here is RE_DUP_MAX. On the AIX box, the same constant is defined somewhere under /usr/include.
I searched for it and found it in

/usr/include/NLregexp.h
/usr/include/sys/limits.h
/usr/include/unistd.h

I am not sure which of the three is actually used (most probably NLregexp.h).
In these headers, the value of RE_DUP_MAX is set to 255! So there is a cap placed on the number of repetitions in a regex!

In short, the reason is that the compilation picks up the system-defined value rather than the one we define in regex.h!

This also answers my question which i had asked recently:
Regex limit in ruby 64 bit aix compilation

I was not able to answer it immediately, as I needed a minimum of 100 reputation. :D
Cheers!


