How to Parse XML Nodes to CSV with Ruby and Nokogiri

How to parse XML nodes to CSV with Ruby and Nokogiri

This assumes that each Offer element always has the same child nodes (though they can be empty):

CSV.open('output.csv', 'wb') do |csv|
  doc.search('Offer').each do |x|
    csv << x.search('*').map(&:text)
  end
end

And to get headers (from the first Offer element):

CSV.open('output.csv', 'wb') do |csv|
  csv << doc.at('Offer').search('*').map(&:name)
  doc.search('Offer').each do |x|
    csv << x.search('*').map(&:text)
  end
end

search and at are Nokogiri methods that can take either XPath or CSS selector strings. at returns the first matching element; search returns a NodeSet of matching elements (an array-like collection, empty if nothing matches). The * in this case selects the element nodes inside the current node, which here means the children of each Offer.

Both name and text are also Nokogiri functions (for an element). name provides the element's name; text provides the text or CDATA content of a node.
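
To see the whole thing in one runnable piece, here's a minimal sketch. The Offers document below is made up purely for illustration; swap in your own doc and filename:

require 'csv'
require 'nokogiri'

# A throwaway document with the shape the question describes.
doc = Nokogiri::XML(<<EOT)
<Offers>
  <Offer><Sku>1</Sku><Price>9.99</Price></Offer>
  <Offer><Sku>2</Sku><Price></Price></Offer>
</Offers>
EOT

CSV.open('output.csv', 'wb') do |csv|
  csv << doc.at('Offer').search('*').map(&:name)  # header row: Sku,Price
  doc.search('Offer').each do |offer|
    csv << offer.search('*').map(&:text)          # one row per Offer
  end
end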

How to use Nokogiri to combine multiple like-formatted XML files into CSV

I'd handle the first part, parsing one XML file, like this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<advisories>
  <advisory id="12345">
    <link> https://www.google.com </link>
    <title> The Short Description Would Go Here </title>
    <location> Location Name Here </location>
    <os>
      <language>
        <name>en</name>
      </language>
    </os>
    <reference>
      <name>Full</name>
    </reference>
  </advisory>
  <advisory id="98765">
    <link> https://www.msn.com </link>
    <release_date>2016-04-08</release_date>
    <title> The Short Description Would Go Here </title>
    <location> Location Name Here </location>
    <os>
      <language>
        <name>fr</name>
      </language>
    </os>
    <reference>
      <name>Partial</name>
    </reference>
  </advisory>
</advisories>
EOT

Note: Some nodes were removed because they weren't important to the question. Please remove fluff when asking, as it's distracting.

With this being the core of the code:

doc.search('advisory').map { |advisory|
  link = advisory.at('link').text
  title = advisory.at('title').text
  location = advisory.at('location').text
  os_language_name = advisory.at('os > language > name').text
  reference_name = advisory.at('reference > name').text

  {
    link: link,
    title: title,
    location: location,
    os_language_name: os_language_name,
    reference_name: reference_name
  }
}

That could be DRY'd but was written as an example of what to do.

Running that results in an array of hashes, which would be easily output via CSV:

# => [
{:link=>" https://www.google.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"en", :reference_name=>"Full"},
{:link=>" https://www.msn.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"fr", :reference_name=>"Partial"}
]

Once you've got that working, fit it into a modified version of your loops so it reads the XML files and outputs CSV. This is untested but looks about right:

CSV.open('output_file.csv', 'w',
         headers: ['Link', 'Title', 'Location', 'OS Name', 'Reference Name'],
         write_headers: true
) do |csv|
  Dir.glob('*.xml').each do |file|
    xml = Nokogiri::XML(File.read(file))
    # parse a file and get the array of hashes
  end

  # pass the array of hashes to CSV for output
end
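
Here's a sketch of how those pieces could fit together. extract_advisories is a hypothetical helper wrapping the map block shown earlier; adjust the names and selectors to your real data:

require 'csv'
require 'nokogiri'

# Hypothetical helper wrapping the doc.search('advisory').map block from above.
def extract_advisories(doc)
  doc.search('advisory').map { |advisory|
    {
      link:             advisory.at('link').text,
      title:            advisory.at('title').text,
      location:         advisory.at('location').text,
      os_language_name: advisory.at('os > language > name').text,
      reference_name:   advisory.at('reference > name').text
    }
  }
end

CSV.open('output_file.csv', 'w',
         headers: ['Link', 'Title', 'Location', 'OS Name', 'Reference Name'],
         write_headers: true
) do |csv|
  Dir.glob('*.xml').each do |file|
    doc = Nokogiri::XML(File.read(file))
    extract_advisories(doc).each { |row| csv << row.values }
  end
end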

Note that you were using a file mode of 'wb'. You rarely need b with CSV as CSV is supposed to be a text format. If you are sure you will encounter binary data then use 'b' also, but that could lead down a path containing dragons.

Also note that this is using read. read is not scalable: it doesn't care how big a file is, it will try to pull the whole thing into memory whether or not it actually fits. There are lots of reasons to avoid that, but the biggest is that it can bring your machine to its knees. If your XML files could exceed the available free memory on your system, you'll want to rewrite this using a SAX parser, which Nokogiri supports. How to do that is a different question.


It was actually an array of arrays of hashes. I'm not sure how I ended up there, but I was easily able to use array.flatten.

Meditate on this:

foo = []    # => []
foo << [{}] # => [[{}]]
foo.flatten # => [{}]

You probably wanted to do this:

foo = []    # => []
foo += [{}] # => [{}]

Any time I have to use flatten I look to see whether I can build the array without it being an array of arrays in the first place. Nested arrays aren't inherently bad, and sometimes they're very useful, but you wanted an array of hashes, so you knew something was wrong, and flatten was a cheap way out that also costs extra CPU time. It's better to find the real problem and fix it, ending up with faster, more efficient code. (Some will say that's wasted effort or premature optimization, but writing efficient code is a very good trait and goal.)
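
In the code above, the array of arrays almost certainly came from pushing each file's array of hashes onto the accumulator with <<. Using concat (or +=) keeps it flat without flatten. A sketch, reusing the hypothetical extract_advisories helper from earlier:

all_rows = []

Dir.glob('*.xml').each do |file|
  doc  = Nokogiri::XML(File.read(file))
  rows = extract_advisories(doc)   # an Array of Hashes

  # all_rows << rows     # would build [[{...}, {...}], ...], an array of arrays
  all_rows.concat(rows)  # keeps all_rows a flat array of hashes
end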

Parsing XML into CSV using Nokogiri

Here's how to get at the make and model data. How to convert it to CSV is left to you:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
  <Vehicles>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes &amp; Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes &amp; Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
  </Vehicles>
  <Errors/>
  <InvalidVINMsg/>
</VINResult>
EOT

vehicle_make_and_models = doc.search('Vehicle').map { |vehicle|
  [
    'make', vehicle.at('Make').content,
    'model', vehicle.at('Model').content
  ]
}

This results in:

vehicle_make_and_models # => [["make", "Freightliner", "model", "FLD12064T"], ["make", "Freightliner", "model", "FLD12064T"]]

If you don't want the field names:

vehicle_make_and_models = doc.search('Vehicle').map { |vehicle|
  [
    vehicle.at('Make').content,
    vehicle.at('Model').content
  ]
}

vehicle_make_and_models # => [["Freightliner", "FLD12064T"], ["Freightliner", "FLD12064T"]]
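
Since the result is already an array of rows, getting it into CSV is one << per row. A sketch; the filename and header names here are arbitrary:

require 'csv'

CSV.open('vehicles.csv', 'w') do |csv|
  csv << %w[make model]                              # header row (any names you like)
  vehicle_make_and_models.each { |row| csv << row }  # one row per vehicle
end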

Note: You have XML, not HTML. Don't assume that Nokogiri treats them the same, or that the difference is insignificant. Nokogiri parses XML strictly, since XML is a strict standard.

I use CSS selectors unless I absolutely have to use XPath. CSS results in a much clearer selector most of the time, which results in easier to read code.

vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text doesn't work the way you expect. A leading // means "start at the top of the document", and each additional // tells the parser to search every descendant at any depth instead of stepping directly to a child, so the selector does far more work than it needs to. Also, xpath returns all matching nodes as a NodeSet, not a single Node, and calling text on a NodeSet returns the text of every matched node concatenated into one string, which is probably not what you want.

I prefer to use search instead of xpath or css. It returns a NodeSet like the other two, but it also lets us use either CSS or XPath selectors. If a particular selector is ambiguous and could be interpreted as either CSS or XPath, you can fall back to the explicit form. Likewise, you can use at, at_css or at_xpath to find just the first matching node, which is equivalent to search('foo').first.
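
For a plain, non-namespaced document, these all find the same first node; a quick sketch with a throwaway document:

require 'nokogiri'

doc = Nokogiri::XML('<cars><car><make>Ford</make></car></cars>')

doc.search('make').first.text # => "Ford"
doc.at('make').text           # => "Ford" (CSS or XPath, Nokogiri figures it out)
doc.at_css('make').text       # => "Ford" (explicitly CSS)
doc.at_xpath('//make').text   # => "Ford" (explicitly XPath)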

XML to CSV in Ruby

You need to build an array of hashes first, extract the keys as headers, and then put the values in the right columns, with all child nodes flattened to columns and the root and record keys ignored.

Something like this:

require 'nokogiri'
require 'set'

file = File.read('jobProfile.xml')
doc = Nokogiri::XML(file)
record = {}
keys = Set.new
records = []

doc.traverse do |node|
  value = node.text.gsub(/\n +/, '')
  next if node.name == "text"  # skip text nodes
  next if value.empty?         # skip empty nodes

  key = node.name.gsub(/sd:/, '').to_sym
  if key == :Job_Profile_Data && !record.empty?
    # a new, non-empty record: save the previous one and start a fresh record
    records << record
    record = {}
  elsif key[/Job_Profile|^root$|^document$/]
    # ignore these keys
  else
    # in case our value is HTML instead of plain text
    record[key] = Nokogiri::HTML.parse(value).text
    # a Set only adds the key if it isn't already there
    keys << key
  end
end

# build our csv
File.open('./xmloutput.csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
end

Which gives us the following in the CSV file:

"Job_Code","Effective_Date","Job_Title","Inactive","Include_Job_Code_in_Name","Job_Description","Additional_Job_Description","Work_Shift_Required","Public_Job"
"30000","1900-01-01","Chief Executive Officer","","","","","","",
"30100","1900-01-01","Administrator Job Profile","","","","","","",
"30200","1900-01-01","Facilities & Grounds Maintenance Attendant","0","0","Job Description rich text!","Additional Job Description rich text!","0","1",
"30300","1900-01-01","Sample_Job_Title","0","0","Sample Job Description","Sample Additional Job Description","1","",

How to use Ruby and Nokogiri to parse XML

You can get multiple element texts as an Array in the following way:

require 'nokogiri'
require 'csv'

fwpol = File.open(ARGV[0]) { |f| Nokogiri::XML(f) }
rule_array = []

fwpol.xpath('./table/tablebody/tablerow').each do |item|
  rules = {}

  rules[:name]   = item.xpath('./tablecell[1]/item').text
  rules[:action] = item.xpath('./tablecell[2]/item').text
  rules[:source] = item.xpath('./tablecell[3]/item').map(&:text)
  rule_array << rules
end

puts rule_array

The output is:

{:name=>"test_inbound", :action=>"Allow", :source=>["[Group] test_b2_group"]}
{:name=>"host02_inbound", :action=>"Allow", :source=>["[Group] host02_group"]}
{:name=>"randomhost", :action=>"Allow", :source=>["[Group] Host_group_2", "[Group] another_server"]}

How to use Nokogiri to make many changes to an XML file

I'd start with something like this:

require 'csv'
require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ImageFile>100024.jpg</ImageFile>
<AltImageFile1>103387-1.jpg</AltImageFile1>
<ItemNumber>100024</ItemNumber>
<ModelNumber>64707</ModelNumber>
</Product>

<Product>
<Name>1-1/4-Inch Lavatory Pop Up Assembly</Name>
<Description>Classic chrome finish with ABS plastic top &amp; body includes push rod, no overflow.</Description>
<ImageFile>100024.jpg</ImageFile>
<AltImageFile1>103429-1.jpg</AltImageFile1>
<ItemNumber>100024</ItemNumber>
<ModelNumber>64707</ModelNumber>
</Product>

<Product>
<Name>30-Inch Belt-Drive Whole-House Attic Fan With Shutter</Name>
<Description>The 30" belt drive whole house fan (5700 CFM) with automatic shutter helps cool living spaces up to 1900 square feet. It runs on high & low and a 2 speed wall switch is included. The automatic shutter is white. It needs 1095 square inches of open exhaust vents in attic space, with a rough opening of 34-1/4" x 29". You do have to cut joist when installing fan, with the motor mounted on struts above housing. The fan will be quieter than direct drive models. There is a 10 year limited parts warranty, 5 year limited labor warranty.</Description>

<ImageFile>100073.jpg</ImageFile>
<AltImageFile1/>
<ItemNumber>100024</ItemNumber>
<ModelNumber>64707</ModelNumber>
</Product>
</Products>
EOT

This is the logic:

NODES_TO_COLUMNS = {
  'ItemNumber'  => 'handle',
  'Name'        => 'title',
  'Description' => 'description',
  # 'FLDeptName' => 'collection',
  'ImageFile'   => 'image1'
}

all_things = doc.search('Product').map do |product|
  NODES_TO_COLUMNS.keys.map { |node|
    product.at(node).text
  }
end

CSV.open('/dev/stdout', 'wb') do |csv|
  csv << NODES_TO_COLUMNS.values
  all_things.each do |r|
    csv << r
  end
end

Which, when run, results in:

handle,title,description,image1
100024,36-In. Homeowner Bent Single-Bit Axe Handle,"This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.",100024.jpg
100024,1-1/4-Inch Lavatory Pop Up Assembly,"Classic chrome finish with ABS plastic top & body includes push rod, no overflow.",100024.jpg
100024,30-Inch Belt-Drive Whole-House Attic Fan With Shutter,"The 30"" belt drive whole house fan (5700 CFM) with automatic shutter helps cool living spaces up to 1900 square feet. It runs on high & low and a 2 speed wall switch is included. The automatic shutter is white. It needs 1095 square inches of open exhaust vents in attic space, with a rough opening of 34-1/4"" x 29"". You do have to cut joist when installing fan, with the motor mounted on struts above housing. The fan will be quieter than direct drive models. There is a 10 year limited parts warranty, 5 year limited labor warranty.",100073.jpg

Because FLDeptName is missing from the sample XML, which it shouldn't be in a proper question on SO, I commented it out. How to use it is left to you.

You'll want to change:

CSV.open('/dev/stdout', 'wb') do |csv|

to whatever you want to use for the filename. '/dev/stdout' is merely a way for me to keep the code unchanged while routing the output to STDOUT so it can be displayed.

In your code you're using things like:

xpath("./ItemNumber").first.text

Don't do that. Nokogiri provides the at shortcut, which is equivalent to xpath(...).first but more succinct. Also, it's not necessary to use xpath, as Nokogiri's search and at methods are smart enough to figure out whether a selector is XPath or CSS almost every time.

I'd also recommend not using XPath unless forced to. CSS selectors are more readable and include many jQuery CSS extensions (if not all by now) so you can avoid some XPath visual noise using them.

Your requirement to create a secondary, mostly-blank row when AltImageFile1 is not empty isn't something I'd do or recommend. A CSV row is considered a single, separate record, and every application I've seen that supports CSV interprets it that way, so you're asking to create secondary records with missing fields, which is a non-standard format. Instead, that field should be appended to the same row as an additional field. That logic isn't hard and is left for you to figure out (there's a sketch of it at the end of this answer).

The IETF CSV specification (RFC 4180) says:

Each record is located on a separate line, delimited by a line
break (CRLF). For example:

   aaa,bbb,ccc CRLF
   zzz,yyy,xxx CRLF

As a result, not following that will break data movement through a lot of other applications, which you want to avoid since CSV exists for data transfer.

If you are moving the data into a database, create a temporary table to import into directly from the XML, run database statements to manipulate the records appropriately, and then append them to the primary table. If you are importing the data into Excel, use a separate sheet, modify the fields, then copy or merge the data into the regular sheet. Creating a non-standard representation of the data seems like a dead end to me.

An alternative would be to use YAML files, which are more flexible and much more robust.
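
For what it's worth, here's a sketch of what appending AltImageFile1 as an extra field on the same row could look like, reusing doc and NODES_TO_COLUMNS from above. The 'image2' header name is my own placeholder:

all_things = doc.search('Product').map do |product|
  row = NODES_TO_COLUMNS.keys.map { |node| product.at(node).text }
  alt = product.at('AltImageFile1')
  row << (alt ? alt.text.strip : '')   # extra column instead of a second row
end

CSV.open('/dev/stdout', 'wb') do |csv|
  csv << (NODES_TO_COLUMNS.values + ['image2'])  # 'image2' is just a placeholder name
  all_things.each { |r| csv << r }
end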

How to parse XML to CSV where data is in attributes only

I'd simplify it a bit using CSS accessors:

xml = <<EOT
<ig:identification_guide>
  <ig:prescribed_item class_ref="0161-1#01-765557#1">
    <ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
      <dt:measure_number_type representation_ref="0161-1#04-000005#1">
        <dt:real_type>
          <dt:real_format pattern="\d(1,)\.\d(1,)"/>
        </dt:real_type>
        <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
      </dt:measure_number_type>
    </ig:prescribed_property>
    <ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
      <dt:measure_number_type representation_ref="0161-1#04-000005#1">
        <dt:real_type>
          <dt:real_format pattern="\d(1,)\.\d(1,)"/>
        </dt:real_type>
        <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
      </dt:measure_number_type>
    </ig:prescribed_property>
  </ig:prescribed_item>
</ig:identification_guide>
EOT

require 'nokogiri'

doc = Nokogiri::XML(xml)

data = [ %w[class_ref property_ref is_required UOM_ref] ]

doc.css('|prescribed_item').each do |pi|
  pi.css('|prescribed_property').each do |pp|
    data << [
      pi['class_ref'],
      pp['property_ref'],
      pp['is_required'],
      pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
    ]
  end
end

puts data.map{ |row| row.join('|') }

Which outputs:

class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
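
If you need an actual file rather than joined strings printed to the screen, the same data array drops straight into the CSV library. A sketch; col_sep is only needed if you really want pipes instead of commas:

require 'csv'

CSV.open('output.csv', 'w', col_sep: '|') do |csv|
  data.each { |row| csv << row }
end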

Could you explain this line in greater detail: pp.at_css('|prescribed_unit_of_measure')['UOM_ref']?

In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.

The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.

Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
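
Broken into steps, that one-liner is equivalent to:

# pp is a "prescribed_property" node at this point in the loop
uom_node = pp.at_css('|prescribed_unit_of_measure') # first matching descendant node
uom_ref  = uom_node['UOM_ref']                      # that node's UOM_ref attribute value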

As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its Hpricot compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect that's to avoid confusion with the regular uses of those operators; at least that's why I avoid them.

Also, Nokogiri's CSS accessors have some extra-special juiciness: they support namespaces, like the XPath accessors do, only they use |. Nokogiri will also let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs on CSS and namespaces for more information.


