Trying to Get Content Inside CDATA Tags in an XML File Using Nokogiri

Trying to get content inside CDATA tags in an XML file using Nokogiri

You're trying to parse XML using Nokogiri's HTML parser. If the node came from the XML parser, then r would be nil, since XML is case-sensitive; your r is not nil, so you're using the HTML parser, which is case-insensitive.

Use Nokogiri's XML parser and you will get things like this:

>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"

and you'll be able to get at the CDATA through r.text or r.children.

How to get Nokogiri to parse XML with CDATA in and

Your input contains escaped < and > characters (&lt; and &gt;). When you use the literal characters instead of HTML entities, everything works as expected:

input = "<DATA>
<NAME><![CDATA[FIRSTNAME LASTNAME MIDDLENAME ]]></NAME>
<NUM>3731</NUM>
<person_type>4</person_type>
<birth_date><![CDATA[01.11.1992]]></birth_date>
<DESCRIPTION><![CDATA[DESCRIPTION]]></DESCRIPTION>
</DATA>"
doc = Nokogiri::XML(input)
doc.xpath('//DATA/NAME').text

=> "FIRSTNAME LASTNAME MIDDLENAME "

doc.xpath('//DATA').each do |terr|
  puts "\nName: " + terr.xpath('NAME').text
end

=> Name: FIRSTNAME LASTNAME MIDDLENAME

To get rid of HTML entities, you can call CGI.unescapeHTML on the input:

require 'cgi'

doc = Nokogiri::XML(CGI.unescapeHTML(File.read("test2.xml")))
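
To see what the unescaping step does on its own, here is a minimal sketch using only the standard library. The escaped string is a made-up stand-in for the contents of test2.xml.

```ruby
require 'cgi'

# Hypothetical escaped fragment standing in for the contents of test2.xml.
escaped = "&lt;NAME&gt;&lt;![CDATA[FIRSTNAME LASTNAME]]&gt;&lt;/NAME&gt;"

# Turn the entities back into literal markup characters.
unescaped = CGI.unescapeHTML(escaped)

puts unescaped   # <NAME><![CDATA[FIRSTNAME LASTNAME]]></NAME>
```

One caveat with unescaping a whole file: any &amp; that was meant as a literal ampersand in the data becomes a bare &, which is not well-formed XML, so this approach is probably best kept for input you know to be escaped as a whole.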

How do I access the CDATA in a title tag in XML with Nokogiri?

Use Nokogiri::XML, not Nokogiri::HTML:

2.3.0 :023 > doc = Nokogiri::XML(data)
=> #<Nokogiri::XML::Document:0x3fcd0e41a42c name="document" children=[#<Nokogiri::XML::Element:0x3fcd0e417f60 name="item" attributes=[#<Nokogiri::XML::Attr:0x3fcd0e417efc name="rdf:about" value="http://auburn.craigslist.org/cpg/5368609005.html">] children=[#<Nokogiri::XML::Text:0x3fcd0e417948 "\n">, #<Nokogiri::XML::Element:0x3fcd0e417830 name="title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e417574 "Help Wanted for Online Business">]>, #<Nokogiri::XML::Text:0x3fcd0e417204 "\n">, #<Nokogiri::XML::Element:0x3fcd0e41709c name="link" children=[#<Nokogiri::XML::Text:0x3fcd0e416cf0 "http://auburn.craigslist.org/cpg/5368609005.html">]>, #<Nokogiri::XML::Text:0x3fcd0e416aac "\n">, #<Nokogiri::XML::Element:0x3fcd0e4169bc name="description" children=[#<Nokogiri::XML::CDATA:0x3fcd0e416728 "Create a safer environment for your children and WORK FROM HOME helping others do the same. \nNO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]">]>, #<Nokogiri::XML::Text:0x3fcd0e416444 "\n">, #<Nokogiri::XML::Element:0x3fcd0e41632c name="dc:date" children=[#<Nokogiri::XML::Text:0x3fcd0e413e74 "2016-01-16T09:14:35-06:00">]>, #<Nokogiri::XML::Text:0x3fcd0e4135f0 "\n">, #<Nokogiri::XML::Element:0x3fcd0e413028 name="dc:language" children=[#<Nokogiri::XML::Text:0x3fcd0e41277c "en-us">]>, #<Nokogiri::XML::Text:0x3fcd0e412588 "\n">, #<Nokogiri::XML::Element:0x3fcd0e412420 name="dc:rights" children=[#<Nokogiri::XML::Text:0x3fcd0e412128 "© 2016 <span class=\"desktop\">craigslist</span><span class=\"mobile\">CL</span>">]>, #<Nokogiri::XML::Text:0x3fcd0e40fe78 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40fd9c name="dc:source" children=[#<Nokogiri::XML::Text:0x3fcd0e40fae0 "http://auburn.craigslist.org/cpg/5368609005.html">]>, #<Nokogiri::XML::Text:0x3fcd0e40f7c0 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40f6bc name="dc:title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e40f3c4 "Help Wanted for 
Online Business">]>, #<Nokogiri::XML::Text:0x3fcd0e40f0cc "\n">, #<Nokogiri::XML::Element:0x3fcd0e40ef78 name="dc:type" children=[#<Nokogiri::XML::Text:0x3fcd0e40eb2c "text">]>, #<Nokogiri::XML::Text:0x3fcd0e40e6a4 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40e500 name="dcterms:issued" children=[#<Nokogiri::XML::Text:0x3fcd0e40e08c "2016-01-16T09:14:35-06:00">]>, #<Nokogiri::XML::Text:0x3fcd0e407944 "\n">]>]>
2.3.0 :026 > doc.at_xpath('//title')
=> #<Nokogiri::XML::Element:0x3fcd0e417830 name="title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e417574 "Help Wanted for Online Business">]>
2.3.0 :027 > doc.at_xpath('//title').text
=> "Help Wanted for Online Business"

Find element in XML file by CDATA attribute using Nokogiri?

Don't worry about CDATA; Nokogiri can deal with it. You can simply iterate over every <product> and, for each one, over its children (in the code below I limited the children to just three of them).

doc = Nokogiri::XML(xml)
out = []
doc.xpath('//products/product').each do |product|
  h = {}
  product.xpath('name | price | SKU').each do |child|
    h[child.name] = child.text.strip
  end
  out << h
end

The result is an array of hashes:

[{"name"=>"name1", "price"=>"Price1", "SKU"=>"p-1"},
{"name"=>"name2", "price"=>"Price2", "SKU"=>"p-2"}]

Nokogiri extract data from xml

The result of an xpath call made in Nokogiri is a NodeSet, which is simply a list of Nokogiri nodes.

With this in mind we can just pull examples from the Nokogiri Documentation and adapt them.

To answer your question, "Could you show me how extract just the src attribute from the img tag ?", here is one such way.

require 'nokogiri'
require 'open-uri'

# URI.open comes from the open-uri library
xml = Nokogiri::XML(URI.open(your_url_here))

all_images = xml.xpath("//img") # returns a NodeSet (a list of Nokogiri nodes)

image_sources = []

# iterate through each node
all_images.each do |node|
  image_sources << node.get_attribute('src') # one method
  # image_sources << node['src'] # another convention we could use
end

As Phrogz notes below, a more idiomatic way of pulling the 'src' attribute from all of the image nodes is to map over the nodes directly rather than iterating and pushing onto an array.

image_sources = all_images.map{ |node| node['src'] }

How to use SAX to get CDATA content

It's not clear what you're trying to do, but this might help clear things up.

A <![CDATA[...]]> entry isn't a tag, it's a block, and is treated differently by the parser. When the block is encountered the <![CDATA[ and ]]> are stripped off so you'll only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.

If you're trying to create a CDATA block in XML it can be done easily using:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string') << Nokogiri::XML::CDATA.new(Nokogiri::XML::Document.new, "Hey I'm a tag with & and other characters")
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"

<< is just shorthand for add_child, which appends the node as a child.

Trying to use inner_html doesn't do what you want as it creates a text node as a child:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
doc.at('string').children.first.class # => Nokogiri::XML::Text

Using inner_html causes HTML encoding of the string to occur, which is the alternative way of embedding text that could include tags. Without the encoding or using CDATA the XML parsers could get confused about what is text versus what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded embedded HTML in a feed is a pain.

How to get values in XML data using Nokogiri?

str = "<roar ......"
doc = Nokogiri.XML(str)
puts doc.xpath('//create_oauth/@status') # => ok
puts doc.xpath('//auth_token').text # => 148....
# player_id is the same as auth_token

It is also a good idea to learn some XPath, for example from w3schools.

Building blank XML tags with Nokogiri?

SaveOptions::NO_EMPTY_TAGS will get you what you want.

require 'nokogiri'

builder = Nokogiri::XML::Builder.new do |xml|
  xml.blah(nil)
end

puts 'broken:'
puts builder.to_xml
puts 'fixed:'
puts builder.to_xml(save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS)

output:

(511)-> ruby derp.rb 
broken:
<?xml version="1.0"?>
<blah/>
fixed:
<?xml version="1.0"?>
<blah></blah>

How to create <text> ... </text> node in Nokogiri?

From the docs:

The builder works by taking advantage of method_missing. Unfortunately some methods are defined in ruby that are difficult or dangerous to remove. You may want to create tags with the name “type”, “class”, and “id” for example. In that case, you can use an underscore to disambiguate your tag name from the method call.

Appending an underscore also works for “text”, i.e. use text_ instead:

builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
  xml.job {
    xml.text_ {
      xml.cdata 'foo bar baz'
    }
  }
end

puts builder.to_xml

Output:

<?xml version="1.0" encoding="UTF-8"?>
<job>
  <text><![CDATA[foo bar baz]]></text>
</job>

Navigating an XML doc with nokogiri

Try using the XPath /Lineup/Player/* to inspect each child element of each "Player" node:

doc = Nokogiri::XML(File.read('my.xml'))
doc.xpath('/Lineup/Player/*').each do |node|
  puts "#{node.name}: #{node.text}"
end
# GameID: 20150010
# GameDate: 2015-04-06T00:00:00-04:00
# DH: 0
# ...etc...

Alternatively, you can select each "Player" and iterate over its element children (by using #elements or #element_children):

doc.xpath('/Lineup/Player').each do |player|
  puts "-- NEXT PLAYER --"
  player.elements.each do |node|
    puts "#{node.name}: #{node.text}"
  end
end

