Getting the Siblings of a Node with Nokogiri

Getting the siblings of a node with Nokogiri

require 'nokogiri'
doc = Nokogiri::XML.parse(File.open('info.xml'))
details = doc.css('details').find{|node| node.css('id').text == "5678"}
email = details.css('email').text # => "zzzz@zzz.com"
images = details.css('image').map(&:text) # => ["images/4.jpg", "images/5.jpg"]

Update: There are shorter, arguably better, ways to grab the details node you want:

details = doc.at('details:has(id[text()="5678"])')

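or, to select the following siblings of the matching id directly (this returns a NodeSet of sibling elements rather than the details node itself):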

details = doc.search('id[text()="5678"] ~ *')

Those are both courtesy of pguardiario.

How to get siblings' child according to specific defined sibling content

You can use the XPath following-sibling axis for this, assuming the target element always comes after role:

# writer and penciler are assumed to be arrays collecting the names
writer = []
penciler = []

doc.xpath('//comic').each do |main_element|
  main_element.xpath("mainsection/credits/credit/role[@id='dfWriter']").each do |n|
    writer << n.xpath('following-sibling::person/displayname').text
  end
  main_element.xpath("mainsection/credits/credit/role[@id='dfPenciler']").each do |n|
    penciler << n.xpath('following-sibling::person/displayname').text
  end
end

Or you can iterate over credit instead of role in the first place:

doc.xpath('//comic').each do |main_element|
  main_element.xpath("mainsection/credits/credit[role/@id='dfWriter']").each do |n|
    writer << n.xpath('person/displayname').text
  end
  main_element.xpath("mainsection/credits/credit[role/@id='dfPenciler']").each do |n|
    penciler << n.xpath('person/displayname').text
  end
end

CSS/Xpath sibling selector in Nokogiri

The problem is actually with your XPath for getting the surname and given names, i.e., the XPath is incorrect on these lines:

puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text

Starting the XPath with // means to look for the node anywhere in the document. You only want to look within the corrdetails node, which means the XPath needs to start with a dot, e.g., .//.

Change the two lines to:

puts surname = corrdetails.xpath( ".//surname" ).text
puts givennames = corrdetails.xpath(".//given-names").text

Using Nokogiri to find element before another element

Nokogiri lets you use XPath expressions to locate an element:

categories = []

doc.xpath("//li").each do |elem|
  categories << elem.parent.xpath("preceding-sibling::h2").last.text
end

categories.uniq!
p categories

The first part finds all li elements. For each one, we go to its parent (ul or ol) and then look for h2 elements that come before it (preceding-sibling). There can be more than one, so we take the last, i.e., the one closest to the current position.

We need to call uniq! because we get the h2 once for each li (the li is the starting point).

Using your own HTML example, this code outputs:

["Destinations", "Shopping List"]

XPath to find all following siblings up until the next sibling of a particular type

One possible solution:

dl.xpath('dt').each_with_index do |dt, i|
  dds = dt.xpath("following-sibling::dd[not(../dt[#{i + 2}]) or " +
                 "following-sibling::dt[1]=../dt[#{i + 2}]]")
  puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
end

This relies on a value comparison of dt elements and will fail when there are duplicates. The following (much more complicated) expression does not depend on unique dt values:

following-sibling::dd[not(../dt[$n]) or 
(following-sibling::dt[1] and count(following-sibling::dt[1]|../dt[$n])=1)]

Note: Your use of self fails because you're not using it as an axis (self::). Also, self always contains just the context node, so it would refer to each dd inspected by the expression, not back to the original dt.

Use XPath to group siblings from an HTML/XML document?

Updated Answer

Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:

class Nokogiri::XML::Node
  # Create a hierarchy on a document based on heading levels
  #   wrap   : e.g. "<section>" or "<div class='section'>"
  #   stops  : array of tag names that stop all sections; use nil for none
  #   levels : array of tag names that control nesting, in order
  def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
    levels = Hash[ levels.zip(0...levels.length) ]
    stops  = stops && Hash[ stops.product([true]) ]
    stack  = []
    children.each do |node|
      unless level = levels[node.name]
        level = stops && stops[node.name] && -1
      end
      stack.pop while (top=stack.last) && top[:level]>=level if level
      stack.last[:section].add_child(node) if stack.last
      if level && level >=0
        section = Nokogiri::XML.fragment(wrap).children[0]
        node.replace(section); section << node
        stack << { :section=>section, :level=>level }
      end
    end
  end
end

Here is this code in use, and the result it gives.

The original HTML

<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>

<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>

<hr /><p id="footer">Copyright © All Done</p>
</body>

The conversion code

# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html,&:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section # make the magic happen
puts doc.to_xhtml # show the result with indentation

The result

<body>
  <section>
    <h1>Main Section 1</h1>
    <p>Intro</p>
    <section>
      <h2>Subhead 1.1</h2>
      <p>Meat</p>
      <p>MOAR MEAT</p>
    </section>
    <section>
      <h2>Subhead 1.2</h2>
      <p>Meat</p>
      <section>
        <h3>Caveats</h3>
        <p>FYI</p>
        <section>
          <h4>ProTip</h4>
          <p>Get it done</p>
        </section>
      </section>
    </section>
    <section>
      <h2>Subhead 1.3</h2>
      <p>Meat</p>
    </section>
  </section>
  <section>
    <h1>Main Section 2</h1>
    <section>
      <h3>Jumpin' in it!</h3>
      <p>Level skip!</p>
    </section>
    <section>
      <h2>Subhead 2.1</h2>
      <p>Back up...</p>
      <section>
        <h4>Dive! Dive!</h4>
        <p>...and down</p>
      </section>
    </section>
  </section>
  <hr />
  <p id="footer">Copyright All Done</p>
</body>

Original Answer

Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).

html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>

<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>

<hr>
<p id='footer'>All done!</p>"

require 'nokogiri'
class Nokogiri::XML::Node
  # Provide a block that returns:
  #   true  - for nodes that should start a new section
  #   false - for nodes that should not start a new section
  #   :stop - for nodes that should stop any current section but not start a new one
  def group_under(name="section")
    group = nil
    element_children.each do |child|
      case yield(child)
      when false, nil
        group << child if group
      when :stop
        group = nil
      else
        group = document.create_element(name)
        child.replace(group)
        group << child
      end
    end
  end
end

doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
  if node.name == 'hr'
    :stop
  else
    %w[h1 h2 h3 h4 h5 h6].include?(node.name)
  end
end

puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>

For XPath, see XPath : select all following siblings until another sibling


