Nokogiri: Select Content Between Element A and B

Nokogiri: Select content between element A and B

A way-too-smart one-liner that uses recursion:

def collect_between(first, last)
  first == last ? [first] : [first, *collect_between(first.next, last)]
end

An iterative solution:

def collect_between(first, last)
  result = [first]
  until first == last
    first = first.next
    result << first
  end
  result
end
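
Both versions assume that last really is reachable from first by following next. If it is not, the walk runs off the end of the sibling chain and fails on nil. A minimal sketch of a guarded variant follows; FakeNode and collect_between_safe are hypothetical names, and FakeNode is only a stand-in for a Nokogiri node (which responds to next in the same way) so that the example runs without any HTML:

```ruby
# Hypothetical stand-in for a Nokogiri node; only `name` and `next` are modelled.
class FakeNode
  attr_reader :name
  attr_accessor :next

  def initialize(name)
    @name = name
  end
end

# Like the iterative version, but returns nil instead of raising
# when `last` is never reached.
def collect_between_safe(first, last)
  result = []
  node = first
  while node
    result << node
    return result if node == last
    node = node.next
  end
  nil
end

# Build the sibling chain a -> b -> c -> d.
a, b, c, d = %w[a b c d].map { |name| FakeNode.new(name) }
a.next = b
b.next = c
c.next = d

p collect_between_safe(a, c).map(&:name) # a, b and c
p collect_between_safe(c, a)             # nil, because a does not follow c
```

Returning nil is just one possible design choice; raising a descriptive error would work equally well.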

EDIT: (Short) explanation of the asterisk

It's called the splat operator. It "unrolls" an array:

array = [3, 2, 1]
[4, array] # => [4, [3, 2, 1]]
[4, *array] # => [4, 3, 2, 1]

some_method(array) # => some_method([3, 2, 1])
some_method(*array) # => some_method(3, 2, 1)

def other_method(*array); array; end
other_method(1, 2, 3) # => [1, 2, 3]

Grab everything between b elements with Nokogiri

I reduced your HTML to be less verbose. It achieves the same thing without extra text.

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<tr class="level2">
<td>
<b>word</b>
"Text I need"
<b>word</b>
"Text I need"
<i>blabla</i>
"Text I need"
<b>word</b>
"Text I need"
<i>blabla</i>
"Text I need"
<i>blabla</i>
<b>word</b>
</td>
</tr>
EOT

doc.search('td i').remove

Since the <i> nodes aren't needed, simply strip them. The resulting doc looks like:

puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <tr class="level2">
# >> <td>
# >> <b>word</b>
# >> "Text I need"
# >> <b>word</b>
# >> "Text I need"
# >>
# >> "Text I need"
# >> <b>word</b>
# >> "Text I need"
# >>
# >> "Text I need"
# >>
# >> <b>word</b>
# >>
# >> </td>
# >> </tr>
# >> </body></html>

Once the <i> nodes are gone it's possible to iterate over the contents of the <td> and process their text:

text = doc.at('td').children
  .reject { |n| n.text.strip == '' }
  .slice_before { |n| n.name == 'b' }
  .map { |a| a.map { |n| n.text.strip } }

At this point text contains:

text
# => [["word", "\"Text I need\""],
# ["word", "\"Text I need\"", "\"Text I need\""],
# ["word", "\"Text I need\"", "\"Text I need\""],
# ["word"]]

Note that there's a trailing ["word"] group, which mirrors the sample HTML you gave. If you know you don't want to keep a trailing group, you can simply pop it off. If groups containing only a single item are never wanted, you could instead iterate over the list and reject those. How to handle it depends on your data.
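
Both clean-up options can be sketched in plain Ruby. The GROUPS constant below is copied from the result shown above; drop_trailing and drop_singles are hypothetical helper names, not part of Nokogiri:

```ruby
# The grouped result shown above, copied verbatim.
GROUPS = [
  ['word', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word']
].freeze

# Option 1: discard the last group, whatever it contains.
def drop_trailing(groups)
  groups[0...-1]
end

# Option 2: discard every group that holds only the <b> text and nothing else.
def drop_singles(groups)
  groups.reject { |group| group.size == 1 }
end

p drop_trailing(GROUPS)
p drop_singles(GROUPS)
```

For this sample both calls leave the same three groups. They differ when a single-item group appears in the middle of the list.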

Nokogiri and Xpath: find all text between two tags

It's not trivial. In the context of the nodes you selected (the td), to get everything between two elements, you need to perform an intersection of these two sets:

  1. Set A: All the nodes preceding the first h3: //h3[1]/preceding::node()
  2. Set B: All the nodes following the first h2: //h2[1]/following::node()

To perform an intersection, you can use the Kaysian method (after Michael Kay, who proposed it). The basic formula is:

A[count(.|B) = count(B)]

Applying it to your sets, as defined above, where A = //h3[1]/preceding::node(), and B = //h2[1]/following::node(), we have:

//h3[1]/preceding::node()[ count( . | //h2[1]/following::node()) = count(//h2[1]/following::node()) ]

which will select all elements and text nodes starting with the first <br> after the </h2> tag, to the whitespace text node after the last <br>, just before the next <h3> tag.

You can easily select just the text nodes between h2 and h3 by replacing node() with text() in the expression. This one will return all text nodes (including whitespace and linebreaks) between the two headers:

//h3[1]/preceding::text()[ count( . | //h2[1]/following::text()) = count(//h2[1]/following::text()) ]
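
To see why the Kaysian formula yields an intersection, here is a small sketch with plain Ruby arrays standing in for node-sets. The node labels and the kaysian_intersection name are made up for illustration. The array union b | [x] plays the role of .|B: adding x to B leaves the count unchanged only when x is already a member of B.

```ruby
# Kaysian-style intersection on plain arrays: keep each x in `a` for which
# adding x to `b` does not make `b` any larger, i.e. x is already in `b`.
def kaysian_intersection(a, b)
  a.select { |x| (b | [x]).size == b.uniq.size }
end

# Hypothetical node labels, listed in document order.
preceding_h3 = %w[h1 p1 h2 br1 text1 br2] # plays the role of set A
following_h2 = %w[br1 text1 br2 h3 p2]    # plays the role of set B

p kaysian_intersection(preceding_h3, following_h2)
```

The result matches Ruby's built-in preceding_h3 & following_h2. XPath 1.0 has no intersection operator, which is why the count trick is needed there.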

Using Nokogiri to find element before another element

Nokogiri allows you to use xpath expressions to locate an element:

categories = []

doc.xpath("//li").each do |elem|
  categories << elem.parent.xpath("preceding-sibling::h2").last.text
end

categories.uniq!
p categories

The first part looks for all "li" elements. From each one we go to its parent (ul or ol), then look for an h2 element before that parent (preceding-sibling). There can be more than one, so we take the last (i.e., the one closest to the current position).

We need to call "uniq!" because the same h2 is collected once for every 'li' beneath it (the 'li' is the starting point).

Using your own HTML example, this code outputs:

["Destinations", "Shopping List"]

Getting Data Between Elements in Nokogiri

This works:

doc = Nokogiri::HTML(html)
doc.xpath('//maindeck[1]/text()').map { |n| [n.text.to_i, n.next.text] }
#=> [[1, "Blood Crypt"], [2, "Temple Garden"]]

How to navigate the DOM using Nokogiri

I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.

It's a single statement with XPath:

start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.

Use Nokogiri to get all nodes in an element that contain a specific attribute name


elements = @doc.xpath("//*[@*[blah]]")

This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.

The DZone snippet is confusing in that when they say

elements = @doc.xpath("//*[@*[attribute_name]]")

the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p

They also have an extra * in there, after the @.

What you want is

elements = @doc.xpath("//*[@blah]")

This will give you all the elements that have an attribute named 'blah'.

Get all anchors with attribute href = a,b or c with Nokogiri

With Nokogiri, you can always use xpath:

<!doctype html>
<html lang="en">
<head></head>
<body>
This is <a href="http://b.com">a link</a>
This is <a href="http://a.com">another link</a>
</body>
</html>


noko_page.xpath("//a[@href='http://a.com' or @href='http://b.com']")



=> [#<Nokogiri::XML::Element:0x3fc9360be368 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc9360bdcd8 name="href" value="http://b.com">] children=[#<Nokogiri::XML::Text:0x3fc93618e93c "a link">]>, #<Nokogiri::XML::Element:0x3fc93618dc08 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc93618d71c name="href" value="http://a.com">] children=[#<Nokogiri::XML::Text:0x3fc93618fd78 "another link">]>]

Nokogiri to Find All Data Attributes Using a Wildcard

You can search for img tags with an attribute that starts with "data-" using the following:

//img[@*[starts-with(name(),'data-')]]

To break this down:

  • // - Anywhere in the document
  • img - img tag
  • @* - All Attributes
  • starts-with(name(),'data-') - Attribute's name starts with "data-"

Example:

require 'nokogiri'

doc = Nokogiri::HTML(<<-END_OF_HTML)
<img src='' />
<img data-method='a' src= ''>
<img data-info='b' src= ''>
<img data-type='c' src= ''>
<img src= ''>
END_OF_HTML

imgs = doc.xpath("//img[@*[starts-with(name(),'data-')]]")

puts imgs
# <img data-method="a" src="">
# <img data-info="b" src="">
# <img data-type="c" src="">

Or, using the loop you wanted:

doc.css('img').select do |img|
  img.xpath(".//@*[starts-with(name(),'data-')]").any?
end
#[#<Nokogiri::XML::Element:0x384 name="img" attributes=[#<Nokogiri::XML::Attr:0x35c name="data-method" value="a">, #<Nokogiri::XML::Attr:0x370 name="src">]>,
# #<Nokogiri::XML::Element:0x3c0 name="img" attributes=[#<Nokogiri::XML::Attr:0x398 name="data-info" value="b">, #<Nokogiri::XML::Attr:0x3ac name="src">]>,
# #<Nokogiri::XML::Element:0x3fc name="img" attributes=[#<Nokogiri::XML::Attr:0x3d4 name="data-type" value="c">, #<Nokogiri::XML::Attr:0x3e8 name="src">]>]

UPDATE: To remove the attributes:

doc.css('img').each do |img|
  img.xpath(".//@*[starts-with(name(),'data-')]").each(&:remove)
end

puts doc.to_s
#<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#<html>
#<body>
# <img src="">
# <img src="">
# <img src="">
# <img src="">
# <img src="">
#</body>
#</html>

This can be simplified to doc.xpath("//img/@*[starts-with(name(),'data-')]").each(&:remove)


