Nokogiri: Select content between element A and B
A way-too-smart one-liner that uses recursion:
def collect_between(first, last)
  first == last ? [first] : [first, *collect_between(first.next, last)]
end
An iterative solution:
def collect_between(first, last)
  result = [first]
  until first == last
    first = first.next
    result << first
  end
  result
end
EDIT: (Short) explanation of the asterisk
It's called the splat operator. It "unrolls" an array:
array = [3, 2, 1]
[4, array] # => [4, [3, 2, 1]]
[4, *array] # => [4, 3, 2, 1]
some_method(array) # => some_method([3, 2, 1])
some_method(*array) # => some_method(3, 2, 1)
def other_method(*array); array; end
other_method(1, 2, 3) # => [1, 2, 3]
Grab everything between b elements with Nokogiri
I reduced your HTML to be less verbose. It achieves the same thing without extra text.
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<tr class="level2">
<td>
<b>word</b>
"Text I need"
<b>word</b>
"Text I need"
<i>blabla</i>
"Text I need"
<b>word</b>
"Text I need"
<i>blabla</i>
"Text I need"
<i>blabla</i>
<b>word</b>
</td>
</tr>
EOT
doc.search('td i').remove
Since the <i> nodes aren't needed, simply strip them. The resulting doc looks like:
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <tr class="level2">
# >> <td>
# >> <b>word</b>
# >> "Text I need"
# >> <b>word</b>
# >> "Text I need"
# >>
# >> "Text I need"
# >> <b>word</b>
# >> "Text I need"
# >>
# >> "Text I need"
# >>
# >> <b>word</b>
# >>
# >> </td>
# >> </tr>
# >> </body></html>
Once the <i> nodes are gone it's possible to iterate over the contents of the <td> and process their text:
text = doc.at('td').children
  .reject { |n| n.text.strip == '' }
  .slice_before { |n| n.name == 'b' }
  .map { |a| a.map { |n| n.text.strip } }
At this point text contains:
text
# => [["word", "\"Text I need\""],
# ["word", "\"Text I need\"", "\"Text I need\""],
# ["word", "\"Text I need\"", "\"Text I need\""],
# ["word"]]
Note, there's a trailing "word", which mimics the sample HTML you gave. If you know you won't have any trailing text you want to keep, you can simply pop off that element. If you think there are elements that are only single items, you could iterate over the list looking for singles and reject them also. How to handle it is up to you and for you to figure out.
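As a rough sketch of the "reject the singles" idea, using plain arrays copied from the output above so it runs without parsing anything (the names with_content and pairs are only for illustration):

```ruby
# Groups as produced by the slice_before pipeline (copied from its output).
text = [
  ['word', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word']
]

# Drop groups that contain only the <b> label and no text after it.
with_content = text.reject { |group| group.size == 1 }

# Optionally split each group into its label and the texts that follow it.
pairs = with_content.map { |label, *texts| [label, texts] }
```

Unlike pop, this also drops label-only groups in the middle of the list, which may or may not be what you want.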
Nokogiri and Xpath: find all text between two tags
It's not trivial. In the context of the nodes you selected (the td), to get everything between two elements, you need to perform an intersection of these two sets:
- Set A: All the nodes preceding the first h3: //h3[1]/preceding::node()
- Set B: All the nodes following the first h2: //h2[1]/following::node()
To perform an intersection, you can use the Kaysian method (after Michael Kay, who proposed it). The basic formula is:
A[count(.|B) = count(B)]
Applying it to your sets, as defined above, where A = //h3[1]/preceding::node(), and B = //h2[1]/following::node(), we have:
//h3[1]/preceding::node()[ count( . | //h2[1]/following::node()) = count(//h2[1]/following::node()) ]
which will select all elements and text nodes starting with the first <br> after the </h2> tag, to the whitespace text node after the last <br>, just before the next <h3> tag.
You can easily select just the text nodes between h2 and h3 by replacing node() with text() in the expression. This one will return all text nodes (including whitespace and linebreaks) between the two headers:
//h3[1]/preceding::text()[ count( . | //h2[1]/following::text()) = count(//h2[1]/following::text()) ]
Using Nokogiri to find element before another element
Nokogiri allows you to use xpath expressions to locate an element:
categories = []
doc.xpath("//li").each do |elem|
  categories << elem.parent.xpath("preceding-sibling::h2").last.text
end
categories.uniq!
p categories
The first part looks for all "li" elements; then, for each one, we look for the parent (ul, ol), and then for an element before it (preceding-sibling) which is an h2. There can be more than one, so we take the last (i.e., the one closest to the current position).
We need to call "uniq!" as we get the h2 for each 'li' (as the 'li' is the starting point).
Using your own HTML example, this code outputs:
["Destinations", "Shopping List"]
Getting Data Between Elements in Nokogiri
This works:
doc = Nokogiri::HTML(html)
doc.xpath('//maindeck[1]/text()').map { |n| [n.text.to_i, n.next.text] }
#=> [[1, "Blood Crypt"], [2, "Temple Garden"]]
How to navigate the DOM using Nokogiri
I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.
It's a single statement with XPath:
start = doc.at('div.block#X2')
start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>
start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>
This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.
Use Nokogiri to get all nodes in an element that contain a specific attribute name
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
Get all anchors with attribute href = a,b or c with Nokogiri
With Nokogiri, you can always use xpath:
<!doctype html>
<html lang="en">
<head></head>
<body>
This is <a href="http://b.com">a link</a>
This is <a href="http://a.com">another link</a>
</body>
</html>
noko_page.xpath("//a[@href='http://a.com' or @href= 'http://b.com']")
=> [#<Nokogiri::XML::Element:0x3fc9360be368 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc9360bdcd8 name="href" value="http://b.com">] children=[#<Nokogiri::XML::Text:0x3fc93618e93c "a link">]>, #<Nokogiri::XML::Element:0x3fc93618dc08 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc93618d71c name="href" value="http://a.com">] children=[#<Nokogiri::XML::Text:0x3fc93618fd78 "another link">]>]
Nokogiri to Find All Data Attributes Using a Wildcard
You can search for img tags with an attribute that starts with "data-" using the following:
//img[@*[starts-with(name(),'data-')]]
To break this down:
- // - Anywhere in the document
- img - img tag
- @* - All Attributes
- starts-with(name(),'data-') - Attribute's name starts with "data-"
Example:
require 'nokogiri'
doc = Nokogiri::HTML(<<-END_OF_HTML)
<img src='' />
<img data-method='a' src= ''>
<img data-info='b' src= ''>
<img data-type='c' src= ''>
<img src= ''>
END_OF_HTML
imgs = doc.xpath("//img[@*[starts-with(name(),'data-')]]")
puts imgs
# <img data-method="a" src="">
# <img data-info="b" src="">
# <img data-type="c" src="">
or using your desired loop
doc.css('img').select do |img|
  img.xpath(".//@*[starts-with(name(),'data-')]").any?
end
#[#<Nokogiri::XML::Element:0x384 name="img" attributes=[#<Nokogiri::XML::Attr:0x35c name="data-method" value="a">, #<Nokogiri::XML::Attr:0x370 name="src">]>,
# #<Nokogiri::XML::Element:0x3c0 name="img" attributes=[#<Nokogiri::XML::Attr:0x398 name="data-info" value="b">, #<Nokogiri::XML::Attr:0x3ac name="src">]>,
# #<Nokogiri::XML::Element:0x3fc name="img" attributes=[#<Nokogiri::XML::Attr:0x3d4 name="data-type" value="c">, #<Nokogiri::XML::Attr:0x3e8 name="src">]>]
UPDATE To remove the attributes:
doc.css('img').each do |img|
  img.xpath(".//@*[starts-with(name(),'data-')]").each(&:remove)
end
puts doc.to_s
#<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#<html>
#<body>
# <img src="">
# <img src="">
# <img src="">
# <img src="">
# <img src="">
#</body>
#</html>
This can be simplified to doc.xpath("//img/@*[starts-with(name(),'data-')]").each(&:remove)