Nokogiri: Searching for <Div> Using Xpath

Nokogiri: Searching for div using XPath

Mike Dalessio (one of the two Nokogiri core developers) gave me an answer on #nokogiri (irc.freenode.net). It looks like neither Nokogiri's CSS search nor its XPath search supports regex matching. This is his solution for matching node ids against a regular expression with Nokogiri:

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <p id='para-22'>B</p>
      <h1>Bla</h1>
      <p id='para-3'>C</p>
      <p id='para-4'>D</p>
      <div class="foo" id="eq-1_bl-1">
        <p id='para-5'>3</p>
      </div>
      <div class="bar" id="eq-1_bl-1">
        <p id='para-5'>3</p>
      </div>
    </body>
  </html>
HTML_END

# my_block is given
my_bl = "1"
# my_eq corresponds to this regex
my_eq = "[0-9]+"
# full regex to search for in node ids
full_regex = %r(eq-#{my_eq}_bl-#{my_bl})

filter_by_id = Class.new do
  attr_accessor :matches

  def initialize(regex)
    @regex = regex
    @matches = []
  end

  def filter(node_set)
    @matches += node_set.find_all { |x| x['id'] =~ @regex }
  end
end.new(full_regex)

value.css("div.foo:filter()", filter_by_id)
filter_by_id.matches.each do |node|
  puts node
end

Selecting a specific div element with Xpath and Nokogiri?

You can use the XPath:

//div[@class = 'quoteText' and following-sibling::div[1][@class = 'quoteFooter' and .//a[@href and normalize-space() =  'hard-work']]]

to select all the div elements whose class is quoteText and whose next sibling div has class quoteFooter and contains a link with the text hard-work.

How to use Nokogiri and XPath to get nodes with multiple attributes

I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.

You want:

//div[@id='bar' and @class='baz bang' and @style='display: block;']

Nokogiri - Get div with class by regex

You can use the .xpath method for that purpose, e.g.

doc.xpath("//div[@class='x13' or @class='x15']")

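Or you can use

//div[starts-with(@class, 'x') and (substring(@class, string-length(@class) - 1) = '13' or substring(@class, string-length(@class) - 1) = '15')]

Nokogiri evaluates XPath 1.0 (via libxml2), which has neither ends-with() nor regular-expression matching; both arrived in XPath 2.0. The substring() expressions above emulate ends-with().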

Extracting div elements (Nokogiri/XPath/ruby)

Try:

tids =  doc.xpath("//div[contains(concat(' ', @class, ' '),' thing ')]").collect {|node| node['data-thing-id']}
terms = doc.xpath("//div[contains(concat(' ', @class, ' '),' col_b ')]").collect {|node| node.text.strip }

tids.zip(terms).each do |tid, term|
  puts tid+" "+term
end
#  => 29966403 foobar desc

The code above uses XPath on the doc to find each div whose class list contains thing and col_b respectively. It then extracts either the data-thing-id attribute or the displayed text of each matched div, and collects the results into arrays.

Nokogiri supports both XPath and CSS, and you can find out how to fully utilize those tools by looking at their respective documentation.

Select element by attribute value with XPath in Nokogiri

Change class to @class. Remove the dot in the beginning. Then it will work.

Nokogiri: how to find a div by id and see what text it contains?

html = <<-HTML
  <html>
    <body>
      <div id="verify" style="display: none;">foobar</div>
    </body>
  </html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'

How to extract HTML links and text using Nokogiri (and XPATH and CSS)

This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.

Here are some common operations you might do when parsing links in HTML, shown in both CSS and XPath syntax.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
    <a href="http://google.com">link1</a>
</div>
<div id="block2">
    <a href="http://stackoverflow.com">link2</a>
    <a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a')          # Get all anchors via css
nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the .compact is necessary because the search for the <a> element returns the "just a bookmark" element in addition to the others.

But we can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value}   # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]')    # Get anchors w href attribute via css
nodeset.map {|element| element["href"]}  # => ["http://google.com", "http://stackoverflow.com"]

finding a specific link

To find a link within the <div id="block2">

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value          # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href']        # => "http://stackoverflow.com"

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"]     # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"]     # => "http://stackoverflow.com"

find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text     # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text     # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

a handy Nokogiri cheat sheet
a tutorial on parsing HTML with Nokogiri
interactively test CSS selector queries

Search by class in Nokogiri nodeset

Assuming that the class name is stored in class_name, I think that

doc.xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' #{class_name} ')]")

is what you're looking for.

This will match all the elements that have class_name among their classes, i.e. if class_name is 'box', then it will match both elements like div class="box" and elements like div class="box left".

If you only want to match elements like div class="box", i.e. elements that have only one class and that class is the one you're looking for, then you could use this:

doc.xpath("//*[@class=\"#{class_name}\"]")

How do I use XPath in Nokogiri?

It seems you need to read an XPath tutorial.

Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:

// - Anywhere in your XML document
table/tbody - take a table element with a tbody child
[@id="threadbits_forum_251"] - where the tbody's id attribute equals "threadbits_forum_251"
tr - and take its tr elements

So, basically, you need to know:

attributes begin with @
conditions go inside [] brackets

If I correctly understood that API, you can go with doc.at_xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/@href if there is just one <a> element. (xpath returns a NodeSet, so use at_xpath to get a single node before reading its href.)
