Nokogiri: Searching for div using XPath
Mike Dalessio (one of the two Nokogiri core developers) gave me an answer on #nokogiri
(irc.freenode.net). It looks like neither Nokogiri's CSS search nor its XPath search supports regex matching directly. This is his solution for filtering nodes by a regular expression with Nokogiri:
require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='para-1'>A</p>
<p id='para-22'>B</p>
<h1>Bla</h1>
<p id='para-3'>C</p>
<p id='para-4'>D</p>
<div class="foo" id="eq-1_bl-1">
<p id='para-5'>3</p>
</div>
<div class="bar" id="eq-1_bl-1">
<p id='para-5'>3</p>
</div>
</body>
</html>
HTML_END

# my_block is given
my_bl = "1"
# my_eq corresponds to this regex
my_eq = "[0-9]+"
# full regex to search for in node ids
full_regex = %r(eq-#{my_eq}_bl-#{my_bl})

# Handler object for the custom :filter() pseudo-class; it collects
# every node whose id matches the regex.
filter_by_id = Class.new do
  attr_accessor :matches

  def initialize(regex)
    @regex = regex
    @matches = []
  end

  def filter(node_set)
    @matches += node_set.find_all { |x| x['id'] =~ @regex }
  end
end.new(full_regex)

value.css("div.foo:filter()", filter_by_id)
filter_by_id.matches.each do |node|
  puts node
end
Selecting a specific div element with Xpath and Nokogiri?
You can use the XPath:
//div[@class = 'quoteText' and following-sibling::div[1][@class = 'quoteFooter' and .//a[@href and normalize-space() = 'hard-work']]]
to select all the div elements whose class is quoteText and whose first following div sibling has the class quoteFooter and contains a link with the text hard-work.
How to use Nokogiri and XPath to get nodes with multiple attributes
I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.
You want:
//div[@id='bar' and @class='baz bang' and @style='display: block;']
Nokogiri - Get div with class by regex
You can use the .xpath method for that purpose, e.g.
doc.xpath("//div[@class='x13' or @class='x15']")
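Or, because ends-with() is an XPath 2.0 function and Nokogiri evaluates XPath 1.0 (via libxml2), you can emulate it with substring():
//div[starts-with(@class, 'x') and (substring(@class, string-length(@class) - 1) = '13' or substring(@class, string-length(@class) - 1) = '15')]
Searching by regexp (matches()) only appears in XPath 2.0, which Nokogiri does not support.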
Extracting div elements (Nokogiri/XPath/ruby)
Try:
tids = doc.xpath("//div[contains(concat(' ', @class, ' '),' thing ')]").collect {|node| node['data-thing-id']}
terms = doc.xpath("//div[contains(concat(' ', @class, ' '),' col_b ')]").collect {|node| node.text.strip }
tids.zip(terms).each do |tid, term|
  puts tid+" "+term
end
# => 29966403 foobar desc
The code above runs an XPath query on the doc to find each of the divs that have the class thing and the class col_b respectively. It then takes each of the matched divs, extracts either the data-thing-id attribute or the displayed text contained within the element, and builds arrays out of the results.
Nokogiri supports both XPath and CSS selectors, and you can find out how to make full use of those tools in their respective documentation.
Select element by attribute value with XPath in Nokogiri
Change class to @class. Remove the dot in the beginning. Then it will work.
Nokogiri: how to find a div by id and see what text it contains?
require 'nokogiri'

html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
How to extract HTML links and text using Nokogiri (and XPATH and CSS)
This is a mini-example originally written in response to "Getting attribute's value in Nokogiri to extract link URLs", extracted here in Community Wiki style for easy reference.
Here are some common operations you might do when parsing links in HTML, shown in both css and xpath syntax.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
In the above cases, the .compact is necessary because the search for the <a> element returns the "just a bookmark" element in addition to the others.
But we can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">:
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
For completeness, here's how you'd get the text associated with a particular link:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
- a handy Nokogiri cheat sheet
- a tutorial on parsing HTML with Nokogiri
- interactively test CSS selector queries
Search by class in Nokogiri nodeset
Assuming that the class name is stored in class_name, I think that
doc.xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' #{class_name} ')]")
is what you're looking for.
This will match all the elements that have class_name among their classes, i.e. if class_name is 'box', then it will match both elements like div class="box" and elements like div class="box left".
If you only want to match elements like div class="box", i.e. elements that have only one class and that class is the one you're looking for, then you could use this:
doc.xpath("//*[@class=\"#{class_name}\"]")
How do I use XPath in Nokogiri?
It seems you need to read an XPath tutorial.
Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:
- //: anywhere in your XML document
- table/tbody: take a table element with a tbody child
- [@id="threadbits_forum_251"]: where the id attribute equals "threadbits_forum_251"
- tr: and take its tr elements
So, basically, you need to know:
- attributes begin with @
- conditions go inside [] brackets
If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/@href if there is just one <a> element.