How do I use XPath in Nokogiri?
It seems you need to read an XPath tutorial.
Your //table/tbody[@id="threadbits_forum_251"]/tr
expression means:
//
- anywhere in your XML document
table/tbody
- take a table element with a tbody child
[@id="threadbits_forum_251"]
- where the id attribute equals "threadbits_forum_251"
/tr
- and take its tr elements
So, basically, you need to know:
- attribute names begin with @
- conditions go inside [] brackets
If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/@href if there is just one <a> element. (xpath returns a NodeSet, so use at_xpath to get a single node before reading its attribute.)
ruby nokogiri HTML table scraping using xpath
Thanks to taro's comment, I was able to solve the issue with a little effort.
Here goes the correct code:
#!/usr/bin/ruby -w
require 'rubygems'
require 'nokogiri'
page1 = Nokogiri::HTML(open('test_simple.html'))
a = page1.xpath("/html/body/table/tr[2]/td[2]").text
p a
Nokogiri parse XML with xpath
With method 2, try using:
d.xpath('//feed/entry[title[node()]]')
This will return a NodeSet containing the entry nodes that have a non-empty title. You can then iterate over the set however you like.
How to get the content of an XML node using XPath and Nokogiri
This is the Synopsis example in the README file for Nokogiri showing one way to do it using CSS, XPath or a hybrid:
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML::Document for the page we're interested in...
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs)
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end
####
# Search for nodes by xpath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end
####
# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end
XPath along with nokogiri; tutorials/examples?
The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.
The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.
Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including inserting tbody, as you saw. Instead, use Ruby's OpenURI, curl or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat to print it to the screen. That way you see the file exactly as it was served, with no changes.
Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.
Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, along with at_css and at_xpath, returns the first node that matches the CSS or XPath accessor. In all those methods, the xpath variants accept an XPath, and the css ones take a CSS accessor.
Once you have a node, generally you'll want to do one of two things to it: extract an attribute or get its text/content. You can easily get an attribute using node['attribute_to_get'] and the text using text.
Using those methods we can search for all the links in a page and return their text and related href, using something like:
require 'awesome_print'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
ap doc.search('a').map{ |a| [a['href'], a.text] }[0, 5]
Which outputs:
[
[0] [
[0] "/",
[1] ""
],
[1] [
[0] "/domains/",
[1] "Domains"
],
[2] [
[0] "/numbers/",
[1] "Numbers"
],
[3] [
[0] "/protocols/",
[1] "Protocols"
],
[4] [
[0] "/about/",
[1] "About IANA"
]
]
Select element by attribute value with XPath in Nokogiri
Change class to @class. Remove the dot in the beginning. Then it will work.
How to extract HTML links and text using Nokogiri (and XPATH and CSS)
This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.
Here are some common operations you might do when parsing links in HTML, shown both in css and xpath syntax.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the <a>
elements and then keep only the ones that have an href
attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
In the above cases, the .compact
is necessary because the search for the <a>
element returns the "just a bookmark" element in addition to the others.
But we can use a more refined search to find just the elements that contain an href
attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">:
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath
or at_css
instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
For completeness, here's how you'd get the text associated with a particular link:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
- a handy Nokogiri cheat sheet
- a tutorial on parsing HTML with Nokogiri
- interactively test CSS selector queries
Using nokogiri xpath to access nested elements within an xmlns
It’s a namespacing issue:
datasource.xpath(
'subsystem:connection-url',
'subsystem' => 'urn:jboss:domain:datasources:1.2')
#⇒ [#<... name="connection-url" namespace=...
How to use Nokogiri and XPath to get nodes with multiple attributes
I can get divs with a single id
attribute with no problem, but I can't
figure out a way of getting Nokogiri
to grab divs with both ids and
classes.
You want:
//div[@id='bar' and @class='baz bang' and @style='display: block;']
how to use regex in nokogiri xpath
XPath 1.0 has no regex support, but you can apply the XPath below:
//div[substring(@class, string-length(@class) - 8)="signature"]
This returns each div node whose class attribute has "signature" as its last 9 characters.