How to Use XPath in Nokogiri

How do I use XPath in Nokogiri?

It seems you need to read an XPath tutorial.

Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:

  • // - Anywhere in your XML document
  • table/tbody - take a table element with a tbody child
  • [@id="threadbits_forum_251"] - where the tbody's id attribute equals "threadbits_forum_251"
  • tr - and take its tr elements

So, basically, you need to know:

  • attributes begin with @
  • conditions go inside [] brackets

If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"], or use td[3]/div[1]/a/@href if there is just one <a> element.

ruby nokogiri HTML table scraping using xpath

Thanks to taro's comment, I was able to solve the issue with a little effort.

Here goes the correct code:

#!/usr/bin/ruby -w
require 'rubygems'
require 'nokogiri'
page1 = Nokogiri::HTML(open('test_simple.html'))
a = page1.xpath("/html/body/table/tr[2]/td[2]").text
p a

Nokogiri parse XML with xpath

With method 2, try using:

d.xpath('//feed/entry[title[node()]]')

This will return a NodeSet containing the entry nodes that have a non-empty title. Then you can iterate over the set however you like.

How to get the content of an XML node using XPath and Nokogiri

This is the Synopsis example in the README file for Nokogiri showing one way to do it using CSS, XPath or a hybrid:

require 'nokogiri'
require 'open-uri'

# Get a Nokogiri::HTML::Document for the page we’re interested in...
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs.)

doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))

# Do funky things with it using Nokogiri::XML::Node methods...

####
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end

####
# Search for nodes by xpath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end

####
# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end

XPath along with nokogiri; tutorials/examples?

The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.

The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.

Also, don't trust the XPath or CSS accessor provided by a browser. They do all sorts of fixups to the HTML source, including tbody, like you saw. Instead, use Ruby's OpenURI or curl or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat it to the screen. There's no chance of having any changes to the file that way.

Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.

Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, at_css and at_xpath return the first node that matches the CSS or XPath accessor. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.

Once you have a node, generally you'll want to do one of two things to it: either extract an attribute or get its text/content. You can easily get an attribute using [attribute_to_get] and the text using text.

Using those methods we can search for all the links in a page and return their text and related href, using something like:

require 'awesome_print'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
ap doc.search('a').map{ |a| [a['href'], a.text] }[0, 5]

Which outputs:

[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]

Select element by attribute value with XPath in Nokogiri

Change class to @class. Remove the dot in the beginning. Then it will work.

How to extract HTML links and text using Nokogiri (and XPATH and CSS)

This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.

Here are some common operations you might do when parsing links in HTML, shown both in css and xpath syntax.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the .compact is necessary because the search for the <a> element returns the "just a bookmark" element in addition to the others.

But we can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]

finding a specific link

To find a link within the <div id="block2">

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"

find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

  • a handy Nokogiri cheat sheet
  • a tutorial on parsing HTML with Nokogiri
  • interactively test CSS selector queries

Using nokogiri xpath to access nested elements within an xmlns

It’s a namespacing issue:

datasource.xpath(
'subsystem:connection-url',
'subsystem' => 'urn:jboss:domain:datasources:1.2')
#⇒ [#<... name="connection-url" namespace=...

How to use Nokogiri and XPath to get nodes with multiple attributes

I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.

You want:

//div[@id='bar' and @class='baz bang' and @style='display: block;']

how to use regex in nokogiri xpath

You can apply the XPath below:

//div[substring(@class, string-length(@class) - 8)="signature"]

This returns each div node whose class attribute has "signature" as its last 9 characters.


