Getting Attribute's Value in Nokogiri to Extract Link Urls

require 'nokogiri'

html = <<HTML
<div id="block">
<a href="http://google.com">link</a>
</div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]

Or, if you want to be more specific about the div:

>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> "http://google.com"

How to extract HTML links and text using Nokogiri (and XPATH and CSS)

This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.

Here are some common operations you might do when parsing links in HTML, shown in both CSS and XPath syntax.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the .compact is necessary because the search for the <a> element also returns the "just a bookmark" element, which has no href attribute and therefore maps to nil.

But we can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get the href attributes of anchors via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]') # Get anchors with an href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]

finding a specific link

To find a link within the <div id="block2">:

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"

find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

  • a handy Nokogiri cheat sheet
  • a tutorial on parsing HTML with Nokogiri
  • interactively test CSS selector queries

Get link and href text from html doc with Nokogiri & Ruby?

Here's a one-liner:

Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Split up a bit to be arguably more readable:

h = {}
doc.xpath('//a[@href]').each do |link|
  h[link.text.strip] = link['href']
end
puts h

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

How can I get the absolute URL when extracting links using Nokogiri?

Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:

absolute_uri = URI.join( page_url, href ).to_s

Seen in action:

require 'uri'

# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'

# A variety of links to test.
hrefs = %w[
http://zork.com/ http://zork.com/#id
http://zork.com/bar http://zork.com/bar#id
http://zork.com/bar/ http://zork.com/bar/#id
http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
/bar /bar#id
/bar/ /bar/#id
/bar/jim.html /bar/jim.html#id
jim.html jim.html#id
../jim.html ../jim.html#id
../ ../#id
#id
]

hrefs.each do |href|
  root_href = URI.join(page_url, href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.

Thanks to @pguardiario for the improvement.

How to extract links by using nokogiri with filtering

The simplest way is to use Ruby's URI class and use the extract method:

require 'uri'

html = '
<html>
<body>
http://foo.bar.com
mailto://foo@bar.com
</html>
'
URI.extract(html) # => ["http://foo.bar.com", "mailto://foo@bar.com"]

This doesn't parse the HTML; instead it uses a regex to look for URL-like patterns. It's a little error-prone, but simple and fast. Note that current Ruby versions mark URI.extract as obsolete.

Beyond that, it's easy to navigate through HTML or XML and find URLs if you know where they are. Otherwise you're just shooting in the dark and should use something like URI.extract, because it's well tested, recognizes a number of patterns, and lets you customize what you want to find. Not using it means reinventing that wheel.
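
As one example of that customization, URI.extract accepts an optional list of schemes as its second argument. A small sketch, assuming plain text input (the https URL is an added, made-up sample):

```ruby
require 'uri'

text = <<TEXT
http://foo.bar.com
mailto://foo@bar.com
https://secure.example.com/path
TEXT

# Only URLs whose scheme is in the list are returned.
web_urls = URI.extract(text, %w[http https])
```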

Your test, looking for a/@href, will find anchors with href attributes, but those aren't necessarily going to be URLs, since JavaScript actions can live there too.

If using Nokogiri and only wanting to look in <a> hrefs, I'd do something like:

require 'nokogiri'

html = '
<html>
<body>
<p><a href="http://foo.bar.com/index.html">foo</a></p>
<p><a href="mailto://foo@bar.com">bar</a></p>
</html>
'
doc = Nokogiri::HTML(html)
doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }
# => ["http://foo.bar.com/index.html"]

This uses CSS instead of XPath, which usually results in a more readable selector.

n['href'] is Nokogiri shorthand for getting the value of an attribute of a node.

[/\.html$/] is a String shortcut for applying a regex match to that string; it returns the matched text, or nil if there is no match.

Looking at what you wrote:

page.xpath("//a/@href").map{|item| item.value if item.value =~ /.*.html$/ }.compact

You're having to use compact to clean out unwanted/unexpected nil values in your array because of the if conditional in map. Don't do that; it's reactionary and defensive programming when you don't need to write it that way. Instead, use select or reject to handle your conditional test, which then feeds only acceptable nodes to map, which then transforms them:

doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }

How to get the value of an attribute using Nokogiri

It's idiomatic to access attribute values by treating the node as a hash:

require 'nokogiri'

doc = Nokogiri::HTML('<div class="foo"></div>')
doc.at('div')['class'] # => "foo"

And, just like a hash, you can assign to it too:

doc.at('div')['class'] = 'bar'
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><div class="bar"></div></body></html>

See [] and []= under "Modifying Nodes and Attributes" in the documentation.


