Get Link and Href Text from HTML Doc with Nokogiri & Ruby

Get link and href text from HTML doc with Nokogiri & Ruby?

Here's a one-liner:

Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Split up a bit to be arguably more readable:

h = {}
doc.xpath('//a[@href]').each do |link|
h[link.text.strip] = link['href']
end
puts h

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

How to extract HTML links and text using Nokogiri (and XPath and CSS)

This is a mini-example originally written in response to "Getting attribute's value in Nokogiri to extract link URLs", extracted here in Community Wiki style for easy reference.

Here are some common operations you might do when parsing links in HTML, shown in both CSS and XPath syntax.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the .compact is necessary because the search for <a> elements also returns the "just a bookmark" anchor, which has no href attribute and therefore maps to nil.
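
To see why, here's the same map without the .compact (note the nil from the bookmark-only anchor):

nodeset = doc.xpath('//a')
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com", nil]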

But we can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]

finding a specific link

To find a link within the <div id="block2">:

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"

find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

  • a handy Nokogiri cheat sheet
  • a tutorial on parsing HTML with Nokogiri
  • interactively test CSS selector queries

Extract a link with Nokogiri from the text of the link?

Original:

text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT

link_text = "site 1"

doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s

Updated:

As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic prefix matching there's an XPath function called starts-with, which you can use like this (links whose text starts with "s"):

doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)
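
XPath 1.0 also has a contains() function if you want substring matching rather than a prefix match; with the sample text above it matches all three links:

doc.xpath("//a[contains(text(), 'site')]/@href").map(&:to_s)
# => ["http://example.org/site/1/", "http://example.org/site/2/", "http://example.org/site/3/"]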

Extract links (URLs) from <a href> HTML tags with Nokogiri in Ruby?

You can do it like this:

doc = Nokogiri::HTML.parse(<<-HTML_END)
<div class="heat">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
<div class="wave">
<a href='http://example.org/site/4/'>site 4</a>
<a href='http://example.org/site/5/'>site 5</a>
<a href='http://example.org/site/6/'>site 6</a>
</div>
HTML_END

l = doc.css('div.heat a').map { |link| link['href'] }

This solution finds all anchor elements using a css selector and collects their href attributes.
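
For comparison, roughly the same query in XPath (a sketch, assuming the class attribute is exactly "heat"):

l = doc.xpath('//div[@class="heat"]/a/@href').map(&:value)
# => ["http://example.org/site/1/", "http://example.org/site/2/", "http://example.org/site/3/"]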

How do I access the href parameter for a link?

You're almost done:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<a href="http://sample.com">
<span class="highlight">Web</span>
</a>
Sample text
EOT
obj = doc.search('a')

obj.first['href']
=> "http://sample.com"

If there's only one <a> tag in the document, you could simplify the code using at:

obj = doc.at('a')['href']

would return the same value.

I'm trying to extract each <a href> link on an HTML page for evaluation with Nokogiri and XPath

Your XPath of //a pulls back the entire <a> elements, including their text content. You can use @attrname to select attributes instead. For example

//a/@href

will get you the href of every <a> in the document.
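
For example, assuming doc is your parsed Nokogiri document, the attribute nodes can be collected like this:

doc.xpath('//a/@href').map(&:value) # => an array of href strings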

How to extract links by using Nokogiri with filtering

The simplest way is to use Ruby's URI class and use the extract method:

require 'uri'

html = '
<html>
<body>
http://foo.bar.com
mailto://foo@bar.com
</html>
'
URI.extract(html) # => ["http://foo.bar.com", "mailto://foo@bar.com"]

This doesn't parse the HTML, but instead uses regex to look for URL-like patterns. It's a little error-prone, but simple and fast.

Beyond that, it's easy to navigate through XML and find URLs if you know where they are. Otherwise you're shooting in the dark, and you should use something like URI.extract because it's well tested, recognizes a number of patterns, and lets you customize what you want to find. Not using it means reinventing that wheel.
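
For example, URI.extract's optional second argument restricts which schemes are returned; with the html string above:

URI.extract(html, ['http']) # => ["http://foo.bar.com"]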

Your test, looking for a/@href, will find anchors with href attributes, but those aren't necessarily going to be URLs, since JavaScript actions can live there too.
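
For example (a small made-up snippet; the href value here is hypothetical):

js_html = '<a href="javascript:void(0)">open popup</a>'
Nokogiri::HTML(js_html).at('a[href]')['href'] # => "javascript:void(0)"

That anchor has an href, so it passes an a[href] test even though it isn't a URL you can fetch.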

If using Nokogiri and only wanting to look in <a> hrefs, I'd do something like:

require 'nokogiri'

html = '
<html>
<body>
<p><a href="http://foo.bar.com/index.html">foo</a></p>
<p><a href="mailto://foo@bar.com">bar</a></p>
</html>
'
doc = Nokogiri::HTML(html)
doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }
# => ["http://foo.bar.com/index.html"]

This uses CSS instead of XPath, which usually results in a more readable selector.
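
For comparison, an XPath version of the same chain behaves identically (a sketch, using the same doc as above):

doc.xpath('//a[@href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }
# => ["http://foo.bar.com/index.html"]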

n['href'] is Nokogiri shorthand for getting the value of an attribute of a node.

[/\.html$/] is a String shortcut for applying a regex match to that string.
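
Both shortcuts in isolation, using the first matching node from the doc parsed above:

n = doc.at('a[href]')
n.attribute('href').value    # => "http://foo.bar.com/index.html" (long form of n['href'])
n['href'][/\.html$/]         # => ".html" (truthy); nil when there's no match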

Looking at what you wrote:

page.xpath("//a/@href").map{|item| item.value if item.value =~ /.*.html$/ }.compact

You're having to use compact to clean out unwanted nil values in your array because of the if conditional inside map. Don't do that; it's reactive, defensive programming you don't need here. Instead, use select or reject to handle the conditional test, which feeds only acceptable nodes to map, which then transforms them:

doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }

Ruby: How do I parse links with Nokogiri with content/text all the same?

The problem is with how results is defined. results is an array of Nokogiri::XML::Element:

results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element

When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:

puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."

Given that you want the href attribute of each link, you should collect that in the results instead:

results = hire_links.map{ |link| link['href'] }

Assuming you want each href/link displayed as a line in the file, you can join the array:

File.write('./jobs.html', results.join("\n"))

The modified script:

require 'nokogiri'
require 'open-uri'

def find_jobs
  # Kernel#open no longer opens URLs on Ruby 3+, so use URI.open from open-uri
  doc = Nokogiri::HTML(URI.open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...

Ruby: extracting links from HTML

Nokogiri always stores strings internally as UTF-8. Methods that return text values will always return UTF-8 encoded strings.

You have a conflict between UTF-8 and CP850 (are you working on Windows?).
You can adapt your File.read(input_filename) call.

Try

File.read(input_filename, :encoding => 'cp850:utf-8') 

if your HTML files are Windows (CP850) files.

If your HTML files are already UTF-8, then try:

File.read(input_filename, :encoding => 'utf-8') 

Another solution may be setting Encoding.default_external = 'utf-8' at the beginning of your code. (I wouldn't recommend it; use it only for small scripts.)
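
Putting it together, a minimal sketch (assuming input_filename points at a CP850-encoded Windows HTML file, as in the question):

require 'nokogiri'

html = File.read(input_filename, :encoding => 'cp850:utf-8') # transcode while reading
html.encoding                        # => #<Encoding:UTF-8>
doc = Nokogiri::HTML(html)
doc.css('a').map { |a| a['href'] }   # link extraction now runs on UTF-8 strings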


