Get link text and href from an HTML doc with Nokogiri & Ruby?
Here's a one-liner:
Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]
#=> {"Foo"=>"#foo", "Bar"=>"#bar"}
Split up a bit to be arguably more readable:
h = {}
doc.xpath('//a[@href]').each do |link|
h[link.text.strip] = link['href']
end
puts h
#=> {"Foo"=>"#foo", "Bar"=>"#bar"}
How to extract HTML links and text using Nokogiri (and XPath and CSS)
This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.
Here are some common operations you might perform when parsing links in HTML, shown in both CSS and XPath syntax.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use XPath or CSS to find all the <a> elements, then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
In the above cases, the .compact is necessary because the search for <a> elements also returns the "just a bookmark" element. But we can use a more refined search to find just the elements that have an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within <div id="block2">:
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its URL? A little XPath-fu (or CSS-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
For completeness, here's how you'd get the text associated with a particular link:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
- a handy Nokogiri cheat sheet
- a tutorial on parsing HTML with Nokogiri
- interactively test CSS selector queries
Extract a link with Nokogiri from the text of a link?
Original:
text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT
link_text = "site 1"
doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s
Updated:
As far as I know, Nokogiri's XPath implementation doesn't support regular expressions, but for basic prefix matching there's a function called starts-with that you can use like this (links whose text starts with "s"):
doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)
Extract links (URLs) from <a href> HTML tags with Nokogiri in Ruby?
You can do it like this:
doc = Nokogiri::HTML.parse(<<-HTML_END)
<div class="heat">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
<div class="wave">
<a href='http://example.org/site/4/'>site 4</a>
<a href='http://example.org/site/5/'>site 5</a>
<a href='http://example.org/site/6/'>site 6</a>
</div>
HTML_END
l = doc.css('div.heat a').map { |link| link['href'] }
This solution finds all anchor elements inside div.heat using a CSS selector and collects their href attributes.
How do I access the href parameter for a link?
You're almost done:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a href="http://sample.com">
<span class="highlight">Web</span>
</a>
Sample text
EOT
obj = doc.search('a')
obj.first['href']
=> "http://sample.com"
If there's only one <a> tag in the document, you could simplify the code using at instead:
obj = doc.at('a')['href']
would return the same value.
I'm trying to extract each a href link on an html page for evaluation w/ nokogiri and xpath
Your XPath of //a is pulling back the entire elements, including their text content. You can use @attrname to access attributes. For example:
//a/@href
will get you the href of every <a> in the document.
How to extract links by using nokogiri with filtering
The simplest way is to use Ruby's URI class and its extract method:
require 'uri'
html = '
<html>
<body>
http://foo.bar.com
mailto://foo@bar.com
</html>
'
URI.extract(html) # => ["http://foo.bar.com", "mailto://foo@bar.com"]
This doesn't parse the HTML; instead, it uses a regex to look for URL-like patterns. It's a little error-prone, but simple and fast.
Beyond that, it's easy to navigate through XML and find URLs if you know where they are; otherwise you're shooting in the dark and should use something like URI.extract, because it's well tested, recognizes a number of patterns, and lets you customize what you want to find. Not using it will cause you to reinvent that wheel.
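URI.extract also takes a list of schemes to restrict what it matches, which is handy for keeping http(s) URLs and dropping mailto: and friends. A stdlib-only sketch (the sample text is made up):

```ruby
require 'uri'

text = 'Docs live at http://example.com/docs and you can mail mailto:team@example.com for help.'

# Passing a scheme list restricts extraction to just those URL types.
urls = URI.extract(text, ['http', 'https'])
puts urls.inspect
#=> ["http://example.com/docs"]
```

Note that URI.extract happily grabs trailing punctuation (a comma right after a URL becomes part of the match), which is one of the "error-prone" aspects mentioned above.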
Your test, looking for a/@href, will find anchors with href attributes, but those aren't necessarily going to be URLs, since JavaScript actions can live there too.
If using Nokogiri and only wanting to look in <a> hrefs, I'd do something like:
require 'nokogiri'
html = '
<html>
<body>
<p><a href="http://foo.bar.com/index.html">foo</a></p>
<p><a href="mailto://foo@bar.com">bar</a></p>
</html>
'
doc = Nokogiri::HTML(html)
doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }
# => ["http://foo.bar.com/index.html"]
This uses CSS instead of XPath, which usually results in a more readable selector.
n['href'] is Nokogiri shorthand for getting the value of an attribute on a node.
n['href'][/\.html$/] is a String shortcut for applying a regex match to that string.
Looking at what you wrote:
page.xpath("//a/@href").map{|item| item.value if item.value =~ /.*.html$/ }.compact
You're having to use compact to clean out unwanted nil values in your array because of the if conditional inside map. Don't do that; it's reactionary, defensive programming you don't need. Instead, use select or reject to handle your conditional test, which feeds only acceptable nodes to map, which then transforms them:
doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }
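The difference between the two pipelines is easy to see with plain arrays (the sample file names are made up):

```ruby
hrefs = ['index.html', 'app.js', 'about.html', 'logo.png']

# A conditional inside map leaves a nil for every non-matching item,
# which then has to be swept up with compact afterwards:
with_nils = hrefs.map { |h| h if h[/\.html$/] }
#=> ["index.html", nil, "about.html", nil]

# select feeds only matching items forward, so no compact is needed:
pages = hrefs.select { |h| h[/\.html$/] }
#=> ["index.html", "about.html"]

puts pages.inspect
```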
Ruby: How do I parse links with Nokogiri with content/text all the same?
The problem is with how results is defined: results is an array of Nokogiri::XML::Element objects:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
results = hire_links.map { |link| link['href'] }
File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Ruby extracting links from HTML
Nokogiri always stores strings internally as UTF-8, and methods that return text values will always return UTF-8 encoded strings.
You have a conflict between UTF-8 and cp850 (are you working on Windows?), so you may need to adapt your File.read(input_filename).
If your HTML files are Windows (cp850) files, try:
File.read(input_filename, :encoding => 'cp850:utf-8')
If your HTML files are already UTF-8, then try:
File.read(input_filename, :encoding => 'utf-8')
Another option is Encoding.default_external = 'utf-8' at the beginning of your code (I wouldn't recommend it; use it only for small scripts).
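The cp850-to-UTF-8 transcode can be checked with a tiny stdlib-only sketch; the byte 0x81 is "ü" in cp850, and the temp file stands in for your HTML input:

```ruby
require 'tempfile'

# Tempfile.create returns the block's value, so `text` ends up holding
# the transcoded string after the temp file is cleaned up.
text = Tempfile.create('cp850-sample') do |f|
  f.binmode
  f.write "M\x81nchen".b   # "München" encoded as cp850 bytes (0x81 = ü)
  f.flush
  # Read as cp850 and transcode to UTF-8 in one step.
  File.read(f.path, encoding: 'cp850:utf-8')
end

puts text.encoding  #=> UTF-8
puts text           #=> München
```

The 'external:internal' form of the encoding option tells Ruby what the file's bytes are (cp850) and what encoding you want the resulting string in (UTF-8).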