Getting All Links of a Webpage Using Ruby

Why don't you use groups in your pattern? For example:

/http[s]?:\/\/(.+)/i

That way the first group will already contain the link you searched for.
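
As a minimal, made-up sketch of that idea (the sample string and the tightened character class below are illustrative assumptions, not part of the original question):

html = 'Visit <a href="http://example.com/a">a</a> or https://example.com/b today'

# Wrap the whole URL in a group so String#scan hands back just the URLs;
# the character class stops each match at whitespace, quotes, and angle brackets.
urls = html.scan(/(https?:\/\/[^\s"'<>]+)/i).flatten
# => ["http://example.com/a", "https://example.com/b"]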

Extract all links from web page

You can do this using Ruby's built-in URI class. Look at the extract method.

It's not as smart as what you could write using Nokogiri and looking in anchors, images, scripts, onclick handlers, etc., but it's a good and fast starting point.

For instance, looking at the content of this question's page:

require 'open-uri'
require 'uri'

# Pull every URI-looking string out of the page, then keep only the HTTP(S) ones.
URI.extract(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "https://stackauth.com",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "http://schema.org/Article",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://stackexchange.com/legal/privacy-policy'",
# "http://stackexchange.com/legal/terms-of-service'",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
# "https:",
# "https:'==document.location.protocol,",
# "https://ssl",
# "http://www",
# "https://secure",
# "http://edge",
# "https:",
# "https://sb",
# "http://b"]

URI.extract returns a lot of other entries too, but the simple grep(/^https?:/) pattern filters the list down to HTTP and HTTPS URLs.
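
The same trick works if you want to narrow the list further. As a hedged variation (the host in the pattern below is only an example), a stricter regex keeps just the links pointing at one site:

require 'open-uri'
require 'uri'

html = open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page').read

# Keep only the Stack Overflow links; grep discards everything else.
URI.extract(html).grep(%r{\Ahttps?://stackoverflow\.com}i)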

A simple starting point with Nokogiri is:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map { |tag|
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}

urls
# => ["//stackexchange.com/sites",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "//stackoverflow.com",
# "//meta.stackoverflow.com",
# "//careers.stackoverflow.com",
# "//stackexchange.com",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/tour",
# "/help",
# "//careers.stackoverflow.com",
# "/",
# "/questions",
# "/tags",
# "/about",
# "/users",
# "/questions/ask",
# "/about",
# nil,
# "/questions/21069348/extract-all-links-from-web-page",
# nil,
# nil,
# "#",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/q/21069348",
# "/posts/21069348/edit",
# "/users/2886945/ivan-denisov",
# "/users/2886945/ivan-denisov",
# "/users/2767755/arup-rakshit",
# "/users/2886945/ivan-denisov",
# nil,
# nil,
# "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
# nil,
# nil,
# nil,
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "/a/21069456",
# "/posts/21069456/revisions",
# "/users/128421/the-tin-man",
# "/users/128421/the-tin-man",
# nil,
# nil,
# nil,
# nil,
# "http://regex101.com/r/hN4dI0",
# "/a/21069536",
# "/users/1214800/r3mus",
# "/users/1214800/r3mus",
# nil,
# nil,
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
# "#",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/legal/terms-of-service",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/questions/ask",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "?lastactivity",
# "/q/21052437",
# "/questions/21052437/are-these-two-lines-the-same-vs",
# "/q/6700367",
# "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/q/430966",
# "/questions/430966/regex-for-links-in-html-text",
# "/q/3703712",
# "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
# "/q/5120171",
# "/questions/5120171/extract-links-from-a-web-page",
# "/q/6816138",
# "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
# "/q/10177910",
# "/questions/10177910/php-regular-expression-extracting-html-links",
# "/q/10217857",
# "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
# "/q/11300496",
# "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
# "/q/11307491",
# "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
# "/q/17712493",
# "/questions/17712493/extract-links-from-bbcode-with-ruby",
# "/q/20290869",
# "/questions/20290869/strip-away-html-tags-from-extracted-links",
# "//stackexchange.com/questions?tab=hot",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "#",
# "/feeds/question/21069348",
# "/about",
# "/help",
# "/help/badges",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# nil,
# "/contact",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
# "/posts/21069348/ivc/8228",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]

That uses a case statement to apply a bit of "smarts" to determine which attribute should be retrieved from each type of tag. More work would be needed, since an anchor could use an onclick handler, and other tags can be wired up to JavaScript events; a rough sketch of one extra step follows.
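
As a hedged illustration of that extra work, here is one way you might also pull URLs out of onclick attributes. The selector and the pattern are assumptions: handlers can build URLs in many other ways, so this only catches the easy cases.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page').read)

# Look inside every element that carries an onclick attribute and pull out
# anything that looks like an absolute URL embedded in the handler text.
onclick_urls = doc.search('*[onclick]').flat_map { |tag|
  tag['onclick'].scan(%r{https?://[^\s"'()<>]+})
}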

Extracting all URLs from a page using Ruby

Remember that a URL doesn't have to start with "http"; it could be a relative URL, i.e. a path relative to the current page. IMO it is best to use Nokogiri to parse the HTML:

require 'open-uri'
require 'nokogiri'
reqt = open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end

But if you really want to find only the absolute URLs, add a simple condition:

 puts a.attr('href') if a.attr('href') =~ /^http/i
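
And if you want every link, but expressed as absolute URLs, one sketch (assuming each href should be resolved against the page you fetched) is to join it with that base:

require 'open-uri'
require 'nokogiri'
require 'uri'

base = "http://www.google.com"
doc  = Nokogiri::HTML(open(base))

doc.xpath('//a[@href]').each do |a|
  begin
    # Resolve relative hrefs (e.g. "/intl/en/about.html") against the base URL.
    puts URI.join(base, a.attr('href'))
  rescue URI::Error
    # Skip hrefs that aren't resolvable URLs, such as "javascript:..." handlers.
  end
end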

How to select all links on a page and store them in an array in Capybara?

When(/^I search for all links on homepage$/) do
  within(".wrapper") do
    all_links = all("a").map(&:text) # get text for all links
    all_links.each do |i|
      puts i
    end
  end
end
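
If you want the URLs themselves rather than the link text, a small variation on the step above (keeping the same, purely illustrative ".wrapper" scope) is:

When(/^I collect all link URLs on homepage$/) do
  within(".wrapper") do
    hrefs = all("a").map { |link| link[:href] } # href attribute for every link
    hrefs.each do |href|
      puts href
    end
  end
end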

Scrape URLs From Web

There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:

html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a
href="http://www.example.com/foobar"
>foobar</a></p>
<p><a
href="http://www.example.com/foobar2"
>foobar2</p>
EOT

require 'nokogiri'

doc = Nokogiri::HTML(html)

links = Hash[
  *doc.search('a').map { |a|
    [
      a['href'],
      a.content
    ]
  }.flatten
]

require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >> "http://www.example.com/foo1"=>"foo1",
# >> "http://www.example.com/foo2"=>"foo2",
# >> "http://www.example.com/bar"=>"bar",
# >> "http://www.example.com/foobar"=>"foobar",
# >> "http://www.example.com/foobar2"=>"foobar2"}

This returns a hash of URLs as keys with the related content of the <a> tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:

links = doc.search('a').map { |a|
  [
    a['href'],
    a.content
  ]
}

which results in:

# >> [["http://www.example.com/foo", "foo"],
# >> ["http://www.example.com/foo1", "foo1"],
# >> ["http://www.example.com/foo2", "foo2"],
# >> ["http://www.example.com/bar", "bar"],
# >> ["http://www.example.com/foobar", "foobar"],
# >> ["http://www.example.com/foobar2", "foobar2"]]

I used the CSS selector 'a' to locate the tags. I could have used 'a[href]' if I wanted to grab only real links, ignoring named anchors; a quick illustration of the difference follows.
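
As a quick, made-up example of that difference:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<a name="top">Top of page</a>
<a href="http://www.example.com">Example</a>
EOT

doc.search('a').size       # => 2 (includes the named anchor)
doc.search('a[href]').size # => 1 (only tags that actually link somewhere)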

Regexes are very fragile when dealing with HTML and XML, because those markup formats are too free-form; they can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your code is at the mercy of whoever does generate it when you use a regex; a simple change in the file can break the pattern badly, resulting in a continual maintenance headache.

A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML, but the code didn't care. Compare the simplicity of the parser version with a regex solution and think of long-term maintainability.


