Getting all links of a webpage using Ruby
Why not use a capture group in your pattern?
For example:
/http[s]?:\/\/(.+)/i
The first group will then already hold the link you searched for (minus the scheme, since http:// sits outside the group).
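A minimal sketch of the capture-group idea, using String#scan. The sample text and the \S+ group are assumptions made for illustration: the group here wraps the whole URL and stops at whitespace, whereas (.+) would run on to the end of the line.

```ruby
# Sample input, made up for this sketch.
text = 'Visit http://example.com/a and HTTPS://example.org/b?x=1 today.'

# One capture group around the whole URL; \S+ stops at the first whitespace.
pattern = %r{(https?://\S+)}i

# String#scan returns one array of captured groups per match,
# so flatten turns the result into a plain list of links.
links = text.scan(pattern).flatten

puts links
# http://example.com/a
# HTTPS://example.org/b?x=1
```

A pattern like this still picks up trailing punctuation when a URL ends a sentence, so it is only a starting point.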
Extract all links from web page
You can do this using Ruby's built-in URI class. Look at the extract method.
It's not as smart as what you could write using Nokogiri and looking in anchors, images, scripts, onclick handlers, etc., but it's a good and fast starting point.
For instance, looking at the content of this question's page:
require 'open-uri'
require 'uri'
# Kernel#open no longer fetches URLs as of Ruby 3.0; open-uri provides URI.open instead.
URI.extract(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read).grep(/^https?:/)
# => ["http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page",
# "https://stackauth.com",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "http://schema.org/Article",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://stackexchange.com/legal/privacy-policy'",
# "http://stackexchange.com/legal/terms-of-service'",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif",
# "https:",
# "https:'==document.location.protocol,",
# "https://ssl",
# "http://www",
# "https://secure",
# "http://edge",
# "https:",
# "https://sb",
# "http://b"]
There are a lot of other entries, but grep filters them out using a simple /^https?:/ pattern.
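As an alternative to grep, URI.extract also takes a list of schemes as its second argument. The sketch below uses a made-up string rather than a fetched page. Note that the standard library documentation marks URI.extract as obsolete, so treat this as a quick tool rather than a long-term solution.

```ruby
require 'uri'

# Sample input, made up for this sketch.
text = 'See http://example.com/a and https://example.org/b ' \
       'or mailto:someone@example.com for details'

# Passing the schemes restricts the matches, so no grep is needed afterwards.
web_links = URI.extract(text, %w[http https])

puts web_links
# http://example.com/a
# https://example.org/b
```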
A simple starting point with Nokogiri is:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page/21069456#21069456').read)
urls = doc.search('a, img').map { |tag|
  # pick the attribute that carries the URL for each kind of tag
  case tag.name.downcase
  when 'a'
    tag['href']
  when 'img'
    tag['src']
  end
}
urls
# => ["//stackexchange.com/sites",
# "http://chat.stackoverflow.com",
# "http://blog.stackexchange.com",
# "//stackoverflow.com",
# "//meta.stackoverflow.com",
# "//careers.stackoverflow.com",
# "//stackexchange.com",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%2f21069456",
# "/tour",
# "/help",
# "//careers.stackoverflow.com",
# "/",
# "/questions",
# "/tags",
# "/about",
# "/users",
# "/questions/ask",
# "/about",
# nil,
# "/questions/21069348/extract-all-links-from-web-page",
# nil,
# nil,
# "#",
# "http://stackoverflow.com/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/q/21069348",
# "/posts/21069348/edit",
# "/users/2886945/ivan-denisov",
# "/users/2886945/ivan-denisov",
# "/users/2767755/arup-rakshit",
# "/users/2886945/ivan-denisov",
# nil,
# nil,
# "/questions/21069348/extract-all-links-from-web-page?answertab=active#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=oldest#tab-top",
# "/questions/21069348/extract-all-links-from-web-page?answertab=votes#tab-top",
# nil,
# nil,
# nil,
# "http://www.ruby-doc.org/stdlib-2.1.0/libdoc/uri/rdoc/URI.html#method-c-extract",
# "/a/21069456",
# "/posts/21069456/revisions",
# "/users/128421/the-tin-man",
# "/users/128421/the-tin-man",
# nil,
# nil,
# nil,
# nil,
# "http://regex101.com/r/hN4dI0",
# "/a/21069536",
# "/users/1214800/r3mus",
# "/users/1214800/r3mus",
# nil,
# nil,
# "/users/login?returnurl=%2fquestions%2f21069348%2fextract-all-links-from-web-page%23new-answer",
# "#",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/legal/terms-of-service",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "/questions/ask",
# "/questions/tagged/html",
# "/questions/tagged/ruby-on-rails",
# "/questions/tagged/ruby",
# "/questions/tagged/regex",
# "/questions/tagged/hyperlink",
# "?lastactivity",
# "/q/21052437",
# "/questions/21052437/are-these-two-lines-the-same-vs",
# "/q/6700367",
# "/questions/6700367/getting-all-links-of-a-webpage-using-ruby",
# "/q/430966",
# "/questions/430966/regex-for-links-in-html-text",
# "/q/3703712",
# "/questions/3703712/extract-all-links-from-a-html-page-exclude-links-from-a-specific-table",
# "/q/5120171",
# "/questions/5120171/extract-links-from-a-web-page",
# "/q/6816138",
# "/questions/6816138/extract-absolute-links-from-a-page-uisng-htmlparser",
# "/q/10177910",
# "/questions/10177910/php-regular-expression-extracting-html-links",
# "/q/10217857",
# "/questions/10217857/extracting-background-images-from-a-web-page-parsing-htmlcss",
# "/q/11300496",
# "/questions/11300496/how-to-extract-a-link-from-head-tag-of-a-remote-page-using-curl",
# "/q/11307491",
# "/questions/11307491/how-to-extract-all-links-on-a-page-using-crawler4j",
# "/q/17712493",
# "/questions/17712493/extract-links-from-bbcode-with-ruby",
# "/q/20290869",
# "/questions/20290869/strip-away-html-tags-from-extracted-links",
# "//stackexchange.com/questions?tab=hot",
# "http://superuser.com/questions/698312/if-32-bit-machines-can-only-handle-numbers-up-to-232-why-can-i-write-100000000",
# "http://scifi.stackexchange.com/questions/47868/why-did-smeagol-become-addicted-to-the-ring-when-bilbo-did-not",
# "http://english.stackexchange.com/questions/145672/idiom-for-trying-and-failing-falling-short-and-being-disapproved",
# "http://math.stackexchange.com/questions/634191/are-the-integers-closed-under-addition-really",
# "http://codegolf.stackexchange.com/questions/18254/how-to-write-a-c-program-for-multiplication-without-using-and-operator",
# "http://tex.stackexchange.com/questions/153563/how-to-align-terms-in-alignat-environment",
# "http://rpg.stackexchange.com/questions/31426/how-do-have-interesting-events-happen-after-a-success",
# "http://math.stackexchange.com/questions/630339/pedagogy-how-to-cure-students-of-the-law-of-universal-linearity",
# "http://codegolf.stackexchange.com/questions/17005/produce-the-number-2014-without-any-numbers-in-your-source-code",
# "http://academia.stackexchange.com/questions/15595/why-are-so-many-badly-written-papers-still-published",
# "http://tex.stackexchange.com/questions/153598/how-to-draw-empty-nodes-in-tikz-qtree",
# "http://english.stackexchange.com/questions/145157/a-formal-way-to-say-i-dont-want-to-sound-too-cocky",
# "http://physics.stackexchange.com/questions/93256/is-it-possible-to-split-baryons-and-extract-useable-energy-out-of-it",
# "http://mathematica.stackexchange.com/questions/40213/counting-false-values-at-the-ends-of-a-list",
# "http://electronics.stackexchange.com/questions/96139/difference-between-a-bus-and-a-wire",
# "http://aviation.stackexchange.com/questions/921/why-do-some-aircraft-have-multiple-ailerons-per-wing",
# "http://stackoverflow.com/questions/21052437/are-these-two-lines-the-same-vs",
# "http://biology.stackexchange.com/questions/14414/if-there-are-no-human-races-why-do-human-populations-have-several-distinct-phen",
# "http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems",
# "http://codegolf.stackexchange.com/questions/18028/largest-number-printable",
# "http://unix.stackexchange.com/questions/108858/seek-argument-in-command-dd",
# "http://linguistics.stackexchange.com/questions/6375/can-the-chinese-script-be-used-to-record-non-chinese-languages",
# "http://rpg.stackexchange.com/questions/31346/techniques-for-making-undead-scary-again",
# "http://math.stackexchange.com/questions/632705/why-are-mathematical-proofs-that-rely-on-computers-controversial",
# "#",
# "/feeds/question/21069348",
# "/about",
# "/help",
# "/help/badges",
# "http://blog.stackexchange.com?blb=1",
# "http://chat.stackoverflow.com",
# "http://data.stackexchange.com",
# "http://stackexchange.com/legal",
# "http://stackexchange.com/legal/privacy-policy",
# "http://stackexchange.com/about/hiring",
# "http://engine.adzerk.net/r?e=eyJhdiI6NDE0LCJhdCI6MjAsImNtIjo5NTQsImNoIjoxMTc4LCJjciI6Mjc3NiwiZG0iOjQsImZjIjoyODYyLCJmbCI6Mjc1MSwibnciOjIyLCJydiI6MCwicHIiOjExNSwic3QiOjAsInVyIjoiaHR0cDovL3N0YWNrb3ZlcmZsb3cuY29tL2Fib3V0L2NvbnRhY3QiLCJyZSI6MX0&s=hRods5B22XvRBwWIwtIMekcyNF8",
# nil,
# "/contact",
# "http://meta.stackoverflow.com",
# "http://stackoverflow.com",
# "http://serverfault.com",
# "http://superuser.com",
# "http://webapps.stackexchange.com",
# "http://askubuntu.com",
# "http://webmasters.stackexchange.com",
# "http://gamedev.stackexchange.com",
# "http://tex.stackexchange.com",
# "http://programmers.stackexchange.com",
# "http://unix.stackexchange.com",
# "http://apple.stackexchange.com",
# "http://wordpress.stackexchange.com",
# "http://gis.stackexchange.com",
# "http://electronics.stackexchange.com",
# "http://android.stackexchange.com",
# "http://security.stackexchange.com",
# "http://dba.stackexchange.com",
# "http://drupal.stackexchange.com",
# "http://sharepoint.stackexchange.com",
# "http://ux.stackexchange.com",
# "http://mathematica.stackexchange.com",
# "http://stackexchange.com/sites#technology",
# "http://photo.stackexchange.com",
# "http://scifi.stackexchange.com",
# "http://cooking.stackexchange.com",
# "http://diy.stackexchange.com",
# "http://stackexchange.com/sites#lifearts",
# "http://english.stackexchange.com",
# "http://skeptics.stackexchange.com",
# "http://judaism.stackexchange.com",
# "http://travel.stackexchange.com",
# "http://christianity.stackexchange.com",
# "http://gaming.stackexchange.com",
# "http://bicycles.stackexchange.com",
# "http://rpg.stackexchange.com",
# "http://stackexchange.com/sites#culturerecreation",
# "http://math.stackexchange.com",
# "http://stats.stackexchange.com",
# "http://cstheory.stackexchange.com",
# "http://physics.stackexchange.com",
# "http://mathoverflow.net",
# "http://stackexchange.com/sites#science",
# "http://stackapps.com",
# "http://meta.stackoverflow.com",
# "http://area51.stackexchange.com",
# "http://careers.stackoverflow.com",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://blog.stackoverflow.com/2009/06/attribution-required/",
# "http://creativecommons.org/licenses/by-sa/3.0/",
# "http://i.stack.imgur.com/IgtEd.jpg?s=32&g=1",
# "https://www.gravatar.com/avatar/71770d043c0f7e3c7bc5f74190015c26?s=32&d=identicon&r=PG",
# "http://i.stack.imgur.com/fmgha.jpg?s=32&g=1",
# "/posts/21069348/ivc/8228",
# "http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif"]
That uses a case statement to apply a bit of "smarts" to know which attribute should be retrieved from a particular type of tag. More work would need to be done, since an anchor could use an onclick handler, and there could be other tags being used for JavaScript events.
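The list above mixes absolute URLs, protocol-relative URLs ("//..."), site-relative paths, nil entries and bare "#" fragments. One possible cleanup step, sketched here with the standard library's URI.join, resolves everything against the page address. The short urls array is a hand-picked sample of the output above, not a fresh scrape.

```ruby
require 'uri'

# Address of the page the links were taken from.
base = 'http://stackoverflow.com/questions/21069348/extract-all-links-from-web-page'

# A few representative entries from the scraped list, including a nil and a duplicate.
urls = ['//stackexchange.com/sites', '/tour', nil, '#',
        'http://chat.stackoverflow.com', '/tour']

absolute = urls
  .compact                                # drop tags that had no href/src
  .reject { |u| u.start_with?('#') }      # drop fragment-only links
  .map { |u| URI.join(base, u).to_s }     # resolve against the page URL
  .uniq                                   # keep each URL once

puts absolute
# http://stackexchange.com/sites
# http://stackoverflow.com/tour
# http://chat.stackoverflow.com
```

Real pages can contain href values that URI.join rejects with URI::InvalidURIError, so production code would probably need a rescue around the map step.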
Extracting all URLs from a page using Ruby
Remember that a URL doesn't have to start with "http": it could be a relative URL, resolved against the address of the current page. In my opinion it is best to use Nokogiri to parse the HTML:
require 'open-uri'
require 'nokogiri'
reqt = URI.open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end
But if you really want to find only the absolute URLs, add a simple condition:
puts a.attr('href') if a.attr('href') =~ /^http/i
How to select all links in a page and store it in an array in capybara?
When(/^I search for all links on homepage$/) do
  within(".wrapper") do
    # get the text of every link; use map { |a| a[:href] } to collect the URLs instead
    all_links = all("a").map(&:text)
    all_links.each do |i|
      puts i
    end
  end
end
Scrape URLs From Web
There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:
html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a
href="http://www.example.com/foobar"
>foobar</a></p>
<p><a
href="http://www.example.com/foobar2"
>foobar2</p>
EOT
require 'nokogiri'
doc = Nokogiri::HTML(html)
links = Hash[
  *doc.search('a').map { |a|
    [
      a['href'],
      a.content
    ]
  }.flatten
]
require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >> "http://www.example.com/foo1"=>"foo1",
# >> "http://www.example.com/foo2"=>"foo2",
# >> "http://www.example.com/bar"=>"bar",
# >> "http://www.example.com/foobar"=>"foobar",
# >> "http://www.example.com/foobar2"=>"foobar2"}
This returns a hash of URLs as keys with the related content of the <a> tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:
links = doc.search('a').map { |a|
  [
    a['href'],
    a.content
  ]
}
which results in:
# >> [["http://www.example.com/foo", "foo"],
# >> ["http://www.example.com/foo1", "foo1"],
# >> ["http://www.example.com/foo2", "foo2"],
# >> ["http://www.example.com/bar", "bar"],
# >> ["http://www.example.com/foobar", "foobar"],
# >> ["http://www.example.com/foobar2", "foobar2"]]
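To make the difference between the two shapes concrete, here is a small sketch with plain arrays and no Nokogiri involved. The pairs are invented and deliberately contain a repeated URL. It also shows a third option, not in the original answer, that groups the link texts per URL.

```ruby
# Invented [href, text] pairs with one URL appearing twice.
pairs = [
  ['http://www.example.com/foo', 'foo'],
  ['http://www.example.com/bar', 'bar'],
  ['http://www.example.com/foo', 'foo again']
]

# As a hash, the later duplicate silently overwrites the earlier value.
as_hash = pairs.to_h

# Grouping keeps every link text while still keying by URL.
grouped = pairs.group_by(&:first).transform_values { |ps| ps.map(&:last) }

p as_hash
# {"http://www.example.com/foo"=>"foo again", "http://www.example.com/bar"=>"bar"}
p grouped
# {"http://www.example.com/foo"=>["foo", "foo again"], "http://www.example.com/bar"=>["bar"]}
```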
I used a CSS accessor 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only links, ignoring anchors.
Regular expressions are very fragile when dealing with HTML and XML because the markup formats are too freeform. They can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your code is at the mercy of whoever does generate it when using regex. A simple change in the file can break the pattern badly, resulting in a continual maintenance headache.
A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML but the code didn't care. Compare the simplicity of the parser version vs. a regex solution and think of long term maintainability.
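For comparison, here is a sketch of what a naive regex does with most of the same sample markup. The pattern is an assumption, chosen as the kind of thing one might write first. It expects a single space after "a" and double quotes around the value, so it misses every variation that the parser handled.

```ruby
# Five of the links from the sample HTML above, variations included.
html = <<~EOT
  <p><a href="http://www.example.com/foo">foo</a></p>
  <p><a href='http://www.example.com/foo1'>foo1</p></a>
  <p><a href=http://www.example.com/foo2>foo2</a></p>
  <p><a href = http://www.example.com/bar>bar</p>
  <p><a
  href="http://www.example.com/foobar"
  >foobar</a></p>
EOT

# A naive first attempt: one space, double quotes, nothing else allowed.
naive = /<a href="([^"]+)"/

found = html.scan(naive).flatten

puts found.size   # 1
puts found        # http://www.example.com/foo
```

Each missed case could be patched with a more elaborate pattern, but every patch adds to the maintenance burden described above.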