Check If URL Exists in Ruby

Check if URL exists in Ruby

Use the Net::HTTP library.

require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)

At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:

do_something_with_it(url) if res.code == "200"

Note: To check an https URL, set the use_ssl attribute to true:

require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
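
The two snippets above can be folded into one helper that picks SSL from the URL scheme. This is only a sketch: the method names head_request_plan and head_response are made up for illustration, and it uses uri.request_uri instead of url.path so that a URL with no path still sends "/" and any query string is kept.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: works out how to connect and what to request.
# Kept separate from the network call so it can be inspected on its own.
def head_request_plan(url_string)
  uri = URI.parse(url_string)
  {
    host: uri.host,
    port: uri.port,                      # 80 for http, 443 for https by default
    use_ssl: uri.scheme == "https",
    target: uri.request_uri              # path plus query; "/" when the path is empty
  }
end

# Hypothetical helper: sends the HEAD request and returns a Net::HTTPResponse.
def head_response(url_string)
  plan = head_request_plan(url_string)
  Net::HTTP.start(plan[:host], plan[:port], use_ssl: plan[:use_ssl]) do |http|
    http.request_head(plan[:target])
  end
end
```

Calling head_response("https://www.google.com/").code would then give the status code as a string, the same as res.code above.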

How to test if a URL exists

Why not just wrap your code in a begin/rescue block, like this:

require "net/http"

begin
  url = URI.parse("http://www.someurl.lc/")
  req = Net::HTTP.new(url.host, url.port)
  res = req.request_head(url.path)
  puts res.code
rescue => e
  # A bare rescue catches every StandardError, including DNS failures
  puts "Exception: #{e}"
  # do the next thing
end

Update

You should not rescue every standard error, because that also hides unrelated bugs. Rescue specific errors instead:

require "net/http"

begin
  url = URI.parse("http://www.someurl.lc/")
  req = Net::HTTP.new(url.host, url.port)
  res = req.request_head(url.path)
  puts res.code
rescue SocketError => e
  # Raised when the host name cannot be resolved
  puts "Exception: #{e}"
  # do the next thing
end

Start by rescuing only SocketError, and add other error classes as you run into them, as sawa mentioned in the comments.
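
As a rough sketch of where that list of error classes might end up, here is one possible helper. The method name url_exists?, the constant UNREACHABLE_ERRORS, the choice of error classes, and the 5-second timeout are all assumptions made for illustration, not part of the original answer.

```ruby
require "net/http"
require "uri"

# Error classes treated as "URL not reachable". This list is an assumption;
# extend it as new failure modes show up.
UNREACHABLE_ERRORS = [
  SocketError,           # DNS lookup failed
  Errno::ECONNREFUSED,   # nothing is listening on the port
  Net::OpenTimeout,      # connecting took too long
  Net::ReadTimeout,      # server accepted the connection but did not answer in time
  URI::InvalidURIError   # the string could not be parsed as a URI
].freeze

# Hypothetical helper: true only when a HEAD request answers "200".
def url_exists?(url_string, timeout: 5)
  uri = URI.parse(url_string)
  # Only http and https URLs can be checked this way
  return false unless uri.is_a?(URI::HTTP) && uri.host

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.request_head(uri.request_uri).code == "200"
  end
rescue *UNREACHABLE_ERRORS
  false
end
```

Any exception outside the list still propagates, so a genuine bug in the calling code is not silently reported as a missing URL.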

How to check if a URL is valid

Notice:

As @CGuess pointed out, this approach has a known problem: the regular expression was never meant for validation, which has been documented for over 9 years now (see https://bugs.ruby-lang.org/issues/6520).


Use the URI module distributed with Ruby:

require 'uri'

if url =~ URI::regexp
  # Correct URL
end

As Alexander Günther said in the comments, this only checks whether the string contains a URL.

To check whether the whole string is a URL, anchor the expression:

url =~ /\A#{URI::regexp}\z/

If you only want to check for web URLs (http or https), use this:

url =~ /\A#{URI::regexp(['http', 'https'])}\z/
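
Given the caveat in the notice above, and since URI.regexp is marked obsolete in current Ruby, an alternative is to let URI.parse do the work and inspect what it returns. This is a sketch of a different technique than the regular expression shown above; the method name http_url? is made up for illustration.

```ruby
require "uri"

# Hypothetical helper: true when the whole string parses as an http or https URL
# with a non-empty host.
def http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Raised for strings that are not URIs at all, such as text containing spaces
  false
end
```

For example, http_url?("http://www.google.com/") is true, while http_url?("ftp://example.com") and http_url?("see http://example.com here") are both false. This checks syntax only and says nothing about whether the URL is reachable.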

How do I check if a file exists using its URL without downloading it?

If you want to use Ruby's included Net::HTTP, you can do it this way:

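require "net/http"

# True when a HEAD request answers 200; no body is downloaded (method name is illustrative).
def file_exists_at?(url)
  uri = URI(url)
  request = Net::HTTP.new(uri.host, uri.port)
  request.use_ssl = (uri.scheme == "https")
  response = request.request_head(uri.request_uri)
  response.code.to_i == 200
end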

Recurring job to check if url exists

Writing and running a simple link checker is easy. Doing that for thousands of links quickly, without redundancy, while handling dead and slow-responding links without bogging down your entire system, is harder.

I'd use three threads, plus two queues:

  1. A dispatcher thread that only reads from the database. It is responsible for finding URLs to be checked and pushing them into a "to be checked" queue.
  2. A worker thread that consumes from the first queue and pushes results into the "updated URL results" queue.
  3. An updater/consumer thread that takes the results from the worker in #2 and updates the database.

Ruby has some built-in classes to help:

  • Thread
  • Queue
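
The three-thread, two-queue layout described above can be sketched with just Thread and Queue. Everything here is a stand-in: run_pipeline is a hypothetical name, the urls array plays the role of the database read, the stored hash plays the role of the database update, and the checker is whatever callable you pass in (a real one would make the HTTP request).

```ruby
# Hypothetical sketch: dispatcher -> worker -> updater, connected by two queues.
def run_pipeline(urls, checker)
  to_check = Queue.new   # the "to be checked" queue
  results  = Queue.new   # the "updated URL results" queue
  stored   = {}          # stand-in for the database table being updated

  # 1. Dispatcher: only reads URLs (here from an array) and queues them.
  dispatcher = Thread.new do
    urls.each { |url| to_check << url }
    to_check << :done    # sentinel telling the worker there is nothing more
  end

  # 2. Worker: checks each URL and pushes the outcome to the second queue.
  worker = Thread.new do
    while (url = to_check.pop) != :done
      results << [url, checker.call(url)]
    end
    results << :done
  end

  # 3. Updater: the only thread that writes results back.
  updater = Thread.new do
    while (item = results.pop) != :done
      url, ok = item
      stored[url] = ok
    end
  end

  [dispatcher, worker, updater].each(&:join)
  stored
end
```

With a stub checker such as ->(url) { url.start_with?("https") }, run_pipeline returns a hash mapping each URL to true or false. Queue#pop blocks while a queue is empty, so no extra locking is needed in this layout.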

I'd highly recommend Typhoeus and Hydra for use in the middle thread. The documentation for these two classes covers a lot of what you need to do as far as handling multiple requests running in parallel.

I wouldn't write this code as part of a Rails application. Rails adds no value here, nor is it necessary. I would either require Active Record and piggy-back on the existing database.yml settings and models, or use Rails' "runner" to run the code as an adjunct to the Rails code.

Or, I'd write a small, application-specific piece of code to run on a different server to avoid bogging down the Rails server. Using something like the MySQL or PostgreSQL drivers would let you talk to the same database that Rails uses. In this case I'd use the Sequel gem as the ORM, but that's because I prefer it over Active Record.

There are a lot of things to consider as you write this code, including retrying failed URLs, detecting redirections and updating the source URLs to reflect them so you don't waste time, and not hammering the hosting servers, which can get you banned.
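
As one small illustration of the retry concern, a retry wrapper with a growing delay might look like the sketch below. The name with_retries, its parameters, and the doubling delay are all assumptions chosen for illustration; redirect handling and rate limiting are not covered here.

```ruby
# Hypothetical helper: runs the block, retrying on the listed errors.
# The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
def with_retries(attempts: 3, base_delay: 0.5, retry_on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *retry_on
    # Give up and re-raise once the attempt budget is used
    raise if tries >= attempts
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end
```

A check could then be wrapped as with_retries(retry_on: [SocketError]) { ... }, so that a transient failure does not immediately mark a URL as dead while the pauses keep the checker from pounding a struggling host.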

I've written several apps for this purpose over the years, and doing it right takes a lot of forethought. Think out your design up front; otherwise you could end up with some major rewrites later on.


