Check if URL exists in Ruby
Use the Net::HTTP library.
require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:
do_something_with_it(url) if res.code == "200"
Note: To check an https URL, set the use_ssl attribute to true:
require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
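The two snippets above can be folded into one helper that picks SSL from the URL scheme. This is only a sketch: the name url_exists? and the 5 second timeout are assumptions, not part of the original answer, and it expects an http or https URL.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: true when a HEAD request to +url+ answers 200.
# Assumes +url+ uses the http or https scheme.
def url_exists?(url, timeout: 5)
  uri = URI.parse(url)

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https", # enable TLS only for https
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    # request_uri is the path plus query string, and is "/" when the path is empty
    http.request_head(uri.request_uri).code == "200"
  end
end
```

Called as url_exists?("https://www.google.com/"), it returns true or false for any response, and still raises on network failures such as an unknown host.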
How to test if a URL exists
Why not just put your code in a begin/rescue block like this:
require "net/http"
begin
  url = URI.parse("http://www.someurl.lc/")
  req = Net::HTTP.new(url.host, url.port)
  res = req.request_head(url.path)
  puts res.code
rescue => e
  puts "Exception: #{e}"
  # do the next thing
end
Update
You should not rescue all the standard errors. You can rescue specific errors like this:
require "net/http"
begin
  url = URI.parse("http://www.someurl.lc/")
  req = Net::HTTP.new(url.host, url.port)
  res = req.request_head(url.path)
  puts res.code
rescue SocketError => e
  puts "Exception: #{e}"
  # do the next thing
end
You should start by rescuing only SocketError, and add other error classes as you run into them, as sawa mentioned in the comments.
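Once more error classes have accumulated, they can be kept in one list and rescued with a splat. The sketch below is one way to do that; the helper name head_status and the exact contents of NETWORK_ERRORS are assumptions, so trim or extend the list to match the failures you actually see.

```ruby
require "net/http"
require "uri"
require "openssl"

# Assumed list of errors for unreachable or misbehaving hosts; extend as needed.
NETWORK_ERRORS = [
  SocketError,            # DNS lookup failed
  Errno::ECONNREFUSED,    # nothing is listening on the port
  Errno::EHOSTUNREACH,    # no route to the host
  Net::OpenTimeout,       # connection took too long to open
  Net::ReadTimeout,       # server accepted but did not answer in time
  OpenSSL::SSL::SSLError  # TLS handshake or certificate problem
].freeze

# Hypothetical helper: returns the status code as a String, or nil on a network error.
def head_status(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: 5, read_timeout: 5) do |http|
    http.request_head(uri.request_uri).code
  end
rescue *NETWORK_ERRORS => e
  warn "Request failed: #{e.class}: #{e.message}"
  nil
end
```

Any exception outside the list still propagates, which keeps programming errors visible instead of silently swallowing them.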
How to check if a URL is valid
Notice:
As @CGuess pointed out, there is a known bug with this approach, and it has been documented for over 9 years that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
  # Correct URL
end
As Alexander Günther said in the comments, this checks whether a string contains a URL.
To check whether the whole string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
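Given the notice above that this regular expression was not meant for validation, a different technique is to let URI.parse do the work and inspect the result. This is a sketch of that alternative, not the method from the answer; the name web_url? is made up for illustration.

```ruby
require "uri"

# Hypothetical helper: true when +string+ parses as an absolute http(s) URL with a host.
def web_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # raised for strings URI.parse cannot handle, such as ones containing spaces
  false
end
```

For example, web_url?("https://www.google.com/") is true, while web_url?("ftp://example.com/file") and web_url?("example.com") are both false. Like the regular expression, this checks syntax only, not whether the URL resolves.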
How do I check if a file exists using its URL without downloading it?
If you want to use Ruby's included Net::HTTP, you can do it this way:
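require "net/http"
uri = URI(url)
request = Net::HTTP.new(uri.host, uri.port)
request.use_ssl = (uri.scheme == "https")
# request_uri is never empty, unlike uri.path for a URL such as "http://host"
response = request.request_head(uri.request_uri)
return response.code.to_i == 200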
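File URLs often answer with a redirect rather than a plain 200, in which case the check above reports false. A possible extension, sketched here with the assumed name remote_file_exists? and an arbitrary limit of 5 hops, follows Location headers using HEAD requests only, so no body is downloaded.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: follows up to +limit+ redirects with HEAD requests,
# then reports whether the final response is a 2xx.
def remote_file_exists?(url, limit: 5)
  uri = URI.parse(url)
  limit.times do
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request_head(uri.request_uri)
    end
    return response.is_a?(Net::HTTPSuccess) unless response.is_a?(Net::HTTPRedirection)

    location = response["location"]
    return false if location.nil? # redirect status without a target
    uri = uri.merge(location)     # Location may be relative, so resolve it
  end
  false # gave up after too many redirects
end
```

Treating "too many redirects" as false is a design choice for this sketch; raising an error instead would make redirect loops easier to spot.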
Recurring job to check if url exists
Writing and running a simple link checker is easy. Doing that for 1000s of links quickly, without redundancy, and handling dead and slow-responding links without bogging down your entire system gets harder.
I'd use three threads, plus two queues:
- A dispatcher thread that only reads from the database. It is responsible for finding URLs and pushing them into a "to be checked" queue.
- A worker thread that consumes from the first queue and pushes results into the "updated URL results" queue.
- An updater/consumer thread that takes the results produced by the worker thread and updates the database.
Ruby has some built-in classes to help:
- Thread
- Queue
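The three threads and two queues described above could be wired together roughly like this. It is a minimal sketch under stated assumptions: the function name run_link_check is invented, the urls array stands in for the database read, the returned hash stands in for the database write, and checker is any callable that performs the actual HTTP check.

```ruby
# Sketch of the dispatcher / worker / updater pipeline using only Thread and Queue.
def run_link_check(urls, checker)
  to_check = Queue.new # "to be checked" queue
  results  = Queue.new # "updated URL results" queue
  statuses = {}        # stand-in for the database table

  dispatcher = Thread.new do
    urls.each { |url| to_check << url }
    to_check.close # tells the worker no more URLs are coming
  end

  worker = Thread.new do
    # pop returns nil once the queue is closed and drained
    while (url = to_check.pop)
      results << [url, checker.call(url)]
    end
    results.close
  end

  updater = Thread.new do
    while (pair = results.pop)
      url, status = pair
      statuses[url] = status # a real updater would write to the database here
    end
  end

  [dispatcher, worker, updater].each(&:join)
  statuses
end
```

With a stub such as checker = ->(url) { 200 }, the pipeline can be exercised without touching the network. A real checker would wrap the HEAD request, and more worker threads could be added to pop from the same queue.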
I'd highly recommend Typhoeus and Hydra for use in the middle thread. The documentation for these two classes covers a lot of what you need to do as far as handling multiple requests running in parallel.
I wouldn't write this code as part of a Rails application. Rails adds no value here, nor is it necessary. I would either require Active Record and piggy-back on the existing database.yml settings and models, or use Rails' "runner" to run the code as an adjunct to the Rails code.
Or, I'd write a small, application-specific piece of code to run on a different server to avoid bogging down the Rails server. Using something like the MySQL or PostgreSQL drivers would let you talk to the same database that Rails uses. In this case I'd use the Sequel gem to act as the ORM, but that's because I prefer it over Active Record.
There are a lot of things to consider as you write this code, including retrying failed URLs, detecting redirections and updating the source URLs to reflect them so you don't waste time, and not hammering the hosting servers, which could get you banned.
I've written several apps for this purpose over the years, and doing it right takes a lot of forethought. Think out your design up front; otherwise you could end up with some major rewrites later on.