How to handle 404 not found errors in Nokogiri
It doesn't work because you are not rescuing the part of the code that actually raises the error on a 404 status: the open(url) call. The following code should work:
require 'nokogiri'
require 'open-uri'

url = 'http://yoursite/page/38475'
begin
  file = open(url)
  doc = Nokogiri::HTML(file)
  # handle doc
rescue OpenURI::HTTPError => e
  if e.message == '404 Not Found'
    # handle 404 error
  else
    raise e
  end
end
BTW, about rescuing Exception, see: Why is it bad style to `rescue Exception => e` in Ruby?
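To illustrate the point that question makes, here is a minimal sketch of the difference, reusing the url from the snippet above: rescuing Exception also swallows signals and exit requests, while StandardError (the default for a bare rescue) does not.
begin
  doc = Nokogiri::HTML(open(url))
rescue Exception => e
  # too broad: this also catches Interrupt (Ctrl-C), SystemExit, NoMemoryError, ...
end

begin
  doc = Nokogiri::HTML(open(url))
rescue StandardError => e
  # catches ordinary runtime errors such as OpenURI::HTTPError,
  # but lets signals and exit requests propagate
end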
Getting the contents of a 404 error page response in Ruby
This doesn't seem to be stated clearly enough in the docs, but OpenURI::HTTPError has an io attribute which, as far as I know, you can treat as a read-only file.
require 'open-uri'

begin
  response = open('http://google.com/blahblah')
rescue OpenURI::HTTPError => e
  puts e              # error message
  puts e.io.status    # HTTP status code, e.g. ["404", "Not Found"]
  puts e.io.readlines # HTTP response body
end
404 not found, but can access normally from web browser
You're getting 404 Not Found (OpenURI::HTTPError), so if you want your code to continue, rescue that exception. Something like this should work:
require 'nokogiri'
require 'open-uri'

URLS = %w[
  http://www.moxyst.com/fashion/men-clothing/underwear.html
]

URLS.each do |url|
  begin
    doc = Nokogiri::HTML(open(url))
  rescue OpenURI::HTTPError => e
    puts "Can't access #{ url }"
    puts e.message
    puts
    next
  end
  puts doc.to_html
end
You can use more generic exceptions, but then you risk getting weird output, or handling an unrelated problem in a way that causes more problems, so you'll need to figure out the granularity you need. If you want even more control, or want to do something different for a 401 versus a 404, you could sniff the HTTP headers, the status of the response, or the exception message, as in the sketch below.
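For example, a minimal sketch of branching on the status carried by the exception (e.io.status is an array like ["404", "Not Found"], as shown in the earlier answer); the url variable and the puts handling are just placeholders:
begin
  doc = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  status = e.io.status.first  # e.g. "404"
  case status
  when '401'
    puts "Unauthorized: #{ url }"  # perhaps retry with credentials
  when '404'
    puts "Not found: #{ url }"     # skip or flag this URL
  else
    raise                          # anything else is unexpected here
  end
end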
I can access this from a web browser, I just don't get it at all.
Well, that could be something happening on the server side: perhaps they don't like the User-Agent string you're sending? The OpenURI documentation shows how to change that header:
Additional header fields can be specified by an optional hash argument.
open("http://www.ruby-lang.org/en/",
"User-Agent" => "Ruby/#{RUBY_VERSION}",
"From" => "foo@bar.invalid",
"Referer" => "http://www.ruby-lang.org/") {|f|
# ...
}
rescue Nokogiri error
Turns out a simple rescue StandardError did the trick.
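A minimal sketch of what that looks like, assuming a url and an OpenURI/Nokogiri fetch like the ones above:
require 'nokogiri'
require 'open-uri'

begin
  doc = Nokogiri::HTML(open(url))
  # work with doc
rescue StandardError => e
  # covers OpenURI::HTTPError, SocketError, and similar runtime errors
  puts "Skipping #{ url }: #{e.class} - #{e.message}"
end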
In RoR, how do I catch an exception if I get no response from a server?
This is a generic sample of how you can define timeout durations for the HTTP connection and perform several retries in case of an error while fetching content:
require 'open-uri'
require 'nokogiri'

url = "http://localhost:3000/r503"

openuri_params = {
  # set timeout durations for the HTTP connection
  # default values for open_timeout and read_timeout are 60 seconds
  :open_timeout => 1,
  :read_timeout => 1,
}

attempt_count = 0
max_attempts = 3

begin
  attempt_count += 1
  puts "attempt ##{attempt_count}"
  content = open(url, openuri_params).read
rescue OpenURI::HTTPError => e
  # it's a 404, etc. (do nothing)
rescue SocketError, Net::ReadTimeout => e
  # server can't be reached or doesn't send any response
  puts "error: #{e}"
  sleep 3
  retry if attempt_count < max_attempts
else
  # connection was successful,
  # content is fetched,
  # so here we can parse content with Nokogiri,
  # or call a helper method, etc.
  doc = Nokogiri::HTML(content)
  p doc
end
Why does OpenURI return a 404, when the parsed URL works fine in browser?
I found out that the problem was with the server of the URL I was trying to parse. They rejected the default User-Agent used by OpenURI.
From the documentation on OpenURI, it says that additional header fields can be specified by an optional hash argument:
open("http://www.ruby-lang.org/en/",
"User-Agent" => "Ruby/#{RUBY_VERSION}",
"From" => "foo@bar.invalid",
"Referer" => "http://www.ruby-lang.org/") {|f|
# ...
}
I just used a different User-Agent and everything worked fine.
How should my scraping stack handle 404 errors?
TL;DR
Use out-of-band error handling and a different conceptual scraping model to speed up operations.
Exceptions Are Not for Common Conditions
There are a number of other answers that address how to handle exceptions for your use case. I'm taking a different approach by saying that handling exceptions is fundamentally the wrong approach here, for a number of reasons. In his book Exceptional Ruby, Avdi Grimm provides benchmarks showing that exceptions are ~156% slower than alternative coding techniques such as early returns.
In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!
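To illustrate the distinction, here is a hedged sketch using Net::HTTP, which returns a response object rather than raising on a 404, so the status code can be handled as an ordinary value; the URL is hypothetical:
require 'net/http'
require 'nokogiri'

uri = URI('http://yoursite/page/38475')  # hypothetical URL
response = Net::HTTP.get_response(uri)   # returns a response object; a 404 does not raise

if response.is_a?(Net::HTTPSuccess)
  doc = Nokogiri::HTML(response.body)
  # parse and store the document
else
  # a 404 (or any other status) is just data here, not an exception
  puts "Got #{response.code} for #{uri}"
end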
One Alternative: A Faster, More Atomic Process
You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.Consider this example schema:
ActiveRecord::Schema.define(:version => 20120718124422) do
  create_table "webcrawls", :force => true do |t|
    t.text     "raw_html"
    t.integer  "retries"
    t.integer  "status_code"
    t.text     "parsed_data"
    t.datetime "created_at", :null => false
    t.datetime "updated_at", :null => false
  end
end
The idea here is that you would simply treat the entire scrape as an atomic process. For example:
Did you get the page?
Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.
Did you get a 404?
Fine, store the error page and the status code. Move on quickly!
By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.
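As a rough sketch of that atomic fetch-and-store step, assuming a Webcrawl model backed by the "webcrawls" table above (the method name and retry bookkeeping are placeholders to adapt to your app):
require 'net/http'

# "Webcrawl" is the ActiveRecord model implied by the "webcrawls" table above
def crawl(url)
  response = Net::HTTP.get_response(URI(url))

  Webcrawl.create!(
    :raw_html    => response.body,      # store the page (or the error page) as-is
    :status_code => response.code.to_i, # 200, 404, ... -- just another column
    :retries     => 0
  )
  # parsed_data can be filled in later by a separate pass over raw_html,
  # so the fetch itself stays fast and atomic
end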
Final Thoughts: Scaling Up and Out
Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:
- Forking background processes.
- Using dRuby to split work among multiple processes or machines.
- Maximizing core usage by spawning multiple external processes using GNU parallel.
- Something else that isn't a monolithic, sequential process.
Rails/Nokogiri: parse multiple sites, handling errors
You can try to put each one of those within a begin/rescue block, so it doesn't fail if one of the sites is unavailable. Then you can handle those exceptions if necessary.
begin
  docvariable1 = Nokogiri::HTML(open("http://www.site1.com/"))
  @variable1 = {}
  docvariable1.xpath('//div[6]/h3/a').each do |link|
    @variable1[link.text.strip] = link['href']
  end
rescue
  # Handle exception
end