How to Handle 404 Errors with Ruby Net::HTTP

How to handle 404 errors with Ruby Net::HTTP?

Rewrite your code like this:

require 'net/http'
require 'uri'

uri = URI.parse(url)
result = Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.path) }
puts result.code
puts result.body

That will print the status code followed by the body.
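Instead of comparing the numeric code as a string, you can also branch on the response class: Net::HTTP wraps each status in a Net::HTTPResponse subclass such as Net::HTTPNotFound. A minimal sketch (the describe_response helper is a hypothetical name, not part of Net::HTTP):

```ruby
require 'net/http'

# Classify a Net::HTTPResponse by its class rather than parsing result.code.
# `describe_response` is an illustrative helper, not part of Net::HTTP.
def describe_response(response)
  case response
  when Net::HTTPSuccess     then "ok: #{response.code}"
  when Net::HTTPNotFound    then "missing: #{response.code}"
  when Net::HTTPClientError then "client error: #{response.code}"
  else "other: #{response.code}"
  end
end
```

In a real run you would pass the `result` from the Net::HTTP.start call above straight into this helper; the when clauses match by class hierarchy, so Net::HTTPNotFound must be listed before its parent Net::HTTPClientError.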

Getting the contents of a 404 error page response in Ruby

This doesn't seem to be stated clearly enough in the docs, but OpenURI::HTTPError has an io attribute, which you can treat as a read-only file as far as I know.

require 'open-uri'

begin
  response = open('http://google.com/blahblah')
rescue OpenURI::HTTPError => e
  puts e              # Error message
  puts e.io.status    # HTTP error code
  puts e.io.readlines # HTTP response body
end

How to handle 404 not found errors in Nokogiri

It doesn't work because the part of the code that raises the error on a 404 status, the open(url) call, is not inside the begin/rescue block. The following code should work:

require 'open-uri'
require 'nokogiri'

url = 'http://yoursite/page/38475'
begin
  file = open(url)
  doc = Nokogiri::HTML(file)
  # handle doc
rescue OpenURI::HTTPError => e
  if e.message == '404 Not Found'
    # handle 404 error
  else
    raise e
  end
end

BTW, about rescuing Exception:
Why is it a bad style to `rescue Exception => e` in Ruby?
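The short version of that linked answer: a bare rescue only catches StandardError and its subclasses, so non-recoverable errors like SignalException or NoMemoryError escape it, whereas rescue Exception swallows them too. A small illustration, using ScriptError as a stand-in for any non-StandardError:

```ruby
# A bare `rescue` only catches StandardError and its subclasses.
# ScriptError sits outside that hierarchy, so it escapes a bare
# rescue but is swallowed by `rescue Exception`.

def bare_rescue
  raise ScriptError, 'boom'
rescue => e
  :caught
end

def exception_rescue
  raise ScriptError, 'boom'
rescue Exception => e
  :caught
end
```

Calling bare_rescue lets the ScriptError propagate to the caller, while exception_rescue quietly returns :caught, which is exactly why rescuing Exception can hide problems you want to see.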

How should my scraping stack handle 404 errors?

TL;DR

Use out-of-band error handling and a different conceptual scraping model to speed up operations.

Exceptions Are Not for Common Conditions

There are a number of other answers that address how to handle exceptions for your use case. I'm taking a different approach by saying that handling exceptions is fundamentally the wrong approach here for a number of reasons.

  1. In his book Exceptional Ruby, Avdi Grimm provides benchmarks showing exception handling to be roughly 156% slower than alternative coding techniques such as early returns.

  2. In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!

In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.

One Alternative: A Faster, More Atomic Process

You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.

Consider this example schema:

ActiveRecord::Schema.define(:version => 20120718124422) do
  create_table "webcrawls", :force => true do |t|
    t.text     "raw_html"
    t.integer  "retries"
    t.integer  "status_code"
    t.text     "parsed_data"
    t.datetime "created_at", :null => false
    t.datetime "updated_at", :null => false
  end
end

The idea here is that you would simply treat the entire scrape as an atomic process. For example:

  • Did you get the page?

    Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.

  • Did you get a 404?

    Fine, store the error page and the status code. Move on quickly!

When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape--"appropriate action" is up to you.

By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.

This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.
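A sketch of this "status codes as data" idea, assuming records shaped like the webcrawls schema above; the crawl helper and the retry thresholds in next_action are illustrative choices, not from the original answer:

```ruby
require 'net/http'
require 'uri'

# Fetch a page and return a plain hash shaped like a webcrawls row.
# A 404 comes back as a normal result with its status code stored,
# not as a raised exception.
def crawl(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.path) }
  { status_code: response.code.to_i, raw_html: response.body, retries: 0 }
end

# Decide what to do with a stored result later, out of band.
# The threshold of 3 retries is illustrative; tune it to your own
# failure patterns.
def next_action(status_code, retries)
  return :parse if status_code == 200
  return :retry if status_code == 404 && retries < 3
  :drop
end
```

The crawl loop itself stays fast because it never blocks on retries; a separate pass queries for stored 404s and feeds each one through next_action to retry, drop, or log it.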

Final Thoughts: Scaling Up and Out

Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:

  • Forking background processes.
  • Using dRuby to split work among multiple processes or machines.
  • Maximizing core usage by spawning multiple external processes using GNU parallel.
  • Something else that isn't a monolithic, sequential process.

Optimizing the application logic should suffice for the common case; if it doesn't, consider scaling up to more processes or out to more servers. Scaling out will certainly be more work, but it will also expand the processing options available to you.

Why do I get 404 with a valid url using Net::HTTP.post_form?

I could not solve the problem using the Net::HTTP library alone; however, the Mechanize gem, as the Tin Man suggested in the comments, solves the problem and successfully posts the data to the server.

It is also more flexible and easier to use when it comes to following redirects, so if anyone runs into this problem like I did, I recommend using the Mechanize gem.

Mechanize HTTP Not found 404 Link

You could use the Mechanize::ResponseCodeError exception:

This error is raised when Mechanize encounters a response code it does
not know how to handle. Currently, this exception will be thrown if
Mechanize encounters response codes other than 200, 301, or 302. Any
other response code is up to the user to handle.

Then move the rescue inside the each block; this way you visit each URL, save the image, and, in case the resource can't be found, print the response code.

require 'mechanize'

agent = Mechanize.new

[
  'http://www.rockauto.com/Images/whatsnew1.jpg?1512928800',
  'http://www.rockauto.com/info/915/FCA6366_Fronp__ra_p.jpg',
  'http://www.rockauto.com/Images/whatsnew2.jpg?1512928800'
].each do |url|
  begin
    agent.get(url).save
  rescue Mechanize::ResponseCodeError => e
    puts e.response_code
  end
end

Two of the URLs work and the one in the middle doesn't, so you should get the two images corresponding to the working URLs.

Handling different classes of 404 errors

Probably the easiest approach is to catch everything and look at the message:

def update_ended
  fetch_page(...)
rescue Exception => e
  case e.message
  when /404/ then puts '404!'
  when /500/ then puts '500!'
  else puts 'IDK!'
  end
end

Ruby Open-URI library aborts on a 404 HTTP error code

Rescuing the OpenURI::HTTPError is perfectly reasonable. See this related answer: https://stackoverflow.com/a/7495389/289274

But if you would rather not deal with exception handling, here's how you can do it with Net::HTTP:
https://stackoverflow.com/a/2797443/289274

Net::HTTP vs REST Client gem: How do they handle bad websites / 404

From the GitHub README:

  • for result codes between 200 and 207, a RestClient::Response will be returned
  • for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
  • for result code 303, the redirection will be followed and the request transformed into a GET
  • for other cases, a RestClient::Exception holding the Response will be raised; a specific exception class will be thrown for known error codes
  • call .response on the exception to get the server's response

So yes, this is the expected behavior; the response object can be retrieved with e.response.

How to redirect to a 404 in Rails?

Don't render the 404 yourself; there's no reason to, since Rails has this functionality built in already. If you want to show a 404 page, create a render_404 method (or not_found, as I called it) in ApplicationController like this:

def not_found
  raise ActionController::RoutingError.new('Not Found')
end

Rails also handles AbstractController::ActionNotFound, and ActiveRecord::RecordNotFound the same way.

This does two things better:

1) It uses Rails' built-in rescue_from handler to render the 404 page, and
2) it interrupts the execution of your code, letting you do nice things like:

user = User.find_by_email(params[:email]) or not_found
user.do_something!

without having to write ugly conditional statements.
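The `or not_found` trick works because `or` binds looser than assignment and short-circuits: if the lookup returns nil, control falls through to the raising method. A plain-Ruby sketch of the pattern, using KeyError as a stand-in for ActionController::RoutingError:

```ruby
# Stand-in for the Rails not_found helper; raising interrupts execution
# just like ActionController::RoutingError would in a controller.
def not_found
  raise KeyError, 'Not Found'
end

def find_user(users, email)
  # `or` binds looser than `=`, so this parses as
  # (user = users[email]) or not_found
  user = users[email] or not_found
  user
end
```

When the email is present, not_found is never called; when the lookup returns nil, the exception fires before any later line can touch a nil user.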

As a bonus, it's also super easy to handle in tests. For example, in an RSpec integration test:

# RSpec 1

lambda {
  visit '/something/you/want/to/404'
}.should raise_error(ActionController::RoutingError)

# RSpec 2+

expect {
  get '/something/you/want/to/404'
}.to raise_error(ActionController::RoutingError)

And minitest:

assert_raises(ActionController::RoutingError) do
  get '/something/you/want/to/404'
end

For more information, see Rails render 404 not found from a controller action.


