Robustly Call a Flaky API: Proper Error Handling with Net::HTTP

Exceptions are meaningful, and Net::HTTP offers specific exceptions for different sorts of cases. So if you want to handle them each in a particular way, you can.

That article says that rescuing those specific exceptions is better and safer than using rescue Exception, and that's very true. BUT rescue Exception is different from a bare rescue, which is equivalent to rescue StandardError and is what you should usually use by default unless you have a reason to do something else.

Rescuing top-level Exception will rescue anything that could possibly happen in the entire execution stack, including some part of Ruby running out of disk or memory or having some obscure system-related IO problem.

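So, as far as "what to rescue" goes, you're generally better off changing your code to use a bare rescue. You'll catch everything you want to, and nothing that you don't. However, in this particular case, there is one lone exception in that guy's list that is NOT a descendant of StandardError (at least on Ruby 1.8, where the output below was produced; since Ruby 1.9, Timeout::Error inherits from RuntimeError and so is a StandardError too):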

def parents(obj)
  ( (obj.superclass ? parents(obj.superclass) : []) << obj)
end

[Timeout::Error, Errno::EINVAL, Errno::ECONNRESET, EOFError, Net::HTTPBadResponse,
  Net::HTTPHeaderSyntaxError, Net::ProtocolError].inject([]) do |a,c|
  parents(c).include?(StandardError) ? a : a << c
end
# Timeout::Error < Interrupt

parents(Timeout::Error)
# [ Object, Exception < Object, SignalException < Exception,
#   Interrupt < SignalException, Timeout::Error < Interrupt ]

So you could change your code to rescue StandardError, Timeout::Error => e and you'll cover all the cases mentioned in that article, and more, but not the stuff that you don't want to cover. (The => e is not required, but more on that below.)

Now, as for your actual technique for dealing with the flaky API -- the question is, what's the problem with the API that you are dealing with? Badly formatted responses? No responses? Is the problem at the HTTP level or in the data you are getting back?

Maybe you don't yet know, or you don't yet care, but you know that retrying tends to get the job done. In that case, I would at least recommend logging the exceptions. Hoptoad (since renamed Airbrake) has a free plan, and has something like Hoptoad.notify(e) -- I can't remember if that's the exact invocation. Or you can email it or log it, using e.message and e.backtrace.
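To make the "retry, but log what went wrong" idea concrete, here is a minimal sketch. The helper name with_retries and its attempts/wait/logger parameters are my own invention for illustration, not part of Net::HTTP or any library, and the URL in the usage comment is a placeholder.

```ruby
require "logger"
require "timeout"

# Hypothetical helper (not a library API): runs the block and retries it
# when a StandardError or Timeout::Error is raised, logging every failure.
# After `attempts` failed tries the last exception is re-raised.
def with_retries(attempts: 3, wait: 0, logger: Logger.new($stderr))
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError, Timeout::Error => e
    logger.warn("attempt #{tries} failed: #{e.class}: #{e.message}")
    logger.debug(e.backtrace.join("\n")) if e.backtrace
    raise if tries >= attempts # give up and let the caller see the error
    sleep(wait)
    retry
  end
end

# Usage sketch (placeholder URL; needs require "net/http"):
#   body = with_retries(attempts: 5, wait: 1) do
#     Net::HTTP.get(URI("http://example.com/flaky"))
#   end
```

Logging the exception class alongside the message is what eventually tells you whether the flakiness is at the HTTP level or in the data, so you can narrow the rescue list later.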

How to specify a read timeout for a Net::HTTP::Post.new request in Ruby 2

Solved via this Stack Overflow answer.

I've changed my

response = Net::HTTP.start(url.host, url.port) {|http| http.request(request)}

line to be

response = Net::HTTP.start(url.host, url.port, :read_timeout => 500) {|http| http.request(request)}

and this seems to have got around the problem. (Note that read_timeout is given in seconds.)
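If you build the connection object yourself rather than using Net::HTTP.start, the same timeouts can be set through accessors. This is only a sketch: build_http is a hypothetical helper name, the default values are arbitrary, and Net::OpenTimeout / Net::ReadTimeout in the usage comment assume Ruby 2.0 or later.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not open) a connection with
# explicit timeouts. Both values are in seconds.
def build_http(url, open_timeout: 5, read_timeout: 30)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # time allowed to establish the connection
  http.read_timeout = read_timeout # time allowed for each read from the socket
  http
end

# Usage sketch (placeholder URL and path):
#   http = build_http("http://example.com/slow", read_timeout: 500)
#   request = Net::HTTP::Post.new("/slow")
#   begin
#     response = http.request(request)
#   rescue Net::OpenTimeout, Net::ReadTimeout => e
#     warn "timed out: #{e.class}"
#   end
```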

Compiling an application for use in highly radioactive environments

Having worked for about 4-5 years on software/firmware development and environmental testing of miniaturized satellites*, I would like to share my experience here.

*(Miniaturized satellites are a lot more prone to single event upsets than bigger satellites because of the relatively small, limited size of their electronic components.)

To be very concise and direct: there is no mechanism by which the software/firmware can recover from a detectable, erroneous situation by itself without at least one copy of a minimum working version of the software/firmware somewhere for recovery purposes, and without the hardware supporting the recovery being functional.

Now, this situation is normally handled at both the hardware and the software level. Here, as you requested, I will share what we can do at the software level.

...recovery purpose... Provide the ability to update/recompile/reflash your software/firmware in the real environment. This is an almost must-have feature for any software/firmware in a highly ionized environment. Without it, you could have as much redundant software/hardware as you want, but at some point it is all going to blow up. So, prepare this feature!
...minimum working version... Keep multiple responsive copies of a minimum version of the software/firmware in your code. This is like Safe Mode in Windows. Instead of having only one fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually be much smaller than the full copy and almost always has only the following two or three features:
1. capable of listening to commands from an external system,
2. capable of updating the current software/firmware,
3. capable of monitoring the basic operation's housekeeping data.
...copy... somewhere... Have redundant software/firmware somewhere.
1. You could, with or without redundant hardware, try to keep redundant software/firmware in your ARM uC. This is normally done by having two or more identical copies of the software/firmware at separate addresses which send heartbeats to each other - but only one is active at a time. If the active copy is known to be unresponsive, switch to another copy. The benefit of this approach is that a functional replacement is available immediately after an error occurs - without any contact with whatever external system/party is responsible for detecting and repairing the error (in the satellite case, this is usually the Mission Control Centre (MCC)).
  Strictly speaking, without redundant hardware, the disadvantage is that you cannot actually eliminate all single points of failure. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), reducing the single points of failure to one without additional hardware is still worth considering. Moreover, the piece of code for the switching would certainly be much smaller than the code for the whole program - significantly reducing the risk of a single event hitting it.
2. But if you are not doing this, you should have at least one copy in your external system which can come into contact with the device and update the software/firmware (in the satellite case, this is again the mission control centre).
3. You could also keep a copy in the device's permanent memory storage, which can be triggered to restore the running system's software/firmware.
...detectable erroneous situation... The error must be detectable, usually by a hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to keep such code small, replicated, and independent of the main software/firmware. Its only task is checking/correcting. If the hardware circuit/firmware is reliable (for example, it is more radiation hardened than the rest, or has multiple circuits/logics), then you might consider doing error correction with it. If it is not, it is better to use it for error detection only; the correction can then be done by an external system/device. For error correction, you could consider a basic error-correcting code such as Hamming or Golay23, because these can be implemented more easily in both circuit and software. But it ultimately depends on your team's capability. For error detection, CRC is normally used.
...hardware supporting the recovery... Now we come to the most difficult aspect of this issue. Ultimately, the recovery requires the hardware responsible for it to be at least functional. If the hardware is permanently broken (which normally happens after its total ionizing dose reaches a certain level), then there is (sadly) no way for the software to help with recovery. Thus, hardware is rightly the concern of utmost importance for a device exposed to high radiation levels (such as a satellite).
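
To illustrate the CRC-based detection mentioned above, here is a minimal sketch of appending a CRC-32 to a block of data and checking it later. It is written in Ruby purely to show the logic; real firmware would do the same thing in C or in hardware. The names seal and unseal and the 4-byte big-endian trailer layout are assumptions of mine, not something from any flight software.

```ruby
require "zlib"

# Hypothetical helper: returns the payload followed by its CRC-32,
# stored as 4 big-endian bytes.
def seal(payload)
  payload.b + [Zlib.crc32(payload)].pack("N")
end

# Hypothetical helper: returns the payload if the stored CRC matches,
# or nil when the frame is too short or corruption is detected.
def unseal(frame)
  return nil if frame.bytesize < 4
  payload = frame.byteslice(0, frame.bytesize - 4)
  stored = frame.byteslice(-4, 4).unpack1("N")
  Zlib.crc32(payload) == stored ? payload : nil
end

# Usage sketch: a single flipped bit is detected (but not corrected).
#   frame = seal("housekeeping")
#   frame.setbyte(3, frame.getbyte(3) ^ 0x01)
#   unseal(frame)  # => nil
```

Note that this only detects the error. What to do next (switch copies, request a re-upload) is the job of the recovery mechanisms described in the list above.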

In addition to the suggestions above for anticipating firmware errors due to single event upsets, I would also like to suggest that you have:

Error detection and/or error correction in the inter-subsystem communication protocol. This is another almost must-have, in order to avoid acting on incomplete/wrong signals received from other systems.
A filter on your ADC readings. Do not use an ADC reading directly. Filter it with a median filter, mean filter, or any other filter - never trust a single reading. Sample more, not less - within reason.
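
As a rough illustration of the ADC advice, here is a sliding median filter. Again this is Ruby used only to show the arithmetic; the function names, the window size of 3 and the sample values are made up for the example.

```ruby
# Median of an odd-sized window of samples.
def median(samples)
  sorted = samples.sort
  sorted[sorted.size / 2]
end

# Hypothetical helper: slides a window over the readings and keeps the
# median of each window, so a single upset value cannot pass through.
def median_filter(readings, window: 3)
  readings.each_cons(window).map { |w| median(w) }
end

# Usage sketch: the 4095 spike (e.g. a corrupted sample) is rejected.
#   median_filter([100, 101, 4095, 102, 103])  # => [101, 102, 103]
```

A mean filter would have let the spike drag the output up, which is why a median is often the safer first choice when isolated wild values are expected.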

Using read with inotify

Basic usage

According to inotify(7), you can use the FIONREAD ioctl to find out how much data is available to be read and size your buffer accordingly. Here's some (very rough) code that can accomplish this:

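#include <sys/inotify.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel how many bytes of events are queued. */
int avail = 0;
if (ioctl(inotify_fd, FIONREAD, &avail) == -1 || avail <= 0)
    return; /* error, or nothing to read yet */

char buffer[avail];
ssize_t len = read(inotify_fd, buffer, avail);
if (len <= 0)
    return; /* read failed or returned no data */

ssize_t offset = 0;
while (offset < len) {
    struct inotify_event *event = (struct inotify_event *)(buffer + offset);

    // Insert logic here
    my_process_inotify_event(event);

    /* Each record is a fixed header followed by event->len name bytes. */
    offset += sizeof(struct inotify_event) + event->len;
}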

More robust usage

inotify-tools provides a higher-level interface to inotify. You can use it instead of accessing inotify directly, or you can look at how it implements inotifytools_next_events to safely and robustly read all available events.

Partial events and truncation

In response to your questions about truncation, I do not think that the kernel will ever return a partial inotify_event or truncate an inotify_event if the buffer given is too small for all events. The following paragraph from the inotify(7) manpage suggests this:

The behavior when the buffer given to read(2) is too small to return information about the next event depends on the kernel version: in kernels before 2.6.21, read(2) returns 0; since kernel 2.6.21, read(2) fails with the error EINVAL.

As do the following comments from inotifytools.c:

// oh... no.  this can't be happening.  An incomplete event.
// Copy what we currently have into first element, call self to
// read remainder.
// oh, and they BETTER NOT overlap.
// Boy I hope this code works.
// But I think this can never happen due to how inotify is written.
