Long Running Delayed_Job Jobs Stay Locked After a Restart on Heroku

TL;DR:

Put this at the top of your job method:

begin
  term_now = false
  old_term_handler = trap('TERM') do
    term_now = true
    old_term_handler.call
  end

AND

Make sure this is called at least once every ten seconds:

if term_now
  puts 'told to terminate'
  return true
end

AND

At the end of your method, put this:

ensure
  trap('TERM', old_term_handler)
end

Explanation:

I was having the same problem and came upon this Heroku article.

The job contained an outer loop, so I followed the article and added a trap('TERM') and an exit call. However, delayed_job sees that as a failure with SystemExit and marks the job as failed.

With SIGTERM now caught by our trap, the worker's own handler is never called; the worker immediately restarts the job and then receives SIGKILL a few seconds later. Back to square one.

I tried a few alternatives to exit:

  • A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.

  • Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.

My final solution is the one given at the top of this answer; it consists of three parts:

  1. Before we start the potentially long job we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true.

    But we must also grab the old_term_handler which the delayed job worker code set (which is returned by trap) and remember to call it.

  2. We must still ensure that we return control to Delayed::Worker with enough time for it to clean up and shut down, so we should check term_now at least every (just under) ten seconds and return if it is true.

    You can either return true or return false depending on whether you want the job to be considered successful or not.

  3. Finally, it is vital to remember to remove your handler and restore Delayed::Worker's original one when you have finished. If you fail to do this, you will keep a dangling reference to the handler we added, which can leak memory if you add another one on top of it (for example, when the worker starts this job again).
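Putting the three parts together, a job's perform method can be sketched as below. This is self-contained for illustration only: the items array and processed list are stand-ins for real units of work, each of which should finish well inside ten seconds. One extra guard is added over the snippets above: outside a delayed_job worker, the previous TERM handler may be the string 'DEFAULT' rather than a proc, so we only call it when it is callable.

```ruby
class LongRunningJob
  attr_reader :processed

  def initialize(items)
    @items = items
    @processed = []
  end

  def perform
    term_now = false
    # Part 1: install our handler, keeping a reference to the worker's old one.
    old_term_handler = trap('TERM') do
      term_now = true
      # Outside delayed_job the previous handler may be the string 'DEFAULT',
      # so only call it when it is actually callable.
      old_term_handler.call if old_term_handler.respond_to?(:call)
    end

    @items.each do |item|
      @processed << item # stand-in for one unit of work
      # Part 2: check the flag often and hand control back to the worker.
      if term_now
        puts 'told to terminate'
        return true # or false, if an interrupted job should not count as a success
      end
    end
    true
  ensure
    # Part 3: always restore the worker's original handler.
    trap('TERM', old_term_handler)
  end
end
```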

How do I handle long running jobs on Heroku?

Here's the proper answer: you listen for SIGTERM (I'm using delayed_job here) and then shut down gracefully. It's important that the jobs are idempotent.

class WithdrawPaymentsJob
  def perform
    begin
      term_now = false
      old_term_handler = trap('TERM') { term_now = true; old_term_handler.call }

      loop do
        puts 'doing long running job'
        sleep 1

        if term_now
          raise 'Gracefully terminating job early...'
        end
      end
    ensure
      trap('TERM', old_term_handler)
    end
  end
end

Here's how you solve it with Que:

if Que.worker_count.zero?
  raise 'Gracefully terminating job early...'
end

Heroku delayed_job workers killed during deployment

Looks like you can configure DJ to handle SIGTERM and mark the in-progress jobs as failed (so they'll be restarted again):

Use this setting to raise an exception on TERM signals by adding this in your initializer:

Delayed::Worker.raise_signal_exceptions = :term

More info in this answer:
https://stackoverflow.com/a/16811844/1715829

Why are my delayed_job jobs re-running even though I tell them not to?

I've discovered I was reading the old/wrong wiki. The correct way to set this is

Delayed::Worker.max_attempts = 1
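If you only want a single attempt for particular jobs rather than globally, delayed_job also consults a max_attempts method defined on the job object itself, which overrides the worker-wide setting. A minimal sketch (OneShotJob is a hypothetical job class):

```ruby
# Sketch: a job that opts out of automatic retries on its own,
# overriding Delayed::Worker.max_attempts for this job class only.
class OneShotJob
  def perform
    # work that must not be retried automatically goes here
  end

  # delayed_job calls this, when defined, instead of the global setting
  def max_attempts
    1
  end
end
```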

How to gracefully restart delayed_job consumers?

Came up with a solution that works.

I have a base class that all of my delayed jobs inherit from called BaseJob:

class BaseJob
  attr_accessor :live_hash

  def before(job)
    # check that this worker is running the current version of the code
    resp = HTTParty.get("#{Rails.application.config.root_url}/revision")
    self.live_hash = resp.body.strip
  end

  def should_perform
    live_hash == GIT_REVISION
  end

  def perform
    safe_perform if should_perform
  end

  def safe_perform
    # override this method in subclasses
  end

  def success(job)
    if should_perform
      # log stats here about a success
    else
      # log stats here about a failure

      # enqueue a new job of the same kind
      new_job = Delayed::Job.new
      new_job.priority = job.priority
      new_job.handler = job.handler
      new_job.queue = job.queue
      new_job.run_at = job.run_at
      new_job.save
      job.delete

      # stop this worker; the cronjob described below spawns a fresh one with new code
      %x(export RAILS_ENV=#{Rails.env} && ./script/delayed_job stop)
    end
  end
end

All job classes inherit from BaseJob and override safe_perform to do their actual work. A few assumptions about the above code:

  • Rails.application.config.root_url points to the root of your app (i.e. www.myapp.com)
  • There is a route exposed called /revision (i.e. www.myapp.com/revision)
  • There is a global constant called GIT_REVISION that your app knows about

What I ended up doing was putting the output of git rev-parse HEAD in a file and pushing that with the code. That gets loaded in upon startup so it's available in the web version as well as in the delayed_job consumers.
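That load-on-startup step can be sketched as an initializer; the file location is an assumption, so adjust it to wherever your deploy writes the output of git rev-parse HEAD:

```ruby
# Sketch: load the deploy-time git revision into a global constant.
# Assumes the deploy ran `git rev-parse HEAD > REVISION` at the app root.
revision_file = File.expand_path('REVISION', Dir.pwd)
GIT_REVISION =
  if File.exist?(revision_file)
    File.read(revision_file).strip
  else
    'unknown' # e.g. locally, where the deploy step has not run
  end
```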

When we deploy code via Capistrano, we no longer stop, start, or restart delayed_job consumers. We install a cronjob on consumer nodes that runs every minute and determines if a delayed_job process is running. If one isn't, then a new one will be spawned.

As a result of all of this, all of the following conditions are met:

  • Pushing code doesn't wait on delayed_job to restart/force kill anymore. Existing jobs that are running are left alone when new code is pushed.
  • When a job begins, we can detect whether the consumer is running old code. If so, the job gets requeued and the consumer kills itself.
  • When a delayed_job dies, a new one is spawned via a cronjob with new code (by the nature of starting delayed_job, it has new code).
  • If you're paranoid about killing delayed_job consumers, install a nagios check that does the same thing as the cron job but alerts you when a delayed_job process hasn't been running for 5 minutes.
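The cronjob's check-and-respawn logic can be sketched in Ruby as follows; the pid-file path and start command are assumptions based on a standard daemonized delayed_job setup:

```ruby
# Sketch of the once-a-minute watchdog run on consumer nodes.
# Returns true only if the pid file points at a live process.
def delayed_job_running?(pid_file = 'tmp/pids/delayed_job.pid')
  return false unless File.exist?(pid_file)
  pid = File.read(pid_file).to_i
  Process.kill(0, pid) # signal 0 checks existence without sending anything
  true
rescue Errno::ESRCH
  false # stale pid file; the worker has died
end

if delayed_job_running?
  puts 'delayed_job is running; nothing to do'
else
  puts 'delayed_job is down; a new worker (with new code) should be started'
  # e.g. system("export RAILS_ENV=production && ./script/delayed_job start")
end
```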

Starting heroku delayed_job workers

If you go to your app on Heroku, under the resource tab, you will see active dynos.

Alternatively, from the terminal, if you run:

$ heroku ps

You will get a list of all your active dynos.

So, provided that there is no worker dyno, you can now add one and scale it up.

Or, if there is a worker dyno but it is scaled to 0 (zero), you can scale it up to one with the following command:

$ heroku ps:scale worker=1

Note, however, that a worker dyno will also work with (or for) a scheduler add-on.
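For reference, the worker dyno comes from a worker entry in the app's Procfile; for delayed_job it is typically:

```
worker: bundle exec rake jobs:work
```

Heroku runs one copy of this command per worker dyno you scale to.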


