Long running delayed_job jobs stay locked after a restart on Heroku
TL;DR:
Put this at the top of your job method:

begin
  term_now = false
  old_term_handler = trap('TERM') do
    term_now = true
    old_term_handler.call
  end

AND make sure this is called at least once every ten seconds:

if term_now
  puts 'told to terminate'
  return true
end

AND at the end of your method, put this:

ensure
  trap('TERM', old_term_handler)
end
Explanation:
I was having the same problem and came upon this Heroku article. The job contained an outer loop, so I followed the article and added a trap('TERM') and an exit. However, delayed_job picks that up as a failure with SystemExit and marks the task as failed. With the SIGTERM now trapped by our trap, the worker's handler isn't called; instead the worker immediately restarts the job and then gets SIGKILL a few seconds later. Back to square one.
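The reason delayed_job records this as a failure is that exit does not end the process immediately: it raises SystemExit, which any sufficiently broad rescue (like the one a job runner wraps around your perform method) will catch. A minimal standalone sketch:

```ruby
# `exit` raises SystemExit rather than terminating on the spot,
# so a broad rescue catches it and the runner sees a "failed" job.
def fake_job
  exit  # raises SystemExit
end

captured = nil
begin
  fake_job
rescue Exception => e  # broad rescue, as a job runner uses
  captured = e.class.name
end

puts captured  # => "SystemExit"
```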
I tried a few alternatives to exit:

A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.

Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.
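The exit! behaviour follows from the fact that exit! skips ensure blocks and at_exit handlers entirely, so any cleanup (such as unlocking the job) never runs. A standalone sketch, assuming a Unix-like system with fork:

```ruby
# Demonstrate in a child process that exit! skips `ensure` cleanup.
r, w = IO.pipe
pid = fork do
  r.close
  begin
    exit!(0)            # hard exit: skips ensure and at_exit
  ensure
    w.write("cleanup")  # never runs -- the "job" stays locked
  end
end
w.close
Process.wait(pid)
cleanup_ran = !r.read.empty?
puts cleanup_ran  # => false
```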
My final solution was the one given at the top of my answer; it comprises three parts:

Before we start the potentially long job, we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true. But we must also grab the old_term_handler which the delayed_job worker code set (and which is returned by trap) and remember to call it.

We still must ensure that we return control to Delayed::Worker with sufficient time for it to clean up and shut down, so we should check term_now at least (just under) every ten seconds and return if it is true. You can either return true or return false depending on whether you want the job to be considered successful or not.

Finally, it is vital to remember to remove your handler and install the Delayed::Worker one back when you have finished. If you fail to do this you will keep a dangling reference to the one we added, which can result in a memory leak if you add another one on top of that (for example, when the worker starts this job again).
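The chaining in the second part works because Signal.trap returns the previous handler. A tiny standalone sketch of the mechanism (it traps TERM on the current process, so run it in isolation):

```ruby
calls = []

# Stand-in for the handler the delayed_job worker installs:
Signal.trap("TERM") { calls << :worker }

# Our job's handler chains to the old one, exactly as in the answer:
old = Signal.trap("TERM") { calls << :job; old.call }

Process.kill("TERM", Process.pid)
sleep 0.1  # give Ruby a moment to deliver the signal

puts calls.inspect  # => [:job, :worker]
```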
How do I handle long running jobs on Heroku?
Here's the proper answer: you listen for SIGTERM (I'm using DJ here) and then rescue gracefully. It's important that the jobs are idempotent.
class WithdrawPaymentsJob
  def perform
    begin
      term_now = false
      old_term_handler = trap('TERM') { term_now = true; old_term_handler.call }
      loop do
        puts 'doing long running job'
        sleep 1
        if term_now
          raise 'Gracefully terminating job early...'
        end
      end
    ensure
      trap('TERM', old_term_handler)
    end
  end
end
Here's how you solve it with Que:

if Que.worker_count.zero?
  raise 'Gracefully terminating job early...'
end
Heroku delayed_job workers killed during deployment
Looks like you can configure DJ to handle SIGTERM and mark the in-progress jobs as failed (so they'll be restarted again):
Use this setting to throw an exception on TERM signals by adding this in your initializer:
Delayed::Worker.raise_signal_exceptions = :term
More info in this answer:
https://stackoverflow.com/a/16811844/1715829
Why are my delayed_job jobs re-running even though I tell them not to?
I've discovered I was reading the old/wrong wiki. The correct way to set this is
Delayed::Worker.max_attempts = 1
How to gracefully restart delayed_job consumers?
Came up with a solution that works.
I have a base class that all of my delayed jobs inherit from called BaseJob
:
class BaseJob
  attr_accessor :live_hash

  def before(job)
    # check to make sure that the version of code here is the right version of code
    resp = HTTParty.get("#{Rails.application.config.root_url}/revision")
    self.live_hash = resp.body.strip
  end

  def should_perform
    live_hash == GIT_REVISION
  end

  def perform
    safe_perform if should_perform
  end

  def safe_perform
    # override this method in subclasses
  end

  def success(job)
    if should_perform
      # log stats here about a success
    else
      # log stats here about a failure
      # enqueue a new job of the same kind
      new_job = Delayed::Job.new
      new_job.priority = job.priority
      new_job.handler = job.handler
      new_job.queue = job.queue
      new_job.run_at = job.run_at
      new_job.save
      job.delete
      # restart the delayed job system
      %x(export RAILS_ENV=#{Rails.env} && ./script/delayed_job stop)
    end
  end
end
All job classes inherit from BaseJob and override safe_perform to actually do their work. A few assumptions about the above code:
- Rails.application.config.root_url points to the root of your app (i.e. www.myapp.com)
- There is a route exposed called /revision (i.e. www.myapp.com/revision)
- There is a global constant called GIT_REVISION that your app knows about
What I ended up doing was putting the output of git rev-parse HEAD
in a file and pushing that with the code. That gets loaded in upon startup so it's available in the web version as well as in the delayed_job consumers.
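A sketch of that revision mechanism (file name and paths are illustrative): the deploy step writes the output of git rev-parse HEAD to a file, and the app reads it into the constant at boot:

```ruby
require "tmpdir"

Dir.mktmpdir do |app_root|
  # Deploy step (normally: git rev-parse HEAD > REVISION):
  File.write(File.join(app_root, "REVISION"), "a1b2c3d\n")

  # At boot (e.g. in a Rails initializer), load it into the global constant:
  GIT_REVISION = File.read(File.join(app_root, "REVISION")).strip
end

puts GIT_REVISION  # => "a1b2c3d"
```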
When we deploy code via Capistrano, we no longer stop, start, or restart delayed_job consumers. We install a cronjob on consumer nodes that runs every minute and determines if a delayed_job process is running. If one isn't, then a new one will be spawned.
As a result of all of this, all of the following conditions are met:
- Pushing code doesn't wait on delayed_job to restart/force kill anymore. Existing jobs that are running are left alone when new code is pushed.
- We can detect when a job begins if the consumer is running old code. The job gets requeued and the consumer kills itself.
- When a delayed_job dies, a new one is spawned via a cronjob with new code (by the nature of starting delayed_job, it has new code).
- If you're paranoid about killing delayed_job consumers, install a nagios check that does the same thing as the cron job but alerts you when a delayed_job process hasn't been running for 5 minutes.
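The cron check described above might look something like this (paths, environment, and start script location are hypothetical; pgrep -f matches the process by its command line):

```
# crontab on each consumer node: respawn delayed_job if it isn't running
* * * * * pgrep -f delayed_job > /dev/null || (cd /var/www/app && RAILS_ENV=production ./script/delayed_job start)
```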
Starting heroku delayed_job workers
If you go to your app on Heroku, under the Resources tab, you will see your active dynos.
Alternatively, from the terminal, if you run:
$ heroku ps
You will get a list of all your active dynos.
So, provided that there is no worker dyno, you can now add one and scale it up.
Or if there is a worker dyno but at a 0
(zero) scale, then you can scale it up to one with your command above:
$ heroku ps:scale worker=1
Note, however, that a worker dyno will also work with/for a scheduler add-on.
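For a worker dyno to exist at all, the app needs a worker entry in its Procfile; a typical one for delayed_job (using its standard jobs:work rake task) is:

```
# Procfile
worker: bundle exec rake jobs:work
```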