Running multiple background parallel jobs with Rails
Some thoughts...
Just because you need to read 50 sites and naturally want some parallel work does not mean that you need 50 processes or threads. You need to balance the speedup against the overhead; how about having 10 or 20 processes, each reading a few sites?
Depending on which Ruby implementation you are using, be careful about green threads: you may not get the parallelism you want.
You might want to structure it like a reverse, client-side inetd, and use connect_nonblock and IO.select to get the parallel connections you want by making all the servers respond in parallel. You don't really need parallel processing of the results; you just need to get in line at all the servers in parallel, because that is where the latency really is.
So, something like this example from the socket library; extend it for multiple outstanding connections:
require 'socket'
include Socket::Constants

socket   = Socket.new(AF_INET, SOCK_STREAM, 0)
sockaddr = Socket.sockaddr_in(80, 'www.google.com')

begin
  socket.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
  # The connect is in flight; wait until the socket is writable, then
  # call connect_nonblock again -- EISCONN confirms the connection.
  IO.select(nil, [socket])
  begin
    socket.connect_nonblock(sockaddr)
  rescue Errno::EISCONN
    # already connected
  end
end

socket.write("GET / HTTP/1.0\r\n\r\n")
# Here perhaps insert IO.select. You may not need multiple threads OR
# multiple processes with this technique, but if you do, insert them here.
results = socket.read
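A sketch of that extension to multiple outstanding connections. It uses throwaway local TCPServers so it is self-contained and runnable; for real scraping you would substitute each site's host and port 80.

```ruby
require 'socket'

# Three throwaway local servers stand in for the remote sites.
servers = 3.times.map do |i|
  server = TCPServer.new('127.0.0.1', 0)      # port 0 = OS picks a free port
  Thread.new do
    client = server.accept
    client.gets                               # consume the request line
    client.write("reply #{i}\n")
    client.close
  end
  server
end

# Phase 1: start every connect without blocking, so we get in line at
# all the servers at once. Connects still in flight go into `pending`.
pending = {}                                  # socket => sockaddr
conns = servers.map do |server|
  sockaddr = Socket.sockaddr_in(server.addr[1], '127.0.0.1')
  socket = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
  begin
    socket.connect_nonblock(sockaddr)
  rescue IO::WaitWritable                     # EINPROGRESS under the hood
    pending[socket] = sockaddr
  end
  socket
end

# Phase 2: wait for all in-flight connects to finish together.
until pending.empty?
  _, writable, = IO.select(nil, pending.keys)
  writable.each do |socket|
    begin
      socket.connect_nonblock(pending[socket])
    rescue Errno::EISCONN                     # connection established
    end
    pending.delete(socket)
  end
end

# Phase 3: write all requests, then collect the responses.
conns.each { |socket| socket.write("GET / HTTP/1.0\r\n\r\n") }
replies = conns.map(&:read)
```

The key point is that the waiting (phase 2) happens once for all sockets, not once per socket, so total latency is roughly that of the slowest server rather than the sum of all of them.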
Parallel background tasks on single worker dyno
According to @radiospiel's related post, you can use foreman to start multiple processes.
1) Add foreman to your Gemfile
2) Create two files:
Procfile:
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
worker: bundle exec foreman start -f Procfile.workers
Procfile.workers:
dj_worker: bundle exec rake jobs:work
dj_worker: bundle exec rake jobs:work
dj_worker: bundle exec rake jobs:work
I just deployed this to Heroku and it works great.
Running large amount of long running background jobs in Rails
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understand where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs is, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
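On the streaming question above, the difference can be sketched in plain Ruby; the Tempfile here is a generated stand-in for one of your large files.

```ruby
require 'tempfile'

# Build a stand-in "large" file of 10,000 rows.
file = Tempfile.new('big')
10_000.times { |i| file.puts("row #{i}") }
file.flush

# File.read(file.path) would allocate the entire file as one Ruby string.
# File.foreach instead streams one line at a time, keeping memory flat
# no matter how large the file grows.
count = 0
File.foreach(file.path) { |_line| count += 1 }
```

The same idea applies to API responses and database result sets: process them in batches or streams instead of materializing everything at once.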
Look at the job structure itself. I've found that background jobs work best as many smaller jobs rather than one larger job. This allows execution to happen in parallel and be load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes, there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for a single job's execution time to be under 30 seconds. This is arbitrary, but it illustrates what I mean by smaller jobs.
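The many-small-jobs split can be pictured in plain Ruby, with a Queue standing in for your job backend; process_file and the file list are made-up placeholders.

```ruby
# Hypothetical work: 20 files to process.
FILES = (1..20).map { |i| "file_#{i}.csv" }

def process_file(name)
  name.length                      # stand-in for the real per-file work
end

# The "parent" job enqueues one small job per file instead of doing
# all the work itself in one long-running job.
jobs = Queue.new
FILES.each { |f| jobs << f }

# Several workers drain the queue; because each job is small, the load
# stays balanced across workers and retries are cheap.
results = Queue.new
workers = 4.times.map do
  Thread.new do
    while (file = (jobs.pop(true) rescue nil))
      results << process_file(file)
    end
  end
end
workers.each(&:join)
```

With DelayedJob or Sidekiq, the Queue becomes the database table or Redis, and each worker process plays the role of a thread here.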
Some hosting providers like Amazon provide spot instances where you can pay a lower price for servers that do not have guaranteed availability. These pair well with the many-small-jobs approach I mentioned earlier.
Also, Ruby might not be the right tool for the job. There are faster languages, and if you are limited by memory or CPU, you might consider writing these jobs and their workers in another language like JavaScript, Go or Rust. These can pair well with a Ruby stack by offloading computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
How to make multiple parallel concurrent requests with Rails and Heroku
Don't use Resque. Use Sidekiq instead. Resque runs in a single-threaded process, meaning the workers run synchronously, while Sidekiq runs in a multithreaded process, meaning the workers run asynchronously/simultaneously in different threads.
Make sure you assign one URL to scrape per worker; there's no benefit if a single worker scrapes multiple URLs sequentially.
With Sidekiq, you can pass the link to a worker, e.g.
LINKS = [...]
LINKS.each do |link|
  ScrapeWorker.perform_async(link)
end
perform_async doesn't actually execute the job right away. Instead, the link is just put in a queue in Redis along with the worker class and so on, and later (could be milliseconds later) workers are assigned to execute each queued job in its own thread by running the perform instance method of ScrapeWorker. Sidekiq will make sure to retry if an exception occurs during execution of a worker.
PS: You don't have to pass a link to the worker. You can store the links in a table and then pass the ids of the records to the workers.
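What perform_async does can be pictured with an in-memory Queue standing in for Redis. This is only an illustration of the execution model: a real worker class would also include Sidekiq::Worker, and the perform body here is a stand-in rather than real scraping.

```ruby
JOBS = Queue.new                  # stands in for the Redis queue

class ScrapeWorker
  # Enqueue only; returns immediately without running the job.
  # Like Sidekiq, we store the class *name* plus the arguments.
  def self.perform_async(link)
    JOBS << [name, link]
  end

  # Executed later, by a worker thread, once the job is picked up.
  def perform(link)
    "scraped #{link}"             # real code would fetch and parse the page
  end
end

LINKS = %w[https://example.com/a https://example.com/b https://example.com/c]
LINKS.each { |link| ScrapeWorker.perform_async(link) }

# Worker threads drain the queue, each job running perform in its own thread.
results = Queue.new
threads = 3.times.map do
  Thread.new do
    while (job = (JOBS.pop(true) rescue nil))
      klass_name, link = job
      results << Object.const_get(klass_name).new.perform(link)
    end
  end
end
threads.each(&:join)
```

Because only the class name and arguments are queued, the enqueuing web process and the executing worker process can be entirely separate, which is exactly why perform_async returns instantly.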
More info about Sidekiq
Run two scripts in separate directories, in parallel, and in foreground
I believe what you're looking for is something like Foreman
:
https://github.com/ddollar/foreman
http://blog.daviddollar.org/2011/05/06/introducing-foreman.html
You can list all your commands in a Procfile like so:
Procfile.dev:
web: bundle exec rails server
yarn: exec yarn start
And then simply run foreman start -f Procfile.dev in your command line (foreman reads ./Procfile by default, so pass -f to use another file).
Rails delayed job multithreading
Yes, with DelayedJob you could add a method to your User
model like this
def assign_task
  Task.create(user_id: id)
end
and process it delayed like this in your original method:
def assign_tasks
  users.find_each { |user| user.delay.assign_task }
end
Then the task generation for each user will happen in the background and – when there are enough workers – in parallel.
Btw, there are other tools that support processing jobs in the background – the most common nowadays is Sidekiq. They all have slightly different syntax and dependencies. Depending on your Ruby on Rails version and your requirements, you might even want to use ActiveJob, which ships with Rails by default.