Running multiple background parallel jobs with Rails
Some thoughts...
Just because you need to read 50 sites and naturally want some parallel work does not mean that you need 50 processes or threads. You need to balance the speedup against the overhead; how about having 10 or 20 processes, each reading a few sites?
Depending on which Ruby implementation you are using, be careful about green threads: you may not get the parallelism you want.
You might want to structure it like a reverse, client-side inetd, and use connect_nonblock and IO.select to get the parallel connections you want by making all the servers respond in parallel. You don't really need parallel processing of the results; you just need to get in line at all the servers in parallel, because that is where the latency really is.
So, something like this example from the socket library; extend it for multiple outstanding connections:
require 'socket'
include Socket::Constants

socket   = Socket.new(AF_INET, SOCK_STREAM, 0)
sockaddr = Socket.sockaddr_in(80, 'www.google.com')

begin
  socket.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
  # The connect is in flight; wait until the socket is writable, then
  # call connect_nonblock again -- EISCONN confirms the connection.
  IO.select(nil, [socket])
  begin
    socket.connect_nonblock(sockaddr)
  rescue Errno::EISCONN
    # already connected
  end
end

socket.write("GET / HTTP/1.0\r\n\r\n")
# Here perhaps insert IO.select. You may not need multiple threads OR
# multiple processes with this technique, but if you do, insert them here.
results = socket.read
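A sketch of that extension to multiple outstanding connections. It uses throwaway local TCPServers so it is self-contained and runnable; for real scraping you would substitute each site's host and port 80.

```ruby
require 'socket'

# Three throwaway local servers stand in for the remote sites.
servers = 3.times.map do |i|
  server = TCPServer.new('127.0.0.1', 0)      # port 0 = OS picks a free port
  Thread.new do
    client = server.accept
    client.gets                               # consume the request line
    client.write("reply #{i}\n")
    client.close
  end
  server
end

# Phase 1: start every connect without blocking, so we get in line at
# all the servers at once. Connects still in flight go into `pending`.
pending = {}                                  # socket => sockaddr
conns = servers.map do |server|
  sockaddr = Socket.sockaddr_in(server.addr[1], '127.0.0.1')
  socket = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
  begin
    socket.connect_nonblock(sockaddr)
  rescue IO::WaitWritable                     # EINPROGRESS under the hood
    pending[socket] = sockaddr
  end
  socket
end

# Phase 2: wait for all in-flight connects to finish together.
until pending.empty?
  _, writable, = IO.select(nil, pending.keys)
  writable.each do |socket|
    begin
      socket.connect_nonblock(pending[socket])
    rescue Errno::EISCONN                     # connection established
    end
    pending.delete(socket)
  end
end

# Phase 3: write all requests, then collect the responses.
conns.each { |socket| socket.write("GET / HTTP/1.0\r\n\r\n") }
replies = conns.map(&:read)
```

The key point is that the waiting (phase 2) happens once for all sockets, not once per socket, so total latency is roughly that of the slowest server rather than the sum of all of them.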
Parallel background tasks on single worker dyno
According to @radiospiel's related post, you can use foreman to start multiple processes.
1) Add foreman to your Gemfile
2) Create two files:
Procfile:
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
worker: bundle exec foreman start -f Procfile.workers
Procfile.workers:
dj_worker: bundle exec rake jobs:work
dj_worker: bundle exec rake jobs:work
dj_worker: bundle exec rake jobs:work
I just deployed this to Heroku and it works great.
Running large amount of long running background jobs in Rails
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understand where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs is, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
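On the streaming question above, the difference can be sketched in plain Ruby; the Tempfile here is a generated stand-in for one of your large files.

```ruby
require 'tempfile'

# Build a stand-in "large" file of 10,000 rows.
file = Tempfile.new('big')
10_000.times { |i| file.puts("row #{i}") }
file.flush

# File.read(file.path) would allocate the entire file as one Ruby string.
# File.foreach instead streams one line at a time, keeping memory flat
# no matter how large the file grows.
count = 0
File.foreach(file.path) { |_line| count += 1 }
```

The same idea applies to API responses and database result sets: process them in batches or streams instead of materializing everything at once.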
Look at the job structure itself. I've found that background jobs work best as many smaller jobs rather than one larger job. This allows execution to happen in parallel and be load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes, there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for a single job's execution time to be under 30 seconds. This is arbitrary, but it illustrates what I mean by smaller jobs.
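The many-small-jobs split can be pictured in plain Ruby, with a Queue standing in for your job backend; process_file and the file list are made-up placeholders.

```ruby
# Hypothetical work: 20 files to process.
FILES = (1..20).map { |i| "file_#{i}.csv" }

def process_file(name)
  name.length                      # stand-in for the real per-file work
end

# The "parent" job enqueues one small job per file instead of doing
# all the work itself in one long-running job.
jobs = Queue.new
FILES.each { |f| jobs << f }

# Several workers drain the queue; because each job is small, the load
# stays balanced across workers and retries are cheap.
results = Queue.new
workers = 4.times.map do
  Thread.new do
    while (file = (jobs.pop(true) rescue nil))
      results << process_file(file)
    end
  end
end
workers.each(&:join)
```

With DelayedJob or Sidekiq, the Queue becomes the database table or Redis, and each worker process plays the role of a thread here.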
Some hosting providers like Amazon provide spot instances where you can pay a lower price for servers that do not have guaranteed availability. These pair well with the many-small-jobs approach I mentioned earlier.
Also, Ruby might not be the right tool for the job. There are faster languages, and if you are limited by memory or CPU, you might consider writing these jobs and their workers in another language like JavaScript, Go or Rust. These can pair well with a Ruby stack by offloading computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
How to make multiple parallel concurrent requests with Rails and Heroku
Don't use Resque. Use Sidekiq instead. Resque runs in a single-threaded process, meaning the workers run synchronously, while Sidekiq runs in a multithreaded process, meaning the workers run asynchronously/simultaneously in different threads.
Make sure you assign one URL to scrape per worker; there's no benefit if a single worker scrapes multiple URLs sequentially.
With Sidekiq, you can pass the link to a worker, e.g.
LINKS = [...]
LINKS.each do |link|
  ScrapeWorker.perform_async(link)
end
perform_async doesn't actually execute the job right away. Instead, the link is just put in a queue in Redis along with the worker class and so on, and later (could be milliseconds later) workers are assigned to execute each queued job in its own thread by running the perform instance method of ScrapeWorker. Sidekiq will make sure to retry if an exception occurs during execution of a worker.
PS: You don't have to pass a link to the worker. You can store the links in a table and then pass the ids of the records to the workers.
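What perform_async does can be pictured with an in-memory Queue standing in for Redis. This is only an illustration of the execution model: a real worker class would also include Sidekiq::Worker, and the perform body here is a stand-in rather than real scraping.

```ruby
JOBS = Queue.new                  # stands in for the Redis queue

class ScrapeWorker
  # Enqueue only; returns immediately without running the job.
  # Like Sidekiq, we store the class *name* plus the arguments.
  def self.perform_async(link)
    JOBS << [name, link]
  end

  # Executed later, by a worker thread, once the job is picked up.
  def perform(link)
    "scraped #{link}"             # real code would fetch and parse the page
  end
end

LINKS = %w[https://example.com/a https://example.com/b https://example.com/c]
LINKS.each { |link| ScrapeWorker.perform_async(link) }

# Worker threads drain the queue, each job running perform in its own thread.
results = Queue.new
threads = 3.times.map do
  Thread.new do
    while (job = (JOBS.pop(true) rescue nil))
      klass_name, link = job
      results << Object.const_get(klass_name).new.perform(link)
    end
  end
end
threads.each(&:join)
```

Because only the class name and arguments are queued, the enqueuing web process and the executing worker process can be entirely separate, which is exactly why perform_async returns instantly.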
More info about Sidekiq
Run two scripts in separate directories, in parallel, and in foreground
I believe what you're looking for is something like Foreman
:
https://github.com/ddollar/foreman
http://blog.daviddollar.org/2011/05/06/introducing-foreman.html
You can list all your commands in a Procfile like so:
Procfile.dev:
web: bundle exec rails server
yarn: exec yarn start
And then simply run foreman start -f Procfile.dev in your command line (foreman reads ./Procfile by default, so pass -f to use another file).
Rails delayed job multithreading
Yes, with DelayedJob you could add a method to your User
model like this
def assign_task
  Task.create(user_id: id)
end
and process it delayed like this in your original method:
def assign_tasks
  users.find_each { |user| user.delay.assign_task }
end
Then the task generation for each user will happen in the background and – when there are enough workers – in parallel.
Btw, there are other tools that support processing jobs in the background – the most common nowadays is Sidekiq. They all have slightly different syntax and dependencies. Depending on your Ruby on Rails version and your requirements, you might even want to use ActiveJob, which ships with Rails by default.