Working with Multiple Processes in Ruby

Working with multiple processes in Ruby

Combining DRb, which provides simple inter-process communication, with Queue or SizedQueue, which are both threadsafe queues, should give you what you need.
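A minimal sketch of that combination, assuming a druby URI on localhost; the port, queue bound, and payloads are placeholders:

# server.rb: expose a threadsafe SizedQueue to other processes over DRb
require 'thread'
require 'drb/drb'

queue = SizedQueue.new(100)                 # bounded, threadsafe queue
DRb.start_service('druby://localhost:8787', queue)
DRb.thread.join                             # keep the server process alive

# client.rb: any number of worker processes can share the same queue
require 'drb/drb'

queue = DRbObject.new_with_uri('druby://localhost:8787')
queue.push('a unit of work')                # producers push...
job = queue.pop                             # ...consumers pop, blocking when empty

A nice side effect of SizedQueue is back-pressure: producers block when the queue is full instead of piling up unbounded work.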

You may also want to check out beanstalkd, which is also hosted on GitHub.

Multiple processes management

So I think I can provide some insight into your problem. My dev team uses a home-grown messaging queue that's backed by our database. That means that messages (job metadata) are stored in our messages table.

Our Rails app then creates a daemon process using the daemons gem, which makes instantiating daemon processes much simpler. There's no need to be afraid of what daemon processes are; they are just Linux/Unix processes that run in the background.
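For illustration, a minimal control script using the daemons gem might look like this; the worker script path is an assumption:

# worker_control.rb
require 'daemons'

Daemons.run('script/worker.rb')   # hypothetical path to your polling loop

Running `ruby worker_control.rb start` daemonizes the worker; `stop` and `restart` work the same way.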

You specifically mention that you don't want multiple processes to write to your db. It really sounds like you are concerned about deadlock issues from multiple daemons trying to read/write to the same table (please correct me if you are not, so I can modify my answer).

In order to avoid this issue, you can use row-level locking for your messages table. That way a daemon doesn't have to lock the entire table every time it wants to see if there are any jobs to pick up.
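With ActiveRecord, that might look something like the sketch below. The Message model, the status column, and the SKIP LOCKED clause (PostgreSQL 9.5+/MySQL 8+) are all assumptions:

# Claim one pending message with a row-level lock, skipping rows that
# other daemons have already locked.
Message.transaction do
  message = Message.lock('FOR UPDATE SKIP LOCKED')   # row lock, not a table lock
                   .where(status: 'pending')
                   .first
  if message
    message.update!(status: 'processing')
    # ... do the work described by the message ...
  end
end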

You also mention using 3 processes (I also call them daemons out of habit) to perform a task, then once those three are done, notify another process. You could possibly implement this functionality as a specific/unique message left by your 3 workers.

For example: worker A finishes its job, so it writes a custom message to the special_messages_table. Workers B and C finish their tasks and also write to this table. The entire time these daemons are processing, your third daemon would be polling the special_messages_table to see whether all three jobs have finished. Once it detects that they have, it could then start.
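A rough sketch of that polling daemon; the SpecialMessage model, the job_name column, and the poll interval are all hypothetical:

EXPECTED_JOBS = %w[worker_a worker_b worker_c]

loop do
  finished = SpecialMessage.where(job_name: EXPECTED_JOBS)
                           .distinct
                           .count(:job_name)
  break if finished == EXPECTED_JOBS.size   # all three markers are present
  sleep 5                                   # poll politely
end

# ... kick off the dependent work here ...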

This is just a rough outline of how you can use daemon processes to accomplish what you are asking. If you provide more details I would be happy to refine my answer. Don't be afraid of daemons!

Ruby Multi-Process Synchronization

  1. You can determine which files need to be processed before you spawn child processes.

  2. If you have access to a Redis server, then put those filenames into a Redis list, spawn child processes, and let each one pop a filename at a time and process it until the list is empty (see the sketch below). If you don't have access to Redis, then you can evenly assign the files to the child processes. In neither case do you need a global lock or a mutex.
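A sketch of the Redis variant using the redis gem; the list key, worker count, and process method are assumptions. Redis's LPOP is atomic, which is what makes handing out one file at a time safe without any lock:

require 'redis'

redis = Redis.new
Dir['./queued/*'].each { |f| redis.rpush('files_to_process', f) }

3.times do
  fork do
    r = Redis.new                              # each child needs its own connection
    while (file = r.lpop('files_to_process'))  # atomic pop; nil when the list is empty
      process(file)                            # hypothetical per-file work
    end
  end
end

Process.waitall                                # wait for all children to finish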

What happens when multiple processes try to write the same file?

At the Ruby level, what will happen if multiple processes try to write to the file depends on how the library uses the file: whether and how it locks the file before opening it and what mode it opens the file in. It might just work, it might raise an error, or (most likely, if the library does nothing to handle this situation) multiple writers might silently interleave writes with one another in a way that could corrupt the file, or the last writer might win.
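For reference, locking at the Ruby level looks like the sketch below; the path is a placeholder, and an advisory flock only helps if every writer cooperates by taking the same lock:

File.open('/path/to/shared.log', 'a') do |f|
  f.flock(File::LOCK_EX)                 # block until no other process holds the lock
  f.write("#{Process.pid}: one complete record\n")
end                                      # lock released when the file is closed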

At the Rails level, it depends on how you run Rails. If you run a single, conventionally configured Rails instance on a given server, you won't have any problems, since Rails itself is single-threaded by default. If you run multiple Rails instances (presumably controlled by an application server like Passenger or Unicorn), you might have problems.

Assuming the library doesn't handle multiple writers for you, you can work around it in a couple of ways:

  • Run only one instance of your Rails app on each server (or Docker container or chrooted environment).
  • Fork the library and change it to include the process ID in the file name. That's what I'd do (sketched below).
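A sketch of that second workaround; the path and variable names are illustrative:

# Each process writes to its own file, so writers can never collide.
path = File.join('tmp', "report-#{Process.pid}.csv")
File.write(path, csv_data)               # csv_data is whatever the library produced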

ruby redirect multiple processes output to log

It's fine to use a very similar approach for redirecting your output to the log files, using either backticks or %x if you choose:

`cp -v some/file/that/does/not/exist some/file/name >> /path/to/stdout.log 2>>/path/to/stderr.log`

If you want to pass the values in variables, use:

`ssh root@#{host} 'sh /tmp/scripttorun' >> #{LOGDIR}/#{host}.log 2>&1`

This all assumes you're using SSH keys though. SSH won't make the connection without them, and instead will pause with a prompt to enter your password, causing your code to hang. You'll need to work around that situation using the Net::SSH gem programmatically.
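A sketch of the Net::SSH route, supplying the password programmatically so nothing blocks at an interactive prompt; host and LOGDIR carry over from the example above, and the environment variable is an assumption:

require 'net/ssh'

File.open("#{LOGDIR}/#{host}.log", 'a') do |log|
  Net::SSH.start(host, 'root', password: ENV['SSH_PASSWORD']) do |ssh|
    log.puts ssh.exec!('sh /tmp/scripttorun')   # exec! returns the command's output
  end
end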

To handle them in parallel, I'd recommend looking at EventMachine and EM-SSH.

What do multi-processes VS multi-threaded servers most benefit from?

Unicorn is process based, which means that each instance of Ruby has to exist in its own process. That can be in the area of 500 MB per process, which quickly drains system resources. Puma, being thread based, won't use the same amount of memory to THEORETICALLY attain the same amount of concurrency.

Unicorn, because it runs multiple processes, gets true parallelism between those processes. This is limited by your CPU cores (more cores can run more processes at the same time), but the kernel will switch between active processes, so more than 4 or 8 processes (however many cores you have) can be run. You will be limited by your machine's memory. Until recently, Ruby was not copy-on-write friendly, which meant that EVERY process ended up with its own copy of the inherited memory (Unicorn is a preforking server). Ruby 2.0 is copy-on-write friendly, which could mean that Unicorn won't actually have to load all of the child processes in memory. I'm not 100% clear on this. Read about copy on write, and check out Jesse Storimer's awesome book Working with Unix Processes. I'm pretty sure he covers it in there.

Puma is a threaded server. MRI Ruby, because of the global interpreter lock (GIL), can only run a single CPU-bound task at a time (cf. Ruby Tapas episode 127, parallel fib). It will context switch between the threads, but as long as the work is CPU bound (e.g. data processing) it will only ever run a single thread of execution. This gets interesting if you run your server with a different implementation of Ruby, like JRuby or Rubinius. They do not have the GIL, and can process a great deal of information in parallel. JRuby is pretty speedy, and while Rubinius is slow compared to MRI, multithreaded Rubinius processes data faster than MRI. During blocking IO, however (e.g. writing to a database, making a web request), MRI releases the GIL and context switches to a non-blocked thread, then switches back to the previous thread when the information has been returned.
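A small demo of the GIL's effect on CPU-bound work; on MRI the threaded version takes roughly as long as the serial one, while on JRuby or Rubinius it can genuinely run in parallel. The iteration counts are arbitrary:

require 'benchmark'

def burn
  500_000.times { Math.sqrt(rand) }      # pure CPU work, no IO to release the GIL
end

serial = Benchmark.realtime { 4.times { burn } }

threaded = Benchmark.realtime do
  4.times.map { Thread.new { burn } }.each(&:join)
end

puts format('serial: %.2fs  threaded: %.2fs', serial, threaded)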

For Unicorn, I would say the bottleneck is memory and clock speed. For Puma, I would say the bottleneck is your choice of interpreter (MRI vs. Rubinius or JRuby) and the type of work your server is doing (lots of CPU-bound tasks vs. mostly blocking IO).

There are tons of great resources on this debate. Check out Jesse Storimer's books on these topics, Working with Ruby Threads and Working with Unix Processes; read this quick summary of preforking servers by Ryan Tomayko; and google around for more info.

I don't know the best worker count for Unicorn or Puma in your case. The best thing to do is run performance tests and do what is right for you. There is no one size fits all. (Although I think the Puma standard is a pool of 16 threads, locked at that.)

Foreman start multiple processes?

You could use the -c or --concurrency option and just specify the processes you want to start:

$ foreman start -c process_1=1,process_2=1
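Here process_1 and process_2 must match entries in your Procfile; foreman starts one instance of each name given =1. A hypothetical Procfile for the command above might read:

process_1: bundle exec ruby worker_one.rb
process_2: bundle exec ruby worker_two.rb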

Running multiple ruby processes (data import)

You don't have anything arbitrating between the scripts, or doling out the work, and you need it.

You say the files are for different databases. How do the scripts know which database? Can't you preprocess the queued files and rename them by appending something to the name? Or, have a script that determines which data goes where and then pass the names to sub-scripts that do the loading?

I'd do the latter, and would probably fork the jobs, but threads can do it too. Forking has some advantages, but threads are easier to debug.

You don't specify enough about your system to give you code that will slide in, but this is a general idea of what to do using threads:

require 'thread'

file_queue = Queue.new
Dir['./*'].each { |f| file_queue << f }

consumers = []
2.times do |worker|
  consumers << Thread.new do
    loop do
      begin
        # Non-blocking pop raises ThreadError when the queue is empty;
        # checking empty? and then popping would leave a window for another
        # worker to grab the last file and leave this thread blocked forever.
        data_file = file_queue.pop(true)
      rescue ThreadError
        break
      end
      puts "Worker #{ worker } reading #{ data_file }. Queue size: #{ 1 + file_queue.length }\n"
      num_lines = 0
      File.foreach(data_file) do |li|
        num_lines += 1
      end
      puts "Worker #{ worker } says #{ data_file } contained #{ num_lines } lines.\n"
    end
  end
end

consumers.each { |c| c.join }

Which, after running, shows this in the console:

Worker 1 reading ./blank.yaml. Queue size: 28
Worker 0 reading ./build_links_to_test_files.rake. Queue size: 27
Worker 0 says ./build_links_to_test_files.rake contained 68 lines.
Worker 0 reading ./call_cgi.rb. Queue size: 26
Worker 1 says ./blank.yaml contained 3 lines.
Worker 1 reading ./cgi.rb. Queue size: 25
Worker 0 says ./call_cgi.rb contained 11 lines.
Worker 1 says ./cgi.rb contained 10 lines.
Worker 0 reading ./client.rb. Queue size: 24
Worker 1 reading ./curl_test.sh. Queue size: 23
Worker 0 says ./client.rb contained 19 lines.
Worker 0 reading ./curl_test_all_post_vars.sh. Queue size: 22

That's been trimmed down, but you get the idea.

Ruby's Queue class is the key. It's like an array with icing slathered on it, which arbitrates access to the queue. Think of it this way: "consumers", i.e., Threads, put a flag in the air to receive permission to access the queue. When given that permission, they can pop or shift or modify the queue. Once they're done, the permission is given to the next thread with its flag up.

I use pop out of habit; for Ruby's Queue, pop and shift are actually aliases, and items come off in the order they were added. So if your files have to be loaded in a certain order, sort them before they're added to the queue so that order is set.

We keep the thread objects in consumers so we can join them later. Joining lets the threads complete their tasks before the mother script ends.


