Please Introduce a Multi-Processing Library in Perl or Ruby

With Perl, you have options. One option is to use processes, as below. I would need to look up how to write the analogous program using threads, but http://perldoc.perl.org/perlthrtut.html should give you an idea.

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my @data = (0 .. 19);

my $pm = Parallel::ForkManager->new(4);   # at most four concurrent children

for my $n ( @data ) {
    my $pid = $pm->start and next;        # fork; the parent gets the PID and moves on
    warn sprintf "%d^3 = %d\n", $n, slow_cube($n);
    $pm->finish;                          # the child exits here
}

$pm->wait_all_children;                   # reap any children still running

sub slow_cube {
    my ($n) = @_;

    sleep 1;
    return $n * $n * $n;
}

__END__

The following version, using threads, does not limit the number of threads created (because I do not know how):

#!/usr/bin/perl

use strict;
use warnings;

use threads;

my @data = (0 .. 19);
my @threads = map {
    threads->new( { context => 'list' }, \&slow_cube, $_ )
} @data;

for my $thr ( @threads ) {
    my ( $n, $ncubed ) = $thr->join;
    print "$n^3 = $ncubed\n";
}

sub slow_cube {
    my ($n) = @_;

    sleep 1;
    return $n, $n * $n * $n;
}

__END__

Interestingly, all twenty one-second tasks finish in barely over a second of wall-clock time, because the threads sleep concurrently:

TimeThis :  Command Line :  t.pl
TimeThis : Elapsed Time : 00:00:01.281

Working with multiple processes in Ruby

Combining DRb, which provides simple inter-process communication, with Queue or SizedQueue, which are both threadsafe queues, should give you what you need.
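As a minimal sketch of that combination (the druby URI and port here are arbitrary choices for illustration):

# Server process: expose a threadsafe Queue over DRb.
require 'drb/drb'

queue = Queue.new                    # or SizedQueue.new(10) to bound its size
DRb.start_service('druby://localhost:8787', queue)
DRb.thread.join                      # keep the server alive

# Client process: get a proxy to the shared queue and use it like a local one.
require 'drb/drb'

DRb.start_service
queue = DRbObject.new_with_uri('druby://localhost:8787')
queue.push('some job')               # calls are forwarded to the server's Queue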

You may also want to check out beanstalkd, which is also hosted on GitHub.

What is Ruby's equivalent to Python's multiprocessing module?

I have used https://github.com/grosser/parallel, and like it a lot. It will #map or #each across all the cores in your system by default. Under the hood it's a wrapper around Process.fork, which sounds like what you're asking for.
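For example, a quick sketch of its use (the worker counts here are illustrative):

require 'parallel'

# Forks one worker per CPU core by default; results come back in input order.
cubes = Parallel.map(0..19) { |n| n ** 3 }

# Or cap the number of forked processes explicitly:
cubes = Parallel.map(0..19, in_processes: 4) { |n| n ** 3 }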

Does Ruby have any construct similar to Clojure's pmap for parallel processing?

Here's a simple little example of one way to do this. Note that there's nothing limiting the number of threads it creates at once, so you might want to create some sort of thread pool if you're running lots of threads; a rough sketch of one follows the example.

[1,2,3].map{|x| Thread.start{x+1}}.map{|t| t.join.value}
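Here is one way such a pool could look, built on the threadsafe Queue class; pmap and the pool size of four are made-up names and numbers for illustration:

def pmap(items, pool_size = 4)
  jobs = Queue.new
  items.each_with_index { |item, i| jobs << [item, i] }
  results = Array.new(items.size)
  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        item, i = jobs.pop(true)   # non-blocking pop raises ThreadError when empty
        results[i] = yield(item)   # each job writes to its own slot
      rescue ThreadError
        break                      # queue drained; this worker is done
      end
    end
  end
  workers.each(&:join)
  results
end

pmap([1, 2, 3]) { |x| x + 1 }      # => [2, 3, 4]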

Multithreading for Perl code

Here is a simple example of using threads:

use strict;
use warnings;
use threads;

sub threaded_task {
    threads->create(sub {
        my $thr_id = threads->self->tid;
        print "Starting thread $thr_id\n";
        sleep 2;
        print "Ending thread $thr_id\n";
        threads->detach();   # detach so the thread's resources are freed when it ends
    });
}

while (1) {
    threaded_task();
    sleep 1;
}

This will create a thread every second. The thread itself lasts two seconds.

To learn more about threads, please see the documentation. An important consideration is that variables are not shared between threads. Duplicate copies of all your variables are made when you start a new thread.

If you need shared variables, look into threads::shared.

However, please note that the correct design depends on what you are actually trying to do, which isn't clear from your question.

Some other comments on your code:

  • Always use strict; to help you use best practices in your code.
  • The correct way to declare a lexical variable is my $gg; rather than local $gg;. local doesn't actually create a lexical variable; it gives a localized value to a global variable. It is not something you will need to use very often.
  • Avoid giving subroutines the same name as system functions (e.g. print). This is confusing.
  • It is not recommended to use & before calling subroutines (in your case it was necessary because of conflict with a system function name, but as I said, that should be avoided).

How does event-driven I/O allow multiprocessing?

Hmmm. You (the original poster) and the other answers are, I think, coming at this backwards.

You seem to grasp the event-driven part, but are getting hung up on what happens after an event fires.

The key thing to understand is that a web server generally spends very little time "processing" a request, and a whole lot of time waiting for disk and network I/O.

When a request comes in, the server generally needs to do one of two things: either load a file and send it to the client, or pass the request to something else (classically a CGI script; these days FastCGI is more common, for obvious reasons).

In either case, the server's job is computationally minimal; it's just a middle-man between the client and the disk or the "something else".

That's why these servers use what is called non-blocking I/O.

The exact mechanisms vary from one operating system to another, but the key point is that a read or write request always returns instantly (or near enough). When you try to write, for example, to a socket, the system either immediately accepts what it can into a buffer, or returns something like an EWOULDBLOCK error letting you know it can't take more data right now.

Once the write has been "accepted", the program can make a note of the state of the connection (e.g. "5000 of 10000 bytes sent" or something) and move on to the next connection which is ready for action, coming back to the first after the system is ready to take more data.

This is unlike a normal blocking socket where a big write request could block for quite a while as the OS tries to send data over the network to the client.
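In Ruby terms, the non-blocking pattern looks roughly like this; socket and pending are assumed to be provided by the surrounding event loop:

begin
  sent = socket.write_nonblock(pending)  # returns at once with the bytes accepted
  pending = pending[sent..-1]            # note progress, e.g. "5000 of 10000 sent"
rescue IO::WaitWritable
  # The kernel buffer is full (EWOULDBLOCK); park this connection and come
  # back when select/epoll reports the socket writable again.
end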

In a sense, this isn't really different from what you might do with threaded I/O, but it has much reduced overhead in the form of memory, context switching, and general "housekeeping", and takes maximum advantage of what operating systems do best (or are supposed to, anyway): handle I/O quickly.

As for multi-processor/multi-core systems, the same principles apply. This style of server is still very efficient on each individual CPU. You just need one that will fork multiple instances of itself to take advantage of the additional processors.
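A sketch of that pre-forking pattern in Ruby, assuming a single shared listening socket (the port and worker count are arbitrary):

require 'socket'

server = TCPServer.new(8080)
4.times do                        # say, one worker per core
  fork do
    loop do
      client = server.accept      # each child accepts from the inherited socket
      client.puts "handled by process #{Process.pid}"
      client.close
    end
  end
end
Process.waitall                   # the parent just waits on its workers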

Why are scripting languages (e.g. Perl, Python, and Ruby) not suitable as shell languages?

There are a couple of differences that I can think of; just thoughtstreaming here, in no particular order:

  1. Python & Co. are designed to be good at scripting. Bash & Co. are designed to be only good at scripting, with absolutely no compromise. IOW: Python is designed to be good both at scripting and non-scripting, Bash cares only about scripting.

  2. Bash & Co. are untyped; Python & Co. are strongly typed, which means that the number 123, the string 123, and the file 123 are quite different. They are, however, not statically typed, which means they need different literals for those, in order to keep them apart.

    Example:

                    | Ruby             | Bash
    ----------------+------------------+------
    number          | 123              | 123
    string          | '123'            | 123
    regexp          | /123/            | 123
    file            | File.open('123') | 123
    file descriptor | IO.open('123')   | 123
    URI             | URI.parse('123') | 123
    command         | `123`            | 123
  3. Python & Co. are designed to scale up to 10000, 100000, maybe even 1000000 line programs, Bash & Co. are designed to scale down to 10 character programs.

  4. In Bash & Co., files, directories, file descriptors, and processes are all first-class objects; in Python, only Python objects are first-class. If you want to manipulate files, directories, etc., you have to wrap them in a Python object first.

  5. Shell programming is basically dataflow programming. Nobody realizes that, not even the people who write shells, but it turns out that shells are quite good at that, and general-purpose languages not so much. In the general-purpose programming world, dataflow seems to be mostly viewed as a concurrency model, not so much as a programming paradigm.

I have the feeling that trying to address these points by bolting features or DSLs onto a general-purpose programming language doesn't work. At least, I have yet to see a convincing implementation of it. There is RuSH (Ruby shell), which tries to implement a shell in Ruby, there is rush, which is an internal DSL for shell programming in Ruby, there is Hotwire, which is a Python shell, but IMO none of those come even close to competing with Bash, Zsh, fish and friends.

Actually, IMHO, the best current shell is Microsoft PowerShell, which is very surprising considering that for several decades now, Microsoft has continually had the worst shells evar. I mean, COMMAND.COM? Really? (Unfortunately, they still have a crappy terminal. It's still the "command prompt" that has been around since, what? Windows 3.0?)

PowerShell was basically created by ignoring everything Microsoft has ever done (COMMAND.COM, CMD.EXE, VBScript, JScript) and instead starting from the Unix shell, then removing all backwards-compatibility cruft (like backticks for command substitution) and massaging it a bit to make it more Windows-friendly (like using the now unused backtick as an escape character instead of the backslash, which is the path component separator character in Windows). After that is when the magic happens.

They address problems 1 and 3 from above by basically making the opposite choice compared to Python. Python cares about large programs first, scripting second. Bash cares only about scripting. PowerShell cares about scripting first, large programs second. A defining moment for me was watching a video of an interview with Jeffrey Snover (PowerShell's lead designer), when the interviewer asked him how big of a program one could write with PowerShell and Snover answered without missing a beat: "80 characters." At that moment I realized that this is finally a guy at Microsoft who "gets" shell programming (probably related to the fact that PowerShell was neither developed by Microsoft's programming language group (i.e. lambda-calculus math nerds) nor the OS group (kernel nerds) but rather the server group (i.e. sysadmins who actually use shells)), and that I should probably take a serious look at PowerShell.

Number 2 is solved by having arguments be statically typed. So, you can write just 123 and PowerShell knows whether it is a string or a number or a file, because the cmdlet (which is what shell commands are called in PowerShell) declares the types of its arguments to the shell. This has pretty deep ramifications: unlike Unix, where each command is responsible for parsing its own arguments (the shell basically passes the arguments as an array of strings), argument parsing in PowerShell is done by the shell. The cmdlets specify all their options and flags and arguments, as well as their types and names and documentation(!) to the shell, which then can perform argument parsing, tab completion, IntelliSense, inline documentation popups etc. in one centralized place. (This is not revolutionary, and the PowerShell designers acknowledge shells like the DIGITAL Command Language (DCL) and the IBM OS/400 Command Language (CL) as prior art. For anyone who has ever used an AS/400, this should sound familiar. In OS/400, you can write a shell command and if you don't know the syntax of certain arguments, you can simply leave them out and hit F4, which will bring up a menu (similar to an HTML form) with labelled fields, dropdowns, help texts etc. This is only possible because the OS knows about all the possible arguments and their types.) In the Unix shell, this information is often duplicated three times: in the argument parsing code in the command itself, in the bash-completion script for tab-completion and in the manpage.

Number 4 is solved by the fact that PowerShell operates on strongly typed objects, which includes stuff like files, processes, folders and so on.

Number 5 is particularly interesting, because PowerShell is the only shell I know of, where the people who wrote it were actually aware of the fact that shells are essentially dataflow engines and deliberately implemented it as a dataflow engine.

Another nice thing about PowerShell are the naming conventions: all cmdlets are named Action-Object and moreover, there are also standardized names for specific actions and specific objects. (Again, this should sound familiar to OS/400 users.) For example, everything which is related to receiving some information is called Get-Foo. And everything operating on (sub-)objects is called Bar-ChildItem. So, the equivalent to ls is Get-ChildItem (although PowerShell also provides builtin aliases ls and dir – in fact, whenever it makes sense, they provide both Unix and CMD.EXE aliases as well as abbreviations (gci in this case)).

But the killer feature IMO is the strongly typed object pipelines. While PowerShell is derived from the Unix shell, there is one very important distinction: in Unix, all communication (both via pipes and redirections as well as via command arguments) is done with untyped, unstructured strings. In PowerShell, it's all strongly typed, structured objects. This is so incredibly powerful that I seriously wonder why no one else has thought of it. (Well, they have, but they never became popular.) In my shell scripts, I estimate that up to one third of the commands are only there to act as an adapter between two other commands that don't agree on a common textual format. Many of those adapters go away in PowerShell, because the cmdlets exchange structured objects instead of unstructured text. And if you look inside the commands, then they pretty much consist of three stages: parse the textual input into an internal object representation, manipulate the objects, convert them back into text. Again, the first and third stage basically go away, because the data already comes in as objects.

However, the designers have taken great care to preserve the dynamicity and flexibility of shell scripting through what they call an Adaptive Type System.

Anyway, I don't want to turn this into a PowerShell commercial. There are plenty of things that are not so great about PowerShell, although most of those have to do either with Windows or with the specific implementation, and not so much with the concepts. (E.g. the fact that it is implemented in .NET means that the very first time you start up the shell can take up to several seconds if the .NET framework is not already in the filesystem cache due to some other application that needs it. Considering that you often use the shell for well under a second, that is completely unacceptable.)

The most important point I want to make is that if you want to look at existing work in scripting languages and shells, you shouldn't stop at Unix and the Ruby/Python/Perl/PHP family. For example, Tcl was already mentioned. Rexx would be another scripting language. Emacs Lisp would be yet another. And in the shell realm there are some of the already mentioned mainframe/midrange shells such as the OS/400 command line and DCL. Also, Plan9's rc.

Share variables between Ruby processes

When you fork a process, the child and parent processes' memory is separate, so you cannot share variables between them directly. A singleton class will therefore not work in your case.

The solution is IPC. Ruby supports both pipes and sockets, which are the two most used forms of IPC, at least on *NIX. Ruby also supports distributed objects (DRb), if you need a more transparent interface.

What you choose depends on the job. If you know you will want to split your processes over several computers at some point, go with sockets or DRb. If not, go with pipes.

Here's a short introduction to pipes in Ruby:
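A minimal sketch: a parent reading a child's result through IO.pipe (the payload here is arbitrary):

reader, writer = IO.pipe

pid = fork do
  reader.close                    # the child only writes
  writer.puts 7 ** 3              # send a result back as text
  writer.close
end

writer.close                      # the parent only reads
puts "child said: #{reader.read}" # read returns once the child closes its end
Process.wait(pid)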


