Performance of Concurrent Code Using Dispatch_Group_Async Is Much Slower Than Single-Threaded Version

arc4random() uses a critical section while modifying its state. The critical section is very fast in the uncontended case (when the mutex goes from unlocked to locked), but in the contended case (when a thread tries to lock a mutex that is already locked) it has to call into the operating system and put the thread to sleep, which hurts performance significantly.

u_int32_t
arc4random()
{
    u_int32_t rnd;

    THREAD_LOCK();
    arc4_check_init();
    arc4_check_stir();
    rnd = arc4_getword(&rs);
    THREAD_UNLOCK();

    return (rnd);
}

where THREAD_LOCK() is defined as

#define THREAD_LOCK()                               \
    do {                                            \
        if (__isthreaded)                           \
            _pthread_mutex_lock(&arc4random_mtx);   \
    } while (0)

Source: Arc4 random number generator for OpenBSD
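To see that cost directly, here is a minimal C micro-benchmark sketch (not part of the original answer; the chunk counts are arbitrary) that times the same number of arc4random() calls made serially and then from concurrent dispatch_apply iterations:

// Sketch: time arc4random() serially vs. from concurrent iterations.
// The concurrent version is typically much slower because every call
// contends on the same internal mutex. Numbers vary by machine.
#include <stdio.h>
#include <stdlib.h>
#include <dispatch/dispatch.h>
#include <CoreFoundation/CoreFoundation.h>

#define CHUNKS 8
#define CALLS_PER_CHUNK 1000000

int main(void) {
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // Serial baseline: the mutex is always uncontended.
    CFAbsoluteTime t0 = CFAbsoluteTimeGetCurrent();
    for (int i = 0; i < CHUNKS * CALLS_PER_CHUNK; i++)
        (void)arc4random();
    printf("serial:     %f s\n", CFAbsoluteTimeGetCurrent() - t0);

    // Concurrent: all chunks hammer the same mutex.
    t0 = CFAbsoluteTimeGetCurrent();
    dispatch_apply(CHUNKS, q, ^(size_t chunk) {
        for (int i = 0; i < CALLS_PER_CHUNK; i++)
            (void)arc4random();
    });
    printf("concurrent: %f s\n", CFAbsoluteTimeGetCurrent() - t0);
    return 0;
}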

To make it faster, you could create an Arc4Random class that wraps the static arc4_* functions from arc4random.c. You then have a random-number generator that is no longer thread-safe, but you can create one generator per thread. A sketch of that idea follows.
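Since the arc4_* functions are static (private) to arc4random.c, an actual wrapper means compiling your own copy of that file with the locking removed. The sketch below instead shows the shape of such a per-thread generator: the same RC4 keystream idea, with all state held in a caller-owned struct. The stirring/reseeding logic of the real implementation is omitted, and seeding each instance (e.g. from arc4random() itself) is left to the caller:

#include <stdint.h>
#include <stddef.h>

// All generator state lives in this struct; give each thread its own
// instance and no locking is needed.
typedef struct { uint8_t s[256]; uint8_t i, j; } arc4_t;

static void arc4_init(arc4_t *a, const uint8_t *key, size_t keylen) {
    for (int n = 0; n < 256; n++) a->s[n] = (uint8_t)n;
    a->i = a->j = 0;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {            // standard RC4 key schedule
        j = (uint8_t)(j + a->s[n] + key[n % keylen]);
        uint8_t t = a->s[n]; a->s[n] = a->s[j]; a->s[j] = t;
    }
}

static uint8_t arc4_getbyte(arc4_t *a) {       // one RC4 PRGA step
    a->i = (uint8_t)(a->i + 1);
    a->j = (uint8_t)(a->j + a->s[a->i]);
    uint8_t t = a->s[a->i]; a->s[a->i] = a->s[a->j]; a->s[a->j] = t;
    return a->s[(uint8_t)(a->s[a->i] + a->s[a->j])];
}

static uint32_t arc4_getword(arc4_t *a) {      // 32 bits, like arc4random()
    uint32_t w = 0;
    for (int n = 0; n < 4; n++) w = (w << 8) | arc4_getbyte(a);
    return w;
}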

Using Grand Central Dispatch in Swift to parallelize and speed up "for" loops?

A "multi-threaded iteration" can be done with dispatch_apply():

let outerCount = 100   // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}

(You have to figure out the best relation between outer and inner counts.)
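One reasonable starting point is sketched below: derive the outer (concurrent) count from the number of CPUs and compute the inner stride from it. Everything here (the helper name, the 4-chunks-per-core multiplier) is an assumption to be tuned by measurement, not a rule:

#include <unistd.h>
#include <dispatch/dispatch.h>

// Sketch: split `total` iterations into a few chunks per core and run
// the chunks concurrently with dispatch_apply().
static void run_striped(size_t total, void (^body)(size_t)) {
    size_t cpus = (size_t)sysconf(_SC_NPROCESSORS_ONLN);
    size_t outer = cpus * 4;                       // a few chunks per core
    size_t inner = (total + outer - 1) / outer;    // iterations per chunk
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(outer, q, ^(size_t chunk) {
        size_t start = chunk * inner;
        size_t end = start + inner < total ? start + inner : total;
        for (size_t i = start; i < end; i++)
            body(i);
    });
}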

There are two things to notice:

  • arc4random() uses an internal mutex, which makes it extremely slow when called
    from several threads in parallel; see Performance of concurrent code using
    dispatch_group_async is MUCH slower than single-threaded version above. From the
    answers given there, rand_r() (with a separate seed for each thread) seems to be
    a faster alternative.

  • The result variable winner must not be modified from multiple threads simultaneously.
    You can use an array instead, where each thread updates its own element, and the
    results are added afterwards. A thread-safe method is described in
    https://stackoverflow.com/a/26790019/1187415.

Then it would roughly look like this:

let outerCount = 100   // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)

var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in

    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"

        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500

            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability - points) {
                    ++points
                } else {
                    --points
                }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}

// Add results:
let winner = reduce(winners, 0, +)
println(winner)

GCD Poor Performance

Is there a reason you're not using the GCD C API and the dispatch_* family of functions? You don't have much control over the GCD aspects of NSOperationQueue (such as which queue the blocks get submitted to). Also, I can't tell whether you're using iOS or not, but NSOperationQueue does not use GCD on iOS; that might be the reason it spawned so many threads. Either way, your code will be shorter and simpler if you use the GCD API directly:

- (double) detectCollisionsInArray:(NSArray*)objects
{
    int count = [objects count];
    if (count > 0)
    {
        double time = CFAbsoluteTimeGetCurrent();

        dispatch_group_t group = dispatch_group_create();
        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        for (int i = 0; i < count; i++)
        {
            dispatch_group_async(group, queue, ^{
                for (int j = i + 1; j < count; j++)
                {
                    dispatch_group_async(group, queue, ^{
                        /** LOTS AND LOTS OF WORK FOR EACH OBJECT **/
                    });
                }
            });
        }
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        dispatch_release(group);
        return CFAbsoluteTimeGetCurrent() - time;
    }
    return 0;
}

You can use a dispatch group to group all of the executions together and wait for them all to finish with dispatch_group_wait. If you don't care to know when the blocks finish, you can ignore the group part and just use dispatch_async. The dispatch_get_global_queue function will get you one of the 3 concurrent queues (low, default, or high priority) to submit your blocks to. You shouldn't have to worry about limiting the thread count or anything like that; the GCD scheduler is supposed to do all of that for you. Just make sure you submit to a concurrent queue, which can be one of the 3 global queues or a queue you've created by passing DISPATCH_QUEUE_CONCURRENT to dispatch_queue_create (available starting with OS X 10.7 and iOS 5.0).
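For example, here is a fragment sketching a private concurrent queue combined with dispatch_group_notify, for when you'd rather receive a callback than block in dispatch_group_wait. It reuses count from the example above; the queue label is arbitrary, and the manual dispatch_release calls match the pre-ARC style of the answer:

// Sketch: private concurrent queue + a completion callback instead of
// blocking the calling thread in dispatch_group_wait().
dispatch_queue_t queue = dispatch_queue_create("com.example.collisions",
                                               DISPATCH_QUEUE_CONCURRENT);
dispatch_group_t group = dispatch_group_create();

for (int i = 0; i < count; i++) {
    dispatch_group_async(group, queue, ^{
        /** work for object i **/
    });
}

// Runs once on the main queue after every block above has finished.
dispatch_group_notify(group, dispatch_get_main_queue(), ^{
    printf("all collision checks done\n");
    dispatch_release(group);
    dispatch_release(queue);
});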

If you're doing some file I/O in each block or taxing some other resource, you might need to rein in GCD and limit the number of blocks you're submitting to the queue at once. This has the same effect as limiting the concurrent operation count on an NSOperationQueue. You can use a GCD semaphore to do this:

- (double) detectCollisionsInArray:(NSArray*)objects
{
    int count = [objects count];
    if (count > 0)
    {
        double time = CFAbsoluteTimeGetCurrent();

        dispatch_group_t group = dispatch_group_create();
        dispatch_semaphore_t semaphore = dispatch_semaphore_create(10);
        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        for (int i = 0; i < count; i++)
        {
            dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);
            dispatch_group_async(group, queue, ^{
                for (int j = i + 1; j < count; j++)
                {
                    dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);
                    dispatch_group_async(group, queue, ^{
                        /** LOTS AND LOTS OF WORK FOR EACH OBJECT **/
                        dispatch_semaphore_signal(semaphore);
                    });
                }
                dispatch_semaphore_signal(semaphore);
            });
        }
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        dispatch_release(group);
        dispatch_release(semaphore);
        return CFAbsoluteTimeGetCurrent() - time;
    }
    return 0;
}

Once you get the hang of it, GCD is very simple to use. I use it all over my code now.

Can anyone help to put me on the right track and perhaps provide a link to a good GCD tutorial?

Run, don't walk, over to Mike Ash's blog. His series on GCD is the clearest and most concise I've seen, and it'll only take you around 30 minutes to read the whole thing. Apple's WWDC videos from 2010 on GCD and blocks are also pretty good.

What advantage(s) does dispatch_sync have over @synchronized?

Wow. OK -- My original performance assessment was flat out wrong. Color me stupid.

Not so stupid. My performance test was wrong. Fixed. Along with a deep dive into the GCD code.

Update: Code for the benchmark can be found here: https://github.com/bbum/StackOverflow Hopefully, it is correct now. :)

Update2: Added a 10 queue version of each kind of test.

OK. Rewriting the answer:

• @synchronized() has been around for a long time. It is implemented as a hash lookup to find a lock that is then locked. It is "pretty fast" -- generally fast enough -- but can be a burden under high contention (as can any synchronization primitive).

• dispatch_sync() doesn't necessarily require a lock, nor does it require the block to be copied. Specifically, in the fastpath case, the dispatch_sync() will call the block directly on the calling thread without copying the block. Even in the slowpath case, the block won't be copied as the calling thread has to block until execution anyway (the calling thread is suspended until whatever work is ahead of the dispatch_sync() is finished, then the thread is resumed). The one exception is invocation on the main queue/thread; in that case, the block still isn't copied (because the calling thread is suspended and, therefore, using a block from the stack is OK), but there is a bunch of work done to enqueue on the main queue, execute, and then resume the calling thread.

• dispatch_async() requires that the block be copied, as it cannot execute on the current thread, nor can the current thread be blocked (because the block may immediately lock on some thread-local resource that is only made available on the line of code after the dispatch_async()). While expensive, dispatch_async() moves the work off the current thread, allowing it to resume execution immediately.

End result -- dispatch_sync() is faster than @synchronized, but not by a generally meaningful amount (on a '12 iMac and an '11 Mac mini -- the numbers between the two are very different, btw... joys of concurrency). dispatch_async() is slower than both in the uncontended case, but not by much. However, dispatch_async() is significantly faster when the resource is under contention.

@synchronized uncontended add: 0.14305 seconds
Dispatch sync uncontended add: 0.09004 seconds
Dispatch async uncontended add: 0.32859 seconds
Dispatch async uncontended add completion: 0.40837 seconds
Synchronized, 2 queue: 2.81083 seconds
Dispatch sync, 2 queue: 2.50734 seconds
Dispatch async, 2 queue: 0.20075 seconds
Dispatch async 2 queue add completion: 0.37383 seconds
Synchronized, 10 queue: 3.67834 seconds
Dispatch sync, 10 queue: 3.66290 seconds
Dispatch async, 10 queue: 0.19761 seconds
Dispatch async 10 queue add completion: 0.42905 seconds

Take the above with a grain of salt; it is a micro-benchmark of the worst kind in that it does not represent any real-world common usage pattern. The "unit of work" is as follows, and the execution times above represent 1,000,000 executions.

- (void) synchronizedAdd:(NSObject*)anObject
{
    @synchronized(self) {
        [_a addObject:anObject];
        [_a removeLastObject];
        _c++;
    }
}

- (void) dispatchSyncAdd:(NSObject*)anObject
{
    dispatch_sync(_q, ^{
        [_a addObject:anObject];
        [_a removeLastObject];
        _c++;
    });
}

- (void) dispatchASyncAdd:(NSObject*)anObject
{
    dispatch_async(_q, ^{
        [_a addObject:anObject];
        [_a removeLastObject];
        _c++;
    });
}

(_c is reset to 0 at the beginning of each pass and asserted to be == to the # of test cases at the end to ensure that the code is actually executing all the work before spewing the time.)

For the uncontended case:

start = [NSDate timeIntervalSinceReferenceDate];
_c = 0;
for (int i = 0; i < TESTCASES; i++) {
    [self synchronizedAdd:o];
}
end = [NSDate timeIntervalSinceReferenceDate];
assert(_c == TESTCASES);
NSLog(@"@synchronized uncontended add: %2.5f seconds", end - start);

For the contended, 2 queue, case (serial1 and serial2 are serial queues):

#define TESTCASE_SPLIT_IN_2 (TESTCASES/2)
start = [NSDate timeIntervalSinceReferenceDate];
_c = 0;
dispatch_group_async(group, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_BACKGROUND, 0), ^{
    dispatch_apply(TESTCASE_SPLIT_IN_2, serial1, ^(size_t i){
        [self synchronizedAdd:o];
    });
});
dispatch_group_async(group, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_BACKGROUND, 0), ^{
    dispatch_apply(TESTCASE_SPLIT_IN_2, serial2, ^(size_t i){
        [self synchronizedAdd:o];
    });
});
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
end = [NSDate timeIntervalSinceReferenceDate];
assert(_c == TESTCASES);
NSLog(@"Synchronized, 2 queue: %2.5f seconds", end - start);

The above are simply repeated for each work unit variant (no tricksy runtime-y magic in use; copypasta FTW!).


With that in mind:

• Use @synchronized() if you like how it looks. The reality is that if your code is contending on that array, you probably have an architecture issue. Note: using @synchronized(someObject) may have unintended consequences in that it may cause additional contention if the object internally uses @synchronized(self)!

• Use dispatch_sync() with a serial queue if that is your thing. There is no overhead -- it is actually faster in both the contended and uncontended case -- and queues are both easier to debug and easier to profile, in that Instruments and the debugger both have excellent tools for inspecting queues (and they are getting better all the time), whereas debugging locks can be a pain.

• Use dispatch_async() with immutable data for heavily contended resources. For example:

- (void) addThing:(NSString*)thing {
    thing = [thing copy];
    dispatch_async(_myQueue, ^{
        [_myArray addObject:thing];
    });
}

Finally, it shouldn't really matter which one you use for maintaining the contents of an array. The cost of contention is exceedingly high for the synchronous cases. For the asynchronous case, the cost of contention goes way down, but the potential for complexity or weird performance issues goes way up.

When designing concurrent systems, it is best to keep the boundary between queues as small as possible. A big part of that is ensuring that as few resources as possible "live" on both sides of a boundary.
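As a sketch of that principle (hypothetical names, C with GCD blocks): copy whatever crosses the boundary, so neither side ever shares mutable state, and pass only the result back:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dispatch/dispatch.h>

// Sketch: the only things crossing the queue boundary are an immutable
// snapshot (going out) and a single result value (coming back).
static void process_snapshot(const double *data, size_t n, dispatch_queue_t worker) {
    double *snapshot = malloc(n * sizeof *snapshot);   // caller keeps its own buffer
    memcpy(snapshot, data, n * sizeof *snapshot);

    dispatch_async(worker, ^{
        double sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += snapshot[i];
        free(snapshot);
        // Only the computed result crosses back.
        dispatch_async(dispatch_get_main_queue(), ^{
            printf("sum = %f\n", sum);
        });
    });
}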

Why is rand() much faster than arc4random()?

Your results are reasonable.

The two functions are designed for different purposes and are based on completely different classes of pseudorandom number generators.

rand() is a plain pseudorandom number generator (PRNG), generally implemented as a linear congruential generator (LCG). It is reasonably fast, but it has notoriously bad statistical properties.
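For concreteness, a typical LCG step looks like this sketch (the constants are the classic ones from the C standard's example rand(); real libc implementations vary):

#include <stdint.h>

static uint32_t lcg_state = 1;   // srand() would set this

// One LCG step: state = a*state + c (mod 2^32), then discard the
// weakest low-order bits, which cycle with very short periods.
static int lcg_rand(void) {
    lcg_state = lcg_state * 1103515245u + 12345u;
    return (int)((lcg_state >> 16) & 0x7fff);
}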

arc4random() is a cryptographically secure pseudorandom number generator (CSPRNG). It is generally slower, but it is suitable for cryptographic use, which implies a much higher statistical quality of randomness than a plain PRNG. According to the documentation:

The function, when originally implemented in OpenBSD, used the RC4
cipher to produce the output stream. OpenBSD now uses a ChaCha20
stream cipher.

and

RC4 has been designed by RSA Data Security, Inc. It was posted anonymously
to the USENET and was confirmed to be equivalent by several
sources who had access to the original cipher. Since RC4 used to be a
trade secret, the cipher is now referred to as ARC4.


It is important to understand the difference between these two classes of pseudorandom number generators.

A PRNG is designed for purposes where all you care about is (ideally) a high degree of statistical randomness: fields where you want to simulate randomness.

A CSPRNG is designed for purposes where statistical randomness alone is not enough and a high degree of unpredictability (even when the generator algorithm is exactly known) is also required: the field of cryptography.

So you don't need a CSPRNG to generate numbers with a high statistical quality of randomness; all you need is a good PRNG. rand() falls far short of that, and arc4random() is overkill.

See this page for simple, modern, high quality random number generators.
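As one example of what such a generator looks like, here is a sketch of xorshift64* (shift amounts and multiplier from Vigna's paper); it is tiny, fast, non-cryptographic, and statistically far better than an LCG-style rand():

#include <stdint.h>

static uint64_t xs_state = 0x9E3779B97F4A7C15ull;   // any nonzero seed

// xorshift64*: three xor-shifts scramble the state, and a final
// multiplication mixes the output bits.
static uint64_t xorshift64star(void) {
    uint64_t x = xs_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    xs_state = x;
    return x * 0x2545F4914F6CDD1Dull;
}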

Use dispatch_async to analyze an array concurrently in Swift

There are a couple of ways of doing this, but a few observations first:

  • To maximize performance, if you do any concurrent processing, be aware that you are
    not guaranteed the order in which the iterations will complete. Thus a simple
    colorList.append(color) pattern won't work if the order of the results is important.
    You can either prepopulate colorList and have each iteration simply do
    colorList[i] = color, or you can use a dictionary. (Obviously, if order is not
    important, this is not critical.)

  • Because these iterations will run concurrently, you'll need to synchronize your
    updating of colorList. So do the expensive analyzeColors work concurrently on a
    background queue, but use a serial queue for the updating of colorList, to ensure
    you don't have multiple updates stepping over each other.

  • When doing concurrent processing, there are points of diminishing returns. For
    example, taking a complex task and breaking it into 2-4 concurrent loops might yield
    some performance benefit, but if you increase the number of concurrent threads too
    far, the overhead of those threads starts to adversely affect performance. So
    benchmark this with different degrees of concurrency, and don't assume that more
    threads is always better.

In terms of how to achieve this, there are two basic techniques:

  1. If you look at Performing Loop Iterations Concurrently in the Dispatch Queues
     chapter of the Concurrency Programming Guide, it talks about dispatch_apply, which
     is designed precisely for this purpose: running the iterations of a for loop
     concurrently.

    colorList = [Int](count: 8, repeatedValue: 0)  // I don't know what type this `colorList` array is, so initialize this with whatever type makes sense for your app

    let queue = dispatch_get_global_queue(QOS_CLASS_UTILITY, 0)

    let qos_attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, QOS_CLASS_UTILITY, 0)
    let syncQueue = dispatch_queue_create("com.domain.app.sync", qos_attr)

    dispatch_apply(8, queue) { iteration in
        let color = self.photoAnalyzer.analyzeColors(imageStrips[iteration])
        dispatch_sync(syncQueue) {
            colorList[iteration] = color
            return
        }
    }

    // you can use `colorList` here

    Note, while these iterations run concurrently, the whole dispatch_apply loop runs
    synchronously with respect to the queue from which you initiated it. This means that
    you do not want to call the above code from the main thread (we never want to block
    the main thread), so you will likely want to dispatch this whole thing to some
    background queue (a sketch of that wrapper appears after this list).

    By the way, dispatch_apply is discussed in WWDC 2011 video Blocks and Grand Central Dispatch in Practice.

  2. Another common pattern is to create a dispatch group, dispatch the tasks to a
     concurrent queue using that group, and then use dispatch_group_notify to specify
     what should happen when they're all done.

    colorList = [Int](count: 8, repeatedValue: 0)  // I don't know what type this `colorList` array is, so initialize this with whatever type makes sense for your app

    let group = dispatch_group_create()
    let queue = dispatch_get_global_queue(QOS_CLASS_UTILITY, 0)

    let qos_attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, QOS_CLASS_UTILITY, 0)
    let syncQueue = dispatch_queue_create("com.domain.app.sync", qos_attr)

    for i in 0 ..< 8 {
        dispatch_group_async(group, queue) {
            let color = self.photoAnalyzer.analyzeColors(imageStrips[i])
            dispatch_sync(syncQueue) {
                colorList[i] = color
                return
            }
        }
    }

    dispatch_group_notify(group, dispatch_get_main_queue()) {
        // use `colorList` here
    }

    // but not here (because the above code is running asynchronously)

    This approach avoids blocking the main thread altogether, though you have to be careful to not add too many concurrent dispatched tasks (as the worker threads are a very limited resource).
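As noted under the first technique, dispatch_apply runs synchronously on the queue that calls it, so the usual wrapper (sketched here as a fragment in GCD's C API, with placeholder comments for the per-strip work) is to run the whole loop inside dispatch_async and hop back to the main queue at the end:

// Sketch: keep the main thread free by running the synchronous
// dispatch_apply loop inside an async block on a background queue.
dispatch_queue_t bg = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

dispatch_async(bg, ^{
    dispatch_apply(8, bg, ^(size_t i) {
        /* analyze strip i and record its result */
    });
    // dispatch_apply has returned, so every iteration is done.
    dispatch_async(dispatch_get_main_queue(), ^{
        /* update the UI with the collected results */
    });
});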

In both of these examples, I created a dedicated serial queue for synchronizing the updates to colorList. That may be overkill. If you're not blocking the main queue (which you shouldn't do anyway), you could dispatch this synchronization code to the main queue (which is a serial queue). But it's probably cleaner to have a dedicated serial queue for this purpose. And if this were something I was going to interact with from multiple threads constantly, I'd use a reader-writer pattern, as sketched below. But this is probably good enough for this situation.
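The reader-writer pattern mentioned above is typically built from a private concurrent queue: reads are submitted with dispatch_sync and run concurrently with each other, while writes are submitted with dispatch_barrier_async and run exclusively. A minimal sketch with hypothetical names:

#include <dispatch/dispatch.h>

static dispatch_queue_t rwq;     // private concurrent queue
static int shared_value;         // the guarded state

static void rw_init(void) {
    rwq = dispatch_queue_create("com.example.rw", DISPATCH_QUEUE_CONCURRENT);
}

// Readers run concurrently with each other.
static int read_value(void) {
    __block int v;
    dispatch_sync(rwq, ^{ v = shared_value; });
    return v;
}

// A barrier block waits for in-flight readers, runs alone, then lets
// readers resume, so writes never interleave with anything.
static void write_value(int v) {
    dispatch_barrier_async(rwq, ^{ shared_value = v; });
}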


