Swift 3 Parallel For/Map Loop

How to implement a parallel map in Swift

You can try something like this:

import Foundation

extension Collection {
    // EZSE: A parallelized map for collections. Each element's transform runs
    // concurrently; note that the call itself blocks until all results are ready.
    public func pmap<R>(_ each: @escaping (Self.Iterator.Element) -> R) -> [R] {
        let totalCount = indices.count
        var res = ContiguousArray<R?>(repeating: nil, count: totalCount)

        let queue = OperationQueue()
        queue.maxConcurrentOperationCount = totalCount

        let lock = NSRecursiveLock()
        indices
            .enumerated()
            .forEach { index, elementIndex in
                queue.addOperation {
                    let temp = each(self[elementIndex])
                    lock.lock()
                    res[index] = temp
                    lock.unlock()
                }
            }
        queue.waitUntilAllOperationsAreFinished()

        return res.map { $0! }
    }
}

Swift performance: map() and reduce() vs for loops

Shouldn't the built-in Array methods be faster than the naive approach
for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.

I just want to address this part of the question at a conceptual level, with a "not necessarily". My perspective comes more from a background in compiler design and computer architecture than from deep knowledge of the nature of Swift's optimizer.

Calling Overhead

Because functions like map and reduce accept functions as inputs, they place a greater strain on the optimizer, to put it one way. Short of some very aggressive optimization, the natural temptation is to constantly branch back and forth between the implementation of, say, map, and the closure you provided, and likewise to transmit data across these disparate branches of code (through registers and stack, typically).

That kind of branching/calling overhead is very difficult for the optimizer to eliminate, especially given the flexibility of Swift's closures (not impossible, but conceptually quite difficult). C++ optimizers can inline function object calls, but with far more restrictions and code generation techniques required: the compiler effectively has to generate a whole new set of instructions for map for each type of function object you pass in, with the explicit aid of the programmer in the form of a function template used for the code generation.

So it shouldn't come as a great surprise that your hand-rolled loops can perform faster -- they put far less strain on the optimizer. Some people cite that these higher-order functions should be able to go faster because the vendor can do things like parallelize the loop, but effectively parallelizing the loop would first require the kind of information that typically lets the optimizer inline the nested function calls to the point where they become as cheap as hand-rolled loops. Otherwise, the function/closure you pass in is effectively opaque to functions like map/reduce: they can only call it and pay the overhead of doing so, and they cannot parallelize it, since they cannot assume anything about its side effects and thread-safety.

Of course this is all conceptual -- Swift may be able to optimize these cases in the future, or it may already be able to do so now (see -Ofast, a commonly-cited way to make Swift go faster at the cost of some safety). But using these kinds of functions does place a heavier strain on the optimizer than hand-rolled loops, at the very least, and the time differences you're seeing in the first benchmark seem to reflect the kind of differences one might expect with this additional calling overhead. The best way to find out is to look at the assembly and try various optimization flags.

Standard Functions

That's not to discourage the use of such functions. They express intent more concisely and can boost productivity, and relying on them could allow your codebase to get faster in future versions of Swift without any involvement on your part. But they aren't necessarily always going to be faster. It is a good general rule that a higher-level library function which more directly expresses what you want to do will be faster, but there are always exceptions -- best discovered in hindsight with a profiler in hand, since it's far better to err on the side of trust than distrust here.

Artificial Benchmarks

As for your second benchmark, it is almost certainly a result of the compiler optimizing away code that has no side effects affecting user output. Artificial benchmarks are notoriously misleading because of what optimizers do to eliminate irrelevant side effects (essentially, side effects that don't affect user output). So when you construct benchmarks with times that seem too good to be true, be careful that they aren't the result of the optimizer merely skipping all the work you actually wanted to benchmark. At the very least, you want your tests to output some final result gathered from the computation.
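To make this concrete, here is a minimal benchmarking sketch (my own illustration, not from the original answers) that forces a side effect: each variant produces a checksum that gets printed, so the optimizer can't discard the work being measured. Timings here are illustrative only; compile with -O and compare before drawing any conclusions.

```swift
import Foundation

// Illustrative data set; 1M elements is enough to measure.
let data = Array(0..<1_000_000)

// Times a computation and prints its checksum, so the optimizer
// cannot eliminate the work as an irrelevant side effect.
func time(_ label: String, _ body: () -> Int) -> Int {
    let start = Date()
    let checksum = body()
    let elapsed = Date().timeIntervalSince(start)
    print("\(label): \(elapsed)s (checksum \(checksum))")
    return checksum
}

let mapChecksum = time("map/reduce") {
    data.map { $0 * 2 }.reduce(0, +)
}

let loopChecksum = time("hand-rolled") {
    var sum = 0
    for x in data { sum += x * 2 }
    return sum
}
```

Both variants must print the same checksum; if one runs dramatically faster and prints nothing useful, suspect the optimizer rather than the loop.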

Process Array in parallel using GCD

This is a slightly different take on the approach in @Eduardo's answer, using the Array type's withUnsafeMutableBufferPointer<R>(body: (inout UnsafeMutableBufferPointer<T>) -> R) -> R method. That method's documentation states:

Call body(p), where p is a pointer to the Array's mutable contiguous storage. If no such storage exists, it is first created.

Often, the optimizer can eliminate bounds- and uniqueness-checks within an array algorithm, but when that fails, invoking the same algorithm on body's argument lets you trade safety for speed.

That second paragraph seems to be exactly what's happening here, so using this method might be more "idiomatic" in Swift, whatever that means:

func calcSummary() {
    let group = dispatch_group_create()
    let queue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0)

    self.summary.withUnsafeMutableBufferPointer {
        summaryMem -> Void in
        for i in 0 ..< 10 {
            dispatch_group_async(group, queue, {
                let base = i * 50000
                for x in base ..< base + 50000 {
                    summaryMem[i] += self.array[x]
                }
            })
        }
        // Wait inside the closure: the buffer pointer must not be used
        // after withUnsafeMutableBufferPointer returns.
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER)
    }

    dispatch_group_notify(group, queue, {
        println(self.summary)
    })
}
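For comparison, here is a modern (Swift 3+) sketch of the same chunked summation using DispatchQueue.concurrentPerform, which is synchronous, so the buffer pointer is guaranteed to stay valid for the whole computation. The sample data below is my own stand-in, not from the original answer.

```swift
import Dispatch

// Stand-in data: 500,000 ones, summed into 10 partial sums,
// one 50,000-element chunk per concurrent iteration.
let array = [Int](repeating: 1, count: 500_000)
var summary = [Int](repeating: 0, count: 10)

summary.withUnsafeMutableBufferPointer { summaryMem in
    // concurrentPerform returns only after all iterations complete,
    // so summaryMem cannot outlive the closure.
    DispatchQueue.concurrentPerform(iterations: 10) { i in
        let base = i * 50_000
        for x in base ..< base + 50_000 {
            summaryMem[i] += array[x]
        }
    }
}

print(summary) // each element ends up 50_000 with this input
```

Each iteration writes only to its own slot of the buffer, so no lock is needed.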

Is Swift Array's map method implemented concurrently?

The map implementation of Array specifically doesn't perform any multithreading, but there's nothing that says you couldn't make a concurrent implementation. Here's one that's generalized to work with any Sequence:

import Dispatch

class SharedSynchronizedArray<T> {
    var array = [T]()
    let operationQueue = DispatchQueue(label: "SharedSynchronizedArray")

    func append(_ newElement: T) {
        operationQueue.sync {
            array.append(newElement)
        }
    }
}

public struct ConcurrentSequence<Base, Element>: Sequence
    where Base: Sequence, Element == Base.Iterator.Element {
    let base: Base

    public func makeIterator() -> Base.Iterator {
        return base.makeIterator()
    }

    public func map<T>(_ transform: @escaping (Element) -> T) -> [T] {
        let group = DispatchGroup()

        let resultsStorageQueue = DispatchQueue(label: "resultStorageQueue")
        let results = SharedSynchronizedArray<T>()

        let processingQueue = DispatchQueue(
            label: "processingQueue",
            attributes: [.concurrent]
        )

        for element in self {
            group.enter()
            print("Entered DispatchGroup for \(element)")

            var result: T?
            let workItem = DispatchWorkItem { result = transform(element) }

            processingQueue.async(execute: workItem)

            resultsStorageQueue.async {
                workItem.wait()
                guard let unwrappedResult = result else {
                    fatalError("The work item was completed, but the result wasn't set!")
                }
                results.append(unwrappedResult)

                group.leave()
                print("Exited DispatchGroup for \(element)")
            }
        }
        print("Started Waiting on DispatchGroup")
        group.wait()
        print("DispatchGroup done")

        return results.array
    }
}

public extension Sequence {
    public var parallel: ConcurrentSequence<Self, Iterator.Element> {
        return ConcurrentSequence(base: self)
    }
}

import Foundation

print("Start")

let input = Array(0..<100)
let output: [Int] = input.parallel.map {
    let randomDuration = TimeInterval(Float(arc4random()) / Float(UInt32.max))
    Thread.sleep(forTimeInterval: randomDuration)
    print("Transforming \($0)")
    return $0 * 2
}

print(output)
// print(output.parallel.filter { $0 % 3 == 0 })

print("Done")

Using Grand Central Dispatch in Swift to parallelize and speed up "for" loops?

A "multi-threaded iteration" can be done with dispatch_apply():

let outerCount = 100   // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}

(You have to figure out the best relation between outer and inner counts.)

There are two things to notice:

  • arc4random() uses an internal mutex, which makes it extremely slow when called
    from several threads in parallel; see Performance of concurrent code using
    dispatch_group_async is MUCH slower than single-threaded version. From the
    answers given there, rand_r() (with a separate seed for each thread) seems
    to be a faster alternative.

  • The result variable winner must not be modified from multiple threads
    simultaneously. You can use an array instead, where each thread updates its
    own element, and the results are added afterwards. A thread-safe method has
    been described in https://stackoverflow.com/a/26790019/1187415.

Then it would roughly look like this:

let outerCount = 100   // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)

var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in

    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"

        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500

            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability - points) { ++points }
                else { points = points - 1 }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}

// Add results:
let winner = reduce(winners, 0, +)
println(winner)
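For reference, here is a rough modern-Swift translation of the same idea (my own sketch, with the iteration counts reduced so it runs quickly). DispatchQueue.concurrentPerform replaces dispatch_apply, and Int.random(in:using:) with a per-iteration SystemRandomNumberGenerator replaces the mutex-guarded arc4random():

```swift
import Foundation

let outerCount = 10  // reduced from 100 for a quick illustrative run
let innerCount = 100 // reduced from 10000 for a quick illustrative run

var winners = [Int](repeating: 0, count: outerCount)
winners.withUnsafeMutableBufferPointer { winnersPtr in
    DispatchQueue.concurrentPerform(iterations: outerCount) { outerIdx in
        // One generator per concurrent iteration: no shared mutex.
        var generator = SystemRandomNumberGenerator()

        for _ in 1 ... innerCount {
            var points = 0
            let ability = 500

            for _ in 1 ... 1000 {
                let chance = Int.random(in: 0 ... 1000, using: &generator)
                if chance < (ability - points) { points += 1 }
                else { points -= 1 }
            }
            // Each iteration owns its own slot: no data race on winners.
            if points > 0 {
                winnersPtr[outerIdx] += 1
            }
        }
    }
}

let winner = winners.reduce(0, +)
print(winner)
```

As before, each concurrent iteration writes only to its own array slot, and the partial counts are summed once all iterations have finished.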

Swift Async let with loop

You can use a task group. See the Tasks and Task Groups section of The Swift Programming Language: Concurrency (which would appear to be where you got your example).

One can use withTaskGroup(of:returning:body:) to create a task group to run tasks in parallel, but then collate all the results together at the end.

E.g., here is an example that creates child tasks that return a tuple of "name" and "image", and the group returns a combined dictionary of those name strings with their associated image values:

func downloadImages(names: [String]) async -> [String: UIImage] {
    await withTaskGroup(
        of: (String, UIImage).self,
        returning: [String: UIImage].self
    ) { [self] group in
        for name in names {
            group.addTask { await (name, downloadPhoto(named: name)) }
        }

        var images: [String: UIImage] = [:]

        for await result in group {
            images[result.0] = result.1
        }

        return images
    }
}

Or, more concisely:

func downloadImages(names: [String]) async -> [String: UIImage] {
    await withTaskGroup(of: (String, UIImage).self) { [self] group in
        for name in names {
            group.addTask { await (name, downloadPhoto(named: name)) }
        }

        return await group.reduce(into: [:]) { $0[$1.0] = $1.1 }
    }
}

They run in parallel.

But you can extract them from the dictionary of results:

let stooges = ["moe", "larry", "curly"]
let images = await downloadImages(names: stooges)

imageView1.image = images["moe"]
imageView2.image = images["larry"]
imageView3.image = images["curly"]

Or, if you want an array in the original order, just build an array from the dictionary:

func downloadImages(names: [String]) async -> [UIImage] {
    await withTaskGroup(of: (String, UIImage).self) { [self] group in
        for name in names {
            group.addTask { await (name, downloadPhoto(named: name)) }
        }

        let dictionary = await group.reduce(into: [:]) { $0[$1.0] = $1.1 }
        return names.compactMap { dictionary[$0] }
    }
}
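The same task-group pattern generalizes into a reusable helper. Here is a sketch of a generic, order-preserving concurrent map (the name concurrentMap is my own; it is not a standard-library API): each child task is tagged with its input index, and the results are sorted by index at the end.

```swift
// A generic, order-preserving concurrent map built on withTaskGroup.
extension Sequence where Element: Sendable {
    func concurrentMap<T: Sendable>(
        _ transform: @escaping @Sendable (Element) async -> T
    ) async -> [T] {
        await withTaskGroup(of: (Int, T).self) { group in
            // Tag each child task with its input index.
            for (index, element) in self.enumerated() {
                group.addTask { await (index, transform(element)) }
            }

            // Results arrive in completion order; sort by index
            // to restore the input order.
            var results: [(Int, T)] = []
            for await pair in group {
                results.append(pair)
            }
            return results.sorted { $0.0 < $1.0 }.map { $0.1 }
        }
    }
}
```

Usage mirrors plain map, e.g. `let doubled = await (0..<10).concurrentMap { $0 * 2 }` from an async context.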

