Metal kernels not behaving properly on the new MacBook Pro (late 2016) GPUs

I think that whenever the GPU writes to a MTLStorageModeManaged resource such as a texture and you then want to read that resource from the CPU (e.g. using getBytes()), you need to synchronize it using a blit encoder. Try putting the following above the commandBuffer.commit() line:

let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
blitEncoder.synchronize(resource: outTexture)
blitEncoder.endEncoding()

You may get away without this on an integrated GPU because the GPU is using system memory for the resource and there's nothing to synchronize.
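Putting it together, the full readback flow might look roughly like this (a sketch; outTexture and commandBuffer are the names from the question, and the rgba32Float layout of four floats per pixel is an assumption):

```swift
// Synchronize the managed texture so its CPU-side copy is updated.
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.synchronize(resource: outTexture)
blit.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Only now is it safe to read the texture from the CPU.
var pixels = [Float](repeating: 0, count: outTexture.width * outTexture.height * 4)
let bytesPerRow = outTexture.width * 4 * MemoryLayout<Float>.stride
outTexture.getBytes(&pixels,
                    bytesPerRow: bytesPerRow,
                    from: MTLRegionMake2D(0, 0, outTexture.width, outTexture.height),
                    mipmapLevel: 0)
```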

Metal kernel shader not working

First, with MTLTextureUsage.renderTarget I get the error "validateComputeFunctionArguments:825: failed assertion `Function writes texture (outTexture[0]) whose usage (0x04) doesn't specify MTLTextureUsageShaderWrite (0x02)'" so it should probably be MTLTextureUsage.shaderWrite.

For some reason, if I force the Intel GPU with gfxSwitch, the readback from the texture returns correct data, but with the Radeon it's always zero regardless of the "textureDesc.resourceOptions = MTLResourceOptions.storageModeXXX" flags.

What has worked for me with both the Intel and the Radeon 460 was creating a MTLBuffer and using it instead of the texture. You have to calculate the linear index yourself, though. It should not be a big deal to switch to buffers if you're not using mipmapping or sampling with float coordinates.

let bytesPerPixel = 4 * MemoryLayout<Float>.stride  // four floats per pixel
let length = bytesPerPixel * width * height
let texBuffer = device?.makeBuffer(length: length, options: .storageModeShared)

var result = [Float](repeating: 0, count: width * height * 4)
let data = NSData(bytesNoCopy: texBuffer!.contents(), length: length, freeWhenDone: false)
data.getBytes(&result, length: length)
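For reference, the linear index for addressing such a buffer as if it were a width × height RGBA texture can be computed like this (an illustrative helper; the name texelIndex is not from the original code):

```swift
// Index of channel c of the texel at (x, y) in a tightly packed
// width x height RGBA float buffer.
func texelIndex(x: Int, y: Int, channel: Int, width: Int) -> Int {
    return (y * width + x) * 4 + channel
}

// e.g. the red channel of the texel at (3, 2) in a 16-wide image:
let i = texelIndex(x: 3, y: 2, channel: 0, width: 16)  // (2*16 + 3)*4 = 140
```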

I would assume creating a texture backed by a MTLBuffer would also work, but that API is only available in macOS 10.13.
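If you can target macOS 10.13, that buffer-backed texture would look roughly like this (a sketch using MTLBuffer.makeTexture(descriptor:offset:bytesPerRow:); the pixel format and usage values are assumptions):

```swift
let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float,
                                                    width: width,
                                                    height: height,
                                                    mipmapped: false)
desc.usage = .shaderWrite
desc.storageMode = .shared  // must match the backing buffer's storage mode

let bytesPerRow = width * 4 * MemoryLayout<Float>.stride
if #available(macOS 10.13, *) {
    let tex = texBuffer?.makeTexture(descriptor: desc,
                                     offset: 0,
                                     bytesPerRow: bytesPerRow)
    // Kernel writes through tex land directly in texBuffer, so
    // texBuffer.contents() can be read back without a blit.
}
```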

EDIT: As pointed out by Ken Thomases, there is a similar discussion in the thread "Metal kernels not behaving properly on the new MacBook Pro (late 2016) GPUs".

I have made a sample app using the approach and shader from the first post of this thread, and the fix from the linked thread worked for me. Here is a link to the app code in case anyone wants a reproducible example:
https://gist.github.com/astarasikov/9e4f58e540a6ff066806d37eb5b2af29

Metal on macOS: Visibility testing behaving incorrectly

Per the documentation, this is working as expected:

In MTLVisibilityResultModeBoolean mode, when a sample passes, the device writes a nonzero value to the buffer. If no samples pass, the device writes zero.
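For context, using it looks roughly like this (a sketch; renderEncoder, visibilityBuffer, and the offset are illustrative names, and visibilityBuffer is assumed to be the buffer assigned to the render pass descriptor's visibilityResultBuffer):

```swift
// Ask only whether *any* sample passed, not how many.
renderEncoder.setVisibilityResultMode(.boolean, offset: 0)
// ... draw calls whose visibility you want to test ...
renderEncoder.setVisibilityResultMode(.disabled, offset: 0)

// After the command buffer completes, read the 64-bit result:
let result = visibilityBuffer.contents().load(as: UInt64.self)
let anySamplePassed = result != 0  // nonzero means at least one sample passed
// Use .counting instead of .boolean if you need an actual sample count.
```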

Is it possible to run Metal code on two or more GPUs at the same time?

Yes.

If, for example, you're on a Mac with a discrete GPU and an integrated GPU, there will be multiple elements in the array returned by a call to MTLCopyAllDevices(). Same if you have one or more external GPUs connected to your Mac.

In order to run the same compute kernel on each GPU, you'll need to create separate resources and pipeline state objects, since these objects are all affiliated with a single MTLDevice. Everything else about encoding and enqueueing work remains the same.
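Sketched out, the per-device setup might look like this (assuming a kernel named "myKernel" in the default library; the names are illustrative, and error handling is elided):

```swift
import Metal

let devices = MTLCopyAllDevices()
var perDevice: [(queue: MTLCommandQueue, pipeline: MTLComputePipelineState)] = []

for device in devices {
    // Every object below is affiliated with this one device.
    let library = device.makeDefaultLibrary()!
    let function = library.makeFunction(name: "myKernel")!  // assumed kernel name
    let pipeline = try! device.makeComputePipelineState(function: function)
    let queue = device.makeCommandQueue()!
    perDevice.append((queue, pipeline))
}
// Encode and commit work per device exactly as you would for one GPU.
```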

Except in limited cases (i.e., when GPUs occupy the same peer group), you can't copy resources directly between GPUs. You can, however, use a MTLBlitCommandEncoder to copy shared or managed resources via the system bus.

If there are dependencies among the compute commands across devices, you may need to use events to explicitly synchronize them.
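For that cross-device ordering, MTLSharedEvent (macOS 10.14) is the usual tool; roughly (queueA/queueB are command queues on the two devices, and the value 1 is an arbitrary milestone):

```swift
let event = deviceA.makeSharedEvent()!

// On GPU A: signal the event when this buffer's work finishes.
let bufferA = queueA.makeCommandBuffer()!
// ... encode compute work on device A ...
bufferA.encodeSignalEvent(event, value: 1)
bufferA.commit()

// On GPU B: wait until GPU A has signaled before starting.
let bufferB = queueB.makeCommandBuffer()!
bufferB.encodeWaitForEvent(event, value: 1)
// ... encode compute work on device B that depends on A's results ...
bufferB.commit()
```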


