Counting coloured pixels on the GPU - Theory

Indeed, general-purpose GPUs (such as those in Apple devices from the A8 onwards) are not only capable of solving such data-parallel processing problems but are designed for exactly that.

Apple introduced data-parallel processing with Metal across its platforms, and with some fairly simple code you can solve problems like yours on the GPU. Although this can also be done with other frameworks, I am including some code for the Metal + Swift case as a proof of concept.

The following runs as a Swift command-line tool on macOS Sierra and was built with Xcode 9 (yes, I know it's a beta). You can get the full project from my GitHub repo.

As main.swift:

import Foundation
import Metal
import CoreGraphics
import AppKit

guard FileManager.default.fileExists(atPath: "./testImage.png") else {
    print("./testImage.png does not exist")
    exit(1)
}

let url = URL(fileURLWithPath: "./testImage.png")
let imageData = try Data(contentsOf: url)

guard let image = NSImage(data: imageData),
      let imageRef = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    print("Failed to load image data")
    exit(1)
}

let bytesPerPixel = 4
let bytesPerRow = bytesPerPixel * imageRef.width

var rawData = [UInt8](repeating: 0, count: Int(bytesPerRow * imageRef.height))

let bitmapInfo = CGBitmapInfo(rawValue: CGImageAlphaInfo.premultipliedFirst.rawValue).union(.byteOrder32Big)
let colorSpace = CGColorSpaceCreateDeviceRGB()

let context = CGContext(data: &rawData,
                        width: imageRef.width,
                        height: imageRef.height,
                        bitsPerComponent: 8,
                        bytesPerRow: bytesPerRow,
                        space: colorSpace,
                        bitmapInfo: bitmapInfo.rawValue)

let fullRect = CGRect(x: 0, y: 0, width: CGFloat(imageRef.width), height: CGFloat(imageRef.height))
context?.draw(imageRef, in: fullRect, byTiling: false)

// Get access to the GPU
guard let device = MTLCreateSystemDefaultDevice() else {
    exit(1)
}

let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm,
    width: Int(imageRef.width),
    height: Int(imageRef.height),
    mipmapped: true)

// makeTexture(descriptor:) returns an optional in current SDKs, hence the force unwrap
let texture = device.makeTexture(descriptor: textureDescriptor)!

let region = MTLRegionMake2D(0, 0, Int(imageRef.width), Int(imageRef.height))
texture.replace(region: region, mipmapLevel: 0, withBytes: &rawData, bytesPerRow: Int(bytesPerRow))

// Queue to handle an ordered list of command buffers
// (the makeXxx factory methods return optionals in current SDKs, hence the force unwraps)
let commandQueue = device.makeCommandQueue()!

// Buffer for storing the encoded commands that are sent to the GPU
let commandBuffer = commandQueue.makeCommandBuffer()!

// Access to the Metal functions stored in the Shaders.metal file, e.g. countBlack()
guard let defaultLibrary = device.makeDefaultLibrary() else {
    print("Failed to create default Metal shader library")
    exit(1)
}

// Encoder for GPU commands
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!

// Hardcoded to 16 for now; this assumes the image dimensions are multiples of 16
// (recommendation: read about threadExecutionWidth)
let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
let numThreadgroups = MTLSizeMake(texture.width / threadsPerGroup.width,
                                  texture.height / threadsPerGroup.height,
                                  1)

// Set up a compute pipeline with the countBlack() function and add it to the encoder
let countBlackProgram = defaultLibrary.makeFunction(name: "countBlack")
let computePipelineState = try device.makeComputePipelineState(function: countBlackProgram!)
computeCommandEncoder.setComputePipelineState(computePipelineState)

// Set the input texture for the countBlack() function, i.e. inArray
// index: 0 here corresponds to texture(0) in the countBlack() function
computeCommandEncoder.setTexture(texture, index: 0)

// Create the output buffer for the countBlack() function, i.e. counter
// index: 0 here corresponds to buffer(0) in the countBlack() function
let counterBuffer = device.makeBuffer(length: MemoryLayout<UInt32>.size,
                                      options: .storageModeShared)!
computeCommandEncoder.setBuffer(counterBuffer, offset: 0, index: 0)

computeCommandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)

computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// a. Get the result back from the GPU
// counterBuffer.contents() returns an UnsafeMutableRawPointer, roughly equivalent to void* in C
let data = NSData(bytesNoCopy: counterBuffer.contents(),
                  length: MemoryLayout<UInt32>.size,
                  freeWhenDone: false)

// b. Prepare a Swift array large enough to receive the data from the GPU
var finalResultArray = [UInt32](repeating: 0, count: 1)

// c. Get the data from the GPU into the Swift array
data.getBytes(&finalResultArray, length: MemoryLayout<UInt32>.size)

print("Found \(finalResultArray[0]) non-white pixels")

// d. YOU'RE ALL SET!

Also, in Shaders.metal:

#include <metal_stdlib>
using namespace metal;

kernel void
countBlack(texture2d<float, access::read> inArray [[texture(0)]],
           volatile device uint *counter [[buffer(0)]],
           uint2 gid [[thread_position_in_grid]]) {

    // Atomic as we need to sync between threadgroups
    device atomic_uint *atomicBuffer = (device atomic_uint *)counter;
    float3 inColor = inArray.read(gid).rgb;
    if (inColor.r != 1.0 || inColor.g != 1.0 || inColor.b != 1.0) {
        atomic_fetch_add_explicit(atomicBuffer, 1, memory_order_relaxed);
    }
}

I used this question as an opportunity to learn a bit about Metal and data-parallel computing, so most of the code started as boilerplate from articles online and was edited from there. Please take the time to visit the sources mentioned below for some more examples. Also, the code is pretty much hardcoded for this particular problem, but you shouldn't have much trouble adapting it.

Sources:

http://flexmonkey.blogspot.com.ar/2016/05/histogram-equalisation-with-metal.html

http://metalbyexample.com/introduction-to-compute/

http://memkite.com/blog/2014/12/15/data-parallel-programming-with-metal-and-swift-for-iphoneipad-gpu/

Efficiently count how many transparent pixels are in UIImage/CIImage with Metal

What you want to perform is a reduction operation, which is not necessarily well-suited for the GPU due to its massively parallel nature. I'd recommend not writing a reduction operation for the GPU yourself, but rather use some highly optimized built-in APIs that Apple provides (like CIAreaAverage or the corresponding Metal Performance Shaders).

The most efficient way depends a bit on your use case, specifically where the image comes from (loaded via UIImage/CGImage or the result of a Core Image pipeline?) and where you'd need the resulting count (on the CPU/Swift side or as an input for another Core Image filter?).

It also depends on whether the pixels can be semi-transparent (alpha neither 0.0 nor 1.0).

If the image is on the GPU and/or the count should be used on the GPU, I'd recommend using CIAreaAverage. The alpha value of the result should reflect the percentage of transparent pixels. Note that this only works if there are no semi-transparent pixels.
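
For reference, here is a minimal Swift sketch of that approach (my own code, not from Apple's documentation; the function name and the 1x1 RGBA8 readback are choices made for the example), assuming every pixel is either fully opaque or fully transparent:

import CoreImage

// Minimal sketch: estimate the fraction of fully transparent pixels by
// averaging the alpha channel with CIAreaAverage.
func transparentFraction(of inputImage: CIImage, context: CIContext = CIContext()) -> Double? {
    let extentVector = CIVector(x: inputImage.extent.origin.x,
                                y: inputImage.extent.origin.y,
                                z: inputImage.extent.size.width,
                                w: inputImage.extent.size.height)
    guard let filter = CIFilter(name: "CIAreaAverage",
                                parameters: [kCIInputImageKey: inputImage,
                                             kCIInputExtentKey: extentVector]),
          let output = filter.outputImage else { return nil }

    // The filter reduces the whole extent to a single RGBA pixel.
    var pixel = [UInt8](repeating: 0, count: 4)
    context.render(output,
                   toBitmap: &pixel,
                   rowBytes: 4,
                   bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                   format: .RGBA8,
                   colorSpace: nil)

    // With only fully opaque/transparent pixels, 1 - averageAlpha is the
    // fraction of transparent pixels.
    return 1.0 - Double(pixel[3]) / 255.0
}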

The next best solution is probably just iterating over the pixel data on the CPU. It might be a few million pixels, but the operation itself is very fast, so this should take almost no time. You could even use multi-threading by splitting the image up into chunks and using concurrentPerform(...) of DispatchQueue.
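
A rough sketch of that chunked CPU approach (my own code, not from the answer; the chunk count and the assumption of tightly packed RGBA8 data with alpha at byte offset 3 are arbitrary) could look like this:

import Foundation

// Split the RGBA byte buffer into row chunks and count transparent pixels in parallel.
func countTransparentPixels(pixels: [UInt8], width: Int, height: Int) -> Int {
    let chunkCount = 8                                    // tune to the core count
    let rowsPerChunk = (height + chunkCount - 1) / chunkCount
    var partialCounts = [Int](repeating: 0, count: chunkCount)

    partialCounts.withUnsafeMutableBufferPointer { partials in
        DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
            let startRow = chunk * rowsPerChunk
            let endRow = min(startRow + rowsPerChunk, height)
            guard startRow < endRow else { return }       // trailing chunks may be empty
            var count = 0
            for row in startRow..<endRow {
                let rowStart = row * width * 4
                for x in 0..<width where pixels[rowStart + x * 4 + 3] == 0 {
                    count += 1
                }
            }
            partials[chunk] = count                       // each chunk owns its own slot
        }
    }
    return partialCounts.reduce(0, +)
}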

A last, but probably overkill, solution would be to use Accelerate (this would make @FlexMonkey happy): load the image's pixel data into a vDSP buffer and use the sum or average methods to calculate the percentage using the CPU's vector units.
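
Sketching that idea (my own code; it assumes tightly packed RGBA8 data and replaces counting with averaging the alpha channel), the Accelerate version could look roughly like this:

import Accelerate

// Convert the alpha bytes to floats with a stride of 4 and let vDSP compute
// their mean on the CPU's vector units.
func averageAlpha(rgbaPixels: [UInt8]) -> Float {
    let pixelCount = rgbaPixels.count / 4
    guard pixelCount > 0 else { return 0 }
    var alphas = [Float](repeating: 0, count: pixelCount)
    rgbaPixels.withUnsafeBufferPointer { src in
        // Read every 4th byte, starting at the alpha offset (3).
        vDSP_vfltu8(src.baseAddress! + 3, 4, &alphas, 1, vDSP_Length(pixelCount))
    }
    var mean: Float = 0
    vDSP_meanv(alphas, 1, &mean, vDSP_Length(pixelCount))
    return mean / 255.0   // 1.0 = fully opaque on average, 0.0 = fully transparent
}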

Clarification

When I was saying that a reduction operation is "not necessarily well-suited for the GPU", I meant to say that it's rather complicated to implement in an efficient way and by far not as straightforward as a sequential algorithm.

The check whether a pixel is transparent or not can be done in parallel, sure, but the results need to be gathered into a single value, which requires multiple GPU cores reading and writing values into the same memory. This usually requires some synchronization (and thereby hinders parallel execution) and incurs latency cost due to access to the shared or global memory space. That's why efficient gather algorithms for the GPU usually follow a multi-step tree-based approach. I can highly recommend reading NVIDIA's publications on the topic (e.g. here and here). That's also why I recommended using built-in APIs when possible since Apple's Metal team knows how to best optimize these algorithms for their hardware.

There is also an example reduction implementation in Apple's Metal Shading Language Specification (pp. 158) that uses simd_shuffle intrinsics for efficiently communicating intermediate values down the tree. The general principle is the same as described by NVIDIA's publications linked above, though.

What is the index of the pixel in RGBA theory

The formula

index = y * width + x

Example

Say you have an image that is 128 pixels wide by 128 pixels high, and it is represented by a one-dimensional array of 32-bit integers (one packed RGBA value per pixel). The pixel at x = 15, y = 22 is then found at

index = 22 * 128 + 15 = 2831
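
In code, the same lookup might look like this (the helper name is mine, purely for illustration):

// Hypothetical helper illustrating the formula above for a flat buffer with
// one packed 32-bit RGBA value per pixel.
func pixelIndex(x: Int, y: Int, width: Int) -> Int {
    return y * width + x
}

let width = 128
let index = pixelIndex(x: 15, y: 22, width: width)   // 22 * 128 + 15 = 2831

// With four separate bytes per pixel instead of one packed UInt32,
// multiply by 4 to get the byte offset of the red component:
let byteOffset = index * 4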

Fast color quantization in OpenCV

There are many ways to quantize colors. Here I describe four.

Uniform quantization

Here we are using a color map with uniformly distributed colors, whether they exist in the image or not. In MATLAB-speak you would write

qimg = round(img*(N/255))*(255/N);

to quantize each channel into N levels (assuming the input is in the range [0,255]). You can also use floor, which is more suitable in some cases. This leads to N^3 different colors. For example, with N=8 you get 512 unique RGB colors.
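
Translated to Swift for a single 8-bit channel value, the same uniform quantization might look like this (my adaptation of the MATLAB one-liner above):

// Quantize one channel to N levels, mirroring round(img*(N/255))*(255/N).
func quantize(_ value: UInt8, levels n: Int) -> UInt8 {
    let scaled = (Double(value) * Double(n) / 255.0).rounded()    // round(value * (N/255))
    return UInt8((scaled * 255.0 / Double(n)).rounded())          // * (255/N)
}

// Applying this to R, G and B independently gives the uniform palette described
// above; swap .rounded() for .rounded(.down) to get the floor variant.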

K-means clustering

This is the "classical" method to generate an adaptive palette. Obviously it is going to be the most expensive. The OP is applying k-means on the collection of all pixels. Instead, k-means can be applied to the color histogram. The process is identical, but instead of 10 million data points (a typical image nowadays), you have only maybe 32^3 = 33 thousand. The quantization caused by the histogram with reduced number of bins has little effect here when dealing with natural photographs. If you are quantizing a graph, which has a limited set of colors, you don't need to do k-means clustering.

You do a single pass through all pixels to create the histogram. Next, you run the regular k-means clustering, but using the histogram bins. Each data point now also has a weight (the number of pixels within that bin) that you need to take into account. The step in the algorithm that determines the cluster centers is affected: you need to compute the weighted mean of the data points instead of the regular mean.

The result is affected by the initialization.
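
The only step that really changes is the centroid update, which becomes a weighted mean over histogram bins. A sketch with hypothetical types (my own code, not from any particular library):

// Each bin carries its position in RGB space and a weight equal to the number
// of pixels that fell into it.
struct HistogramBin {
    var color: SIMD3<Double>   // bin centre in RGB
    var weight: Double         // pixel count for this bin
}

// Weighted mean instead of the plain mean used when clustering raw pixels.
func updateCentroid(for bins: [HistogramBin]) -> SIMD3<Double> {
    var weightedSum = SIMD3<Double>(repeating: 0)
    var totalWeight = 0.0
    for bin in bins {
        weightedSum += bin.color * bin.weight
        totalWeight += bin.weight
    }
    return totalWeight > 0 ? weightedSum / totalWeight : weightedSum
}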

Octree quantization

An octree is a data structure for spatial indexing, where the volume is recursively divided into 8 sub-volumes by cutting each axis in half. The tree thus is formed of nodes with 8 children each. For color quantization, the RGB cube is represented by an octree, and the number of pixels per node is counted (this is equivalent to building a color histogram, and constructing an octree on top of that). Next, leaf nodes are removed until the desired number of them is left. Removing leaf nodes happens 8 at a time, such that a node one level up becomes a leaf. There are different strategies to pick which nodes to prune, but they typically revolve around pruning nodes with low pixel counts.

This is the method that Gimp uses.

Because the octree always splits nodes down the middle, it is not as flexible as k-means clustering or the next method.
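
To make the structure concrete, here is a small hypothetical sketch (my own, not Gimp's actual code) of an octree node and the bit trick that selects a child from an RGB triple:

// Each node splits the RGB cube in half along all three axes, so a child index
// is built from one bit of each channel at the current depth.
final class OctreeNode {
    var pixelCount = 0
    var children = [OctreeNode?](repeating: nil, count: 8)
}

func childIndex(r: UInt8, g: UInt8, b: UInt8, depth: Int) -> Int {
    let shift = 7 - depth                        // depth 0 looks at the most significant bit
    let rBit = Int((r >> shift) & 1)
    let gBit = Int((g >> shift) & 1)
    let bBit = Int((b >> shift) & 1)
    return (rBit << 2) | (gBit << 1) | bBit      // 0...7
}

func insert(r: UInt8, g: UInt8, b: UInt8, into root: OctreeNode, maxDepth: Int = 8) {
    var node = root
    for depth in 0..<maxDepth {
        let i = childIndex(r: r, g: g, b: b, depth: depth)
        if node.children[i] == nil { node.children[i] = OctreeNode() }
        node = node.children[i]!
    }
    node.pixelCount += 1                         // the leaf counts form the color histogram
}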

Minimum variance quantization

MATLAB's rgb2ind, which the OP mentions, does uniform quantization and something they call "minimum variance quantization":

Minimum variance quantization cuts the RGB color cube into smaller boxes (not necessarily cubes) of different sizes, depending on how the colors are distributed in the image.

I'm not sure what this means. This page doesn't give away anything more, but it has a figure that looks like a k-d tree partitioning of the RGB cube. K-d trees are spatial indexing structures that divide spatial data in half recursively. At each level, you pick the dimension where there is most separation, and split along that dimension, leading to one additional leaf node. In contrast to octrees, the splitting can happen at an optimal location, it is not down the middle of the node.

The advantage of using a spatial indexing structure (either k-d trees or octrees) is that the color lookup is really fast. You start at the root, and make a binary decision based on either R, G or B value, until you reach a leaf node. There is no need to compute distances to each prototype cluster, as is the case of k-means.
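
A tiny sketch of such a lookup (hypothetical types, my own code) shows why it is so cheap: one comparison per tree level and no distance computations at all:

// A k-d tree over RGB space; leaves hold an index into the palette.
indirect enum ColorKDTree {
    case leaf(paletteIndex: Int)
    case node(axis: Int, threshold: UInt8, below: ColorKDTree, aboveOrEqual: ColorKDTree)
}

func lookup(_ rgb: (r: UInt8, g: UInt8, b: UInt8), in tree: ColorKDTree) -> Int {
    switch tree {
    case .leaf(let paletteIndex):
        return paletteIndex
    case .node(let axis, let threshold, let below, let aboveOrEqual):
        let value = axis == 0 ? rgb.r : (axis == 1 ? rgb.g : rgb.b)
        // One binary decision per level, as described above.
        return lookup(rgb, in: value < threshold ? below : aboveOrEqual)
    }
}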

[Edit two weeks later] I have been thinking about a possible implementation, and came up with one. This is the algorithm:

  • The full color histogram is considered a partition. This will be the root for a k-d tree, which right now is also the leaf node because there are yet no other nodes.
  • A priority queue is created. It contains all the leaf nodes of the k-d tree. The priority is given by the variance of the partition along one axis, minus the variances of the two halves if we were to split the partition along that axis. The split location is picked such that the variances of the two halves are minimal (using Otsu's algorithm). That is, the larger the priority, the more total variance we reduce by making the split. For each leaf node, we compute this value for each axis, and use the largest result.
  • We process partitions on the queue until we have the desired number of partitions:
    • We split the partition with highest priority along the axis and at the location computed when determining the priority.
    • We compute the priority for each of the two halves, and put them on the queue.

Described this way it is a relatively simple algorithm; the code is somewhat more complex, because I tried to make it efficient yet generic.
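
For concreteness, here is a rough sketch (my own simplification, not the actual implementation) of the per-axis priority computation: project the partition's weighted histogram onto one axis, try every split position, and keep the one that removes the most variance:

// Weighted variance of a 1-D histogram slice (counts are bin weights).
func weightedVariance(counts: ArraySlice<Double>, values: ArraySlice<Double>) -> Double {
    let total = counts.reduce(0, +)
    guard total > 0 else { return 0 }
    let mean = zip(counts, values).reduce(0) { $0 + $1.0 * $1.1 } / total
    let squared = zip(counts, values).reduce(0) { $0 + $1.0 * ($1.1 - mean) * ($1.1 - mean) }
    return squared / total
}

// For one axis: returns the split index and the priority (variance removed by
// the split). A real implementation would use running sums, as in Otsu's
// method, instead of this O(n^2) scan.
func bestSplit(counts: [Double], values: [Double]) -> (index: Int, priority: Double) {
    guard counts.count > 1 else { return (0, 0) }
    let whole = weightedVariance(counts: counts[...], values: values[...])
    var best = (index: 1, priority: -Double.infinity)
    for i in 1..<counts.count {
        let left = weightedVariance(counts: counts[..<i], values: values[..<i])
        let right = weightedVariance(counts: counts[i...], values: values[i...])
        let priority = whole - (left + right)
        if priority > best.priority {
            best = (index: i, priority: priority)
        }
    }
    return best
}

The priority pushed onto the queue is then the best value found across the three axes, together with the winning axis and split location.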

Comparison

On a 256x256x256 RGB histogram I got these timings comparing k-means clustering and this new algorithm:

# clusters    kmeans (s)    minvar (s)
         5          3.98          0.34
        20          17.9          0.48
        50         220.8          0.59

How to render individual pixels for one layer of a 3DTexture in a framebuffer?

I’m not sure what you’re trying to do and what you think the positions will do.

You have 2 options for GPU simulation in WebGL2

  1. use transform feedback.

    In this case you pass in attributes and generate data in buffers. Effectively you have "in" attributes and "out" attributes, and generally you only run the vertex shader. To put it another way, your varyings (the output of your vertex shader) get written to a buffer. So you have at least 2 sets of buffers, currentState and nextState, and your vertex shader reads attributes from currentState and writes them to nextState.

    There is an example of writing to buffers via transform feedback here, though that example only uses transform feedback at the start to fill buffers once.

  2. use textures attached to framebuffers

    In this case, similarly, you have 2 textures, currentState and nextState. You set nextState to be your render target and read from currentState to generate the next state.

    The difficulty is that you can only render to textures by outputting primitives from the vertex shader. If currentState and nextState are 2D textures that's trivial: just output a -1.0 to +1.0 quad from the vertex shader and all pixels in nextState will be rendered to.

    If you’re using a 3D texture then it's the same thing, except you can only render to 4 layers at a time (well, gl.getParameter(gl.MAX_DRAW_BUFFERS) layers), so you’d have to do something like

    for (let layer = 0; layer < numLayers; layer += 4) {
      // setup framebuffer to use these 4 layers
      gl.drawXXX(...);  // draw to 4 layers
    }

    or better

    // at init time
    const fbs = [];
    for (let layer = 0; layer < numLayers; layer += 4) {
      fbs.push(createFramebufferForThese4Layers(layer));
    }

    // at draw time
    fbs.forEach((fb, ndx) => {
      gl.bindFramebuffer(gl.FRAMEBUFFER, fb);
      gl.drawXXX(...);  // draw to 4 layers
    });

    I’m guessing multiple draw calls are slower than one draw call, so another solution is to instead treat a 2D texture as a 3D array and calculate texture coordinates appropriately.

I don’t know which is better. If you’re simulating particles and they only need to look at their own currentState then transform feedback is easier. If you need each particle to be able to look at the state of other particles, in other words you need random access to all the data, then your only option is to store the data in textures.

As for positions, I don't understand your code. Positions define primitives, either POINTS, LINES, or TRIANGLES, so how does passing integer X, Y values into your vertex shader help you define POINTS, LINES or TRIANGLES?

It looks like you're trying to use POINTS in which case you need to set gl_PointSize to the size of the point you want to draw (1.0) and you need to convert those positions into clip space

gl_Position = vec4((position.xy + 0.5) / resolution * 2.0 - 1.0, 0, 1);

where resolution is the size of the texture.

But doing it this way will be slow. It's much better to just draw a full-size (-1 to +1) clip space quad. For every pixel in the destination the fragment shader will be called. gl_FragCoord.xy will be the location of the center of the pixel currently being rendered, so for the first pixel in the bottom left corner gl_FragCoord.xy will be (0.5, 0.5). The pixel to the right of that will be (1.5, 0.5), the one to the right of that (2.5, 0.5), and so on. You can use that value to calculate how to access currentState. Assuming a 1x1 mapping the easiest way would be

int n = numberOfLayerThatsAttachedToCOLOR_ATTACHMENT0;
vec4 currentStateValueForLayerN = texelFetch(
    currentStateTexture, ivec3(gl_FragCoord.xy, n + 0), 0);
vec4 currentStateValueForLayerNPlus1 = texelFetch(
    currentStateTexture, ivec3(gl_FragCoord.xy, n + 1), 0);
vec4 currentStateValueForLayerNPlus2 = texelFetch(
    currentStateTexture, ivec3(gl_FragCoord.xy, n + 2), 0);
...

vec4 nextStateForLayerN = computeNextStateFromCurrentState(currentStateValueForLayerN);
vec4 nextStateForLayerNPlus1 = computeNextStateFromCurrentState(currentStateValueForLayerNPlus1);
vec4 nextStateForLayerNPlus2 = computeNextStateFromCurrentState(currentStateValueForLayerNPlus2);
...

outColor[0] = nextStateForLayerN;
outColor[1] = nextStateForLayerNPlus1;
outColor[2] = nextStateForLayerNPlus2;
...

I don’t know if you needed this but just to test here’s a simple example that renders a different color to every pixel of a 4x4x4 texture and then displays them.

const pointVS = `#version 300 es
uniform int size;
uniform highp sampler3D tex;
out vec4 v_color;
void main() {
  int x = gl_VertexID % size;
  int y = (gl_VertexID / size) % size;
  int z = gl_VertexID / (size * size);
  v_color = texelFetch(tex, ivec3(x, y, z), 0);
  gl_PointSize = 8.0;
  vec3 normPos = vec3(x, y, z) / float(size);
  gl_Position = vec4(
      mix(-0.9, 0.6, normPos.x) + mix(0.0, 0.3, normPos.y),
      mix(-0.6, 0.9, normPos.z) + mix(0.0, -0.3, normPos.y),
      0,
      1);
}
`;
const pointFS = `#version 300 es
precision highp float;
in vec4 v_color;
out vec4 outColor;
void main() {
  outColor = v_color;
}
`;
const rtVS = `#version 300 es
in vec4 position;
void main() {
  gl_Position = position;
}
`;
const rtFS = `#version 300 es
precision highp float;
uniform vec2 resolution;
out vec4 outColor[4];
void main() {
  vec2 xy = gl_FragCoord.xy / resolution;
  outColor[0] = vec4(1, 0, xy.x, 1);
  outColor[1] = vec4(0.5, xy.yx, 1);
  outColor[2] = vec4(xy, 0, 1);
  outColor[3] = vec4(1, vec2(1) - xy, 1);
}
`;

function main() {
  const gl = document.querySelector('canvas').getContext('webgl2');
  if (!gl) {
    return alert('need webgl2');
  }
  const pointProgramInfo = twgl.createProgramInfo(gl, [pointVS, pointFS]);
  const rtProgramInfo = twgl.createProgramInfo(gl, [rtVS, rtFS]);
  const size = 4;
  const numPoints = size * size * size;
  const tex = twgl.createTexture(gl, {
    target: gl.TEXTURE_3D,
    width: size,
    height: size,
    depth: size,
  });
  const clipspaceFullSizeQuadBufferInfo = twgl.createBufferInfoFromArrays(gl, {
    position: {
      data: [-1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1],
      numComponents: 2,
    },
  });
  const fb = gl.createFramebuffer();
  gl.bindFramebuffer(gl.FRAMEBUFFER, fb);
  for (let i = 0; i < 4; ++i) {
    gl.framebufferTextureLayer(
        gl.FRAMEBUFFER,
        gl.COLOR_ATTACHMENT0 + i,
        tex,
        0,   // mip level
        i);  // layer
  }
  gl.drawBuffers([
    gl.COLOR_ATTACHMENT0,
    gl.COLOR_ATTACHMENT1,
    gl.COLOR_ATTACHMENT2,
    gl.COLOR_ATTACHMENT3,
  ]);
  gl.viewport(0, 0, size, size);
  gl.useProgram(rtProgramInfo.program);
  twgl.setBuffersAndAttributes(gl, rtProgramInfo, clipspaceFullSizeQuadBufferInfo);
  twgl.setUniforms(rtProgramInfo, {
    resolution: [size, size],
  });
  twgl.drawBufferInfo(gl, clipspaceFullSizeQuadBufferInfo);

  gl.bindFramebuffer(gl.FRAMEBUFFER, null);
  gl.viewport(0, 0, gl.canvas.width, gl.canvas.height);
  gl.drawBuffers([gl.BACK]);
  gl.useProgram(pointProgramInfo.program);
  twgl.setUniforms(pointProgramInfo, {
    tex,
    size,
  });
  gl.drawArrays(gl.POINTS, 0, numPoints);
}
main();

<canvas></canvas>
<script src="https://twgljs.org/dist/4.x/twgl-full.min.js"></script>

Several arithmetic operations parallelized in C++Amp

You're on the right track, but doing in-place manipulation of arrays on a GPU is tricky because you cannot guarantee the order in which different elements are updated.

Here's an example of something very similar. The ApplyColorSimplifierTiledHelper method contains an AMP-restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled calculates a new value for each pixel in destFrame based on the value of the pixels surrounding the corresponding pixel in srcFrame. This solves the race condition issue present in your code.

This code comes from the CodePlex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using arrays, textures, tiled/untiled models, and multiple GPUs. The C++ AMP book discusses the implementation in some detail.

void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
                                     array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
    const float_3 W(ImageUtils::W);

    assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);

    tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>
        computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
        (tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx)
        restrict(amp)
    {
        SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
    });
}

void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame,
                   array<ArgbPackedPixel, 2>& destFrame, index<2> idx,
                   UINT neighborWindow, const float_3& W) restrict(amp)
{
    const int shift = neighborWindow / 2;
    float sum = 0;
    float_3 partialSum;
    const float standardDeviation = 0.025f;
    const float k = -0.5f / (standardDeviation * standardDeviation);

    const int idxY = idx[0] + shift; // Corrected index for border offset.
    const int idxX = idx[1] + shift;
    const int y_start = idxY - shift;
    const int y_end = idxY + shift;
    const int x_start = idxX - shift;
    const int x_end = idxX + shift;

    RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));

    for (int y = y_start; y <= y_end; ++y)
        for (int x = x_start; x <= x_end; ++x)
        {
            if (x != idxX || y != idxY) // don't apply the filter to the requested index, only to the neighbors
            {
                RgbPixel clr = UnpackPixel(srcFrame(y, x));
                float distance = ImageUtils::GetDistance(orgClr, clr, W);
                float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
                sum += value;
                partialSum.r += clr.r * value;
                partialSum.g += clr.g * value;
                partialSum.b += clr.b * value;
            }
        }

    RgbPixel newClr;
    newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
    newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
    newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
    destFrame(idxY, idxX) = PackPixel(newClr);
}

The code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, as C++ AMP does not support char. If your problem is small enough to fit into a texture then you may want to look at using a texture instead of an array, as the pack/unpack is implemented in hardware on the GPU and so is effectively "free"; here you have to pay for it with additional compute. There is also an example of this implementation on CodePlex.

typedef unsigned long ArgbPackedPixel;

struct RgbPixel
{
    unsigned int r;
    unsigned int g;
    unsigned int b;
};

const int fixedAlpha = 0xFF;

inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp)
{
    return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}

inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp)
{
    RgbPixel rgb;
    rgb.b = packedArgb & 0xFF;
    rgb.g = (packedArgb & 0xFF00) >> 8;
    rgb.r = (packedArgb & 0xFF0000) >> 16;
    return rgb;
}

Efficiently analyze dominant color in UIImage

Downscaling an image requires looking at each pixel so you can pick a new pixel that is closest to the average color of some group of neighbors. The reason this appears to happen so fast compared to your implementation of iterating through all the pixels is that CoreGraphics hands the scaling task off to the GPU hardware, whereas your approach uses the CPU to iterate through each pixel, which is much slower.

So the thing you need to do is write some GPU-based code to scan through your original image, look at each pixel, and tally up the color counts as you go. This has the advantage not only of being very fast, but you'll also get an accurate count of colors. As I mentioned, downsampling produces pixels that are color averages, so you won't end up with reliably correct color counts that correlate to your original image (unless you happen to be downscaling solid colors; in the typical case you'll end up with something other than what you started with).
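
For illustration, a plain CPU version of that tally (my own sketch; the point of the answer is that the same loop belongs in a GPU kernel) could look like this:

import Foundation

// Tally exact color counts from a tightly packed RGBA8 byte buffer.
func colorCounts(rgbaPixels: [UInt8]) -> [UInt32: Int] {
    var counts: [UInt32: Int] = [:]
    var i = 0
    while i + 3 < rgbaPixels.count {
        // Pack R, G and B into one key; alpha is ignored for the tally.
        let key = (UInt32(rgbaPixels[i]) << 16) |
                  (UInt32(rgbaPixels[i + 1]) << 8) |
                   UInt32(rgbaPixels[i + 2])
        counts[key, default: 0] += 1
        i += 4
    }
    return counts
}

// The dominant color is then simply the key with the largest count:
// let dominant = colorCounts(rgbaPixels: pixels).max { $0.value < $1.value }?.key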

I recommend looking into Apple's Metal framework for an API that lets you write code directly for the GPU. It'll be a challenge to learn, but I think you'll find it interesting and when you're done your code will scan original images extremely fast without having to go through any extra downsampling effort.


