Utilizing the GPU with C#

Run C# code on GPU

1) No, not for the general case of C#, although something could obviously be created for some subset of the language.

2) Yes: HLSL with DirectX, or GLSL with OpenGL.

3) Not generally possible. CPU and GPU coding are fundamentally different.

Basically, you can't think of CPU and GPU coding as being comparable. A GPU is a highly specialised parallel processing tool, suited to running lots of simple calculations in parallel.

Trying to write a general program on a GPU with lots of branches etc. just won't be efficient, and may not even be possible.

Their memory access architectures are totally different.

You should write for the CPU but farm out appropriate parallel computations to the GPU.

Process a heavy function by GPU in C#

Try using one of the .NET CIL/GPU compilers: Alea GPU (http://www.aleagpu.com) or Altimesh (http://www.altimesh.com/). They work but have some limitations. I have been developing Campy (http://campynet.com/), a free and open-source compiler that will work with most .NET frameworks, but it's not yet ready. You will need to parallelize your code in any case. You could use Cudafy (https://archive.codeplex.com/?p=cudafy), but I would not recommend it: it has not been updated since April 2015 and does not support CUDA 8 or 9.

Does C# natively use GPU for graphics?

Graphics.DrawLine is a GDI+ call. If you're using Windows Forms and doing your drawing with the System.Drawing classes, you're using GDI+, which is not hardware-accelerated. To get hardware acceleration, you need to use WPF in place of WinForms or draw with Direct3D/Direct2D. The latter two (Direct3D/Direct2D) are COM-based, so you'll need a .NET wrapper. Microsoft wrapped Direct3D for .NET with Managed DirectX followed by XNA. Both (I believe) are now deprecated. There are also third-party wrappers for the DirectX libraries that are more up-to-date.

Edit: I just learned from @HansPassant's comment that GDI+ is 2D accelerated. I thought that only applied to GDI (as opposed to GDI+) because GDI+ handles things like antialiasing that (as I understood it) 2D hardware didn't do. But apparently I was wrong.

C#: Perform Operations on GPU, not CPU (Calculate Pi)

It's a very new technology, but you might investigate CUDA. Since your question is tagged with C#, here is a .NET wrapper.

As a bonus, it appears that your 8800 GTX supports CUDA.

C# OpenCL GPU implementation for double array math

Your problem as written does not fit well with something that would work on a GPU. You cannot parallelize (in a way that improves performance) the operation on a single array because the value of the nth element depends on elements 1 to n. However, you can utilize the GPU to process multiple arrays, where each GPU core operates on a separate array.
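To make the dependency concrete, the update performed by the Calculate function and the Calc kernel in the code at the end of this answer can be written as a recurrence. The symbols x_i (input element), s_i (output element) and \alpha (the code's coeff / periodFactor) are notation introduced here for illustration:

```latex
\[
s_0 = x_0, \qquad
s_i = s_{i-1} + \alpha \,(x_i - s_{i-1}) \quad (i \ge 1), \qquad
\alpha = \frac{2}{1 + \text{period}}
\]
```

Each s_i needs s_{i-1}, so the loop over one array runs in order, while separate arrays share no state and can be processed independently.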

The full code for the solution is at the end of the answer. The test, which runs the calculation on 10,000 arrays of up to 10,000 elements each, produces the following output (on a GTX1080M and an i7 7700k with 32GB RAM):

Task Generating Data: 1096.4583ms
Task CPU Single Thread: 596.2624ms
Task CPU Parallel: 179.1717ms
GPU CPU->GPU: 89ms
GPU Execute: 86ms
GPU GPU->CPU: 29ms
Task Running GPU: 921.4781ms
Finished

In this test, we measure the speed at which we can generate results into a managed C# array using the CPU with one thread, the CPU with all threads, and finally the GPU using all cores. We validate that the results from each test are identical, using the function AreTheSame.

The fastest time is processing the arrays on the CPU using all threads (Task CPU Parallel: 179ms).

The GPU is actually the slowest (Task Running GPU: 921ms), but this is because of the time taken to reformat the C# arrays so that they can be transferred onto the GPU.

If this bottleneck were removed (which is quite possible, depending on your use case), the GPU could potentially be the fastest. If the data were already formatted in a manner that can be transferred onto the GPU immediately, the total processing time for the GPU would be 204ms (CPU->GPU: 89ms + Execute: 86ms + GPU->CPU: 29ms = 204ms). This is still slower than the parallel CPU option, but on a different sort of data set, it might be faster.

To get the data back from the GPU (the most important part of actually using the GPU), we use the function ComputeCommandQueue.Read. This transfers the altered array on the GPU back to the CPU.

To run the following code, reference the Cloo NuGet package (I used 0.9.1) and make sure to compile for x64 (you will need the memory). You may also need to update your graphics card driver if it fails to find an OpenCL device.

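using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using Cloo;

class Program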
{
    static string CalculateKernel
    {
        get
        {
            return @"
            kernel void Calc(global int* offsets, global int* lengths, global double* doubles, double periodFactor) 
            {
                int id = get_global_id(0);
                int start = offsets[id];
                int length = lengths[id];
                int end = start + length;
                double sum = doubles[start];

                for(int i = start; i < end; i++)
                {
                    sum = sum + periodFactor * ( doubles[i] - sum );
                    doubles[i] = sum;
                }
            }";
        }
    }

    public static double[] Calculate(double[] num, int period)
    {
        var final = new double[num.Length];
        double sum = num[0];
        double coeff = 2.0 / (1.0 + period);

        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            final[i] = sum;
        }

        return final;
    }

    static void Main(string[] args)
    {

        int maxElements = 10000;
        int numArrays = 10000;
        int computeCores = 2048;

        double[][] sets = new double[numArrays][];

        using (Timer("Generating Data"))
        {
            Random elementRand = new Random(1);
            for (int i = 0; i < numArrays; i++)
            {
                sets[i] = GetRandomDoubles(elementRand.Next((int)(maxElements * 0.9), maxElements), randomSeed: i);
            }
        }

        int period = 14;

        double[][] singleResults;
        using (Timer("CPU Single Thread"))
        {
            singleResults = CalculateCPU(sets, period);
        }

        double[][] parallelResults;
        using (Timer("CPU Parallel"))
        {
            parallelResults = CalculateCPUParallel(sets, period);
        }

        if (!AreTheSame(singleResults, parallelResults)) throw new Exception();

        double[][] gpuResults;
        using (Timer("Running GPU"))
        {
            gpuResults = CalculateGPU(computeCores, sets, period);
        }

        if (!AreTheSame(singleResults, gpuResults)) throw new Exception();

        Console.WriteLine("Finished");
        Console.ReadKey();
    }

    public static bool AreTheSame(double[][] a1, double[][] a2)
    {
        if (a1.Length != a2.Length) return false;
        for (int i = 0; i < a1.Length; i++)
        {
            var ar1 = a1[i];
            var ar2 = a2[i];
            if (ar1.Length != ar2.Length) return false;
            for (int j = 0; j < ar1.Length; j++)
                if (Math.Abs(ar1[j] - ar2[j]) > 0.0000001) return false;

        }
        return true;
    }

    public static double[][] CalculateGPU(int partitionSize, double[][] sets, int period)
    {
        ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);

        ComputeProgram program = new ComputeProgram(context, new string[] { CalculateKernel });
        program.Build(null, null, null, IntPtr.Zero);

        ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

        ComputeEventList events = new ComputeEventList();

        ComputeKernel kernel = program.CreateKernel("Calc");

        double[][] results = new double[sets.Length][];

        double periodFactor = 2d / (1d + period);

        Stopwatch sendStopWatch = new Stopwatch();
        Stopwatch executeStopWatch = new Stopwatch();
        Stopwatch recieveStopWatch = new Stopwatch();

        int offset = 0;
        while (true)
        {
            int first = offset;
            int last = Math.Min(offset + partitionSize, sets.Length);
            int length = last - first;

            var merged = Merge(sets, first, length);

            sendStopWatch.Start();

            ComputeBuffer<int> offsetBuffer = new ComputeBuffer<int>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Offsets);

            ComputeBuffer<int> lengthsBuffer = new ComputeBuffer<int>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Lengths);

            ComputeBuffer<double> doublesBuffer = new ComputeBuffer<double>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Doubles);

            kernel.SetMemoryArgument(0, offsetBuffer);
            kernel.SetMemoryArgument(1, lengthsBuffer);
            kernel.SetMemoryArgument(2, doublesBuffer);
            kernel.SetValueArgument(3, periodFactor);

            sendStopWatch.Stop();

            executeStopWatch.Start();

            commands.Execute(kernel, null, new long[] { merged.Lengths.Length }, null, events);

            executeStopWatch.Stop();

            using (var pin = Pinned(merged.Doubles))
            {
                recieveStopWatch.Start();
                commands.Read(doublesBuffer, false, 0, merged.Doubles.Length, pin.Address, events);
                commands.Finish();
                recieveStopWatch.Stop();
            }

            for (int i = 0; i < merged.Lengths.Length; i++)
            {
                int len = merged.Lengths[i];
                int off = merged.Offsets[i];

                var res = new double[len];
                Array.Copy(merged.Doubles,off,res,0,len);

                results[first + i] = res;
            }

            offset += partitionSize;
            if (offset >= sets.Length) break;
        }

        // sendStopWatch times buffer creation (CPU->GPU); recieveStopWatch times the read back (GPU->CPU)
        Console.WriteLine("GPU CPU->GPU: " + sendStopWatch.ElapsedMilliseconds + "ms");
        Console.WriteLine("GPU Execute: " + executeStopWatch.ElapsedMilliseconds + "ms");
        Console.WriteLine("GPU GPU->CPU: " + recieveStopWatch.ElapsedMilliseconds + "ms");

        return results;
    }

    public static PinnedHandle Pinned(object obj) => new PinnedHandle(obj);
    public class PinnedHandle : IDisposable
    {
        public IntPtr Address => handle.AddrOfPinnedObject();
        private GCHandle handle;
        public PinnedHandle(object val)
        {
            handle = GCHandle.Alloc(val, GCHandleType.Pinned);
        }
        public void Dispose()
        {
            handle.Free();
        }
    }

    public class MergedResults
    {
        public double[] Doubles { get; set; }
        public int[] Lengths { get; set; }
        public int[] Offsets { get; set; }
    }

    public static MergedResults Merge(double[][] sets, int offset, int length)
    {
        List<int> lengths = new List<int>(length);
        List<int> offsets = new List<int>(length);

        for (int i = 0; i < length; i++)
        {
            var arr = sets[i + offset];
            lengths.Add(arr.Length);
        }
        var totalLength = lengths.Sum();

        double[] doubles = new double[totalLength];
        int dataOffset = 0;
        for (int i = 0; i < length; i++)
        {
            var arr = sets[i + offset];
            Array.Copy(arr, 0, doubles, dataOffset, arr.Length);
            offsets.Add(dataOffset);
            dataOffset += arr.Length;
        }

        return new MergedResults()
        {
            Doubles = doubles,
            Lengths = lengths.ToArray(),
            Offsets = offsets.ToArray(),
        };
    }

    public static IDisposable Timer(string name)
    {
        return new SWTimer(name);
    }

    public class SWTimer : IDisposable
    {
        private Stopwatch _sw;
        private string _name;
        public SWTimer(string name)
        {
            _name = name;
            _sw = Stopwatch.StartNew();
        }
        public void Dispose()
        {
            _sw.Stop();
            Console.WriteLine("Task " + _name + ": " + _sw.Elapsed.TotalMilliseconds + "ms");
        }

    }

    public static double[][] CalculateCPU(double[][] arrays, int period)
    {
        double[][] results = new double[arrays.Length][];
        for (var index = 0; index < arrays.Length; index++)
        {
            var arr = arrays[index];
            results[index] = Calculate(arr, period);
        }
        return results;
    }

    public static double[][] CalculateCPUParallel(double[][] arrays, int period)
    {
        double[][] results = new double[arrays.Length][];
        Parallel.For(0, arrays.Length, i =>
         {
             var arr = arrays[i];
             results[i] = Calculate(arr, period);
         });
        return results;
    }

    static double[] GetRandomDoubles(int num, int randomSeed)
    {
        Random r = new Random(randomSeed);
        var res = new double[num];
        for (int i = 0; i < num; i++)
            res[i] = r.NextDouble() * 0.9 + 0.05;
        return res;
    }
}

Comparing two images using the GPU in C#

Think very carefully if you want to solve this problem on the GPU. You have to transfer the data to the GPU and configure and start a GPU program. This costs some time. And since the GPU program will be very small, the overhead might be larger than the performance gain. So first try to parallelize your comparison on the CPU. A GPU implementation might be even slower (but could also be faster).
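As a rough sketch of the "parallelize on the CPU first" suggestion, the following counts differing bytes between two pixel buffers using Parallel.For with thread-local subtotals. The PixelCompare class, the CountDifferences name and the assumption that both images are already available as equally sized byte arrays are all hypothetical, chosen only for illustration; how you extract the bytes from your images is not shown.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
public static class PixelCompare
{
    // Returns the number of positions at which the two buffers differ.
    // Assumes both images have already been decoded into byte arrays
    // with the same pixel format and dimensions.
    public static long CountDifferences(byte[] a, byte[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length != b.Length)
            throw new ArgumentException("Buffers must have the same length.");

        long total = 0;

        Parallel.For(
            0, a.Length,
            // Each worker starts with its own subtotal of zero.
            () => 0L,
            // Compare one position and update the worker-local subtotal.
            (i, loopState, local) => a[i] != b[i] ? local + 1 : local,
            // Merge each worker's subtotal into the shared total exactly once.
            local => Interlocked.Add(ref total, local));

        return total;
    }
}
```

Calling PixelCompare.CountDifferences(first, second) returns 0 for identical buffers. Invoking a delegate per byte is not free, so for large images it may be worth measuring this against a plain loop, or against a version that hands each worker a contiguous range, before reaching for the GPU.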

There are lots of APIs which let you program the GPU. None of them contain directly usable C# code as far as I know.

Firstly, there are the graphics libraries (DirectX, OpenGL). They let you program the GPU via Compute Shaders. There are C# bindings for these APIs, such as SharpDX, SlimDX, OpenTK. However, Compute Shaders require a graphics card that is capable of these shader stages.

And then there are the pure computation APIs, such as CUDA or OpenCL. CUDA is restricted to NVIDIA GPUs. OpenCL should run on most hardware. These APIs also place some requirements on the graphics device, though they are usually lower than those for Compute Shaders. OpenTK is also a C# wrapper for OpenCL.

Coding CUDA with C#?

ManagedCuda is a nice, complete wrapper for CUDA 4.2.
Simply add a C++ CUDA project to the solution that contains your C# project, then add

call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_21 -m 64 -o "$(ProjectDir)bin\Debug\%%~na_64.ptx" "$(ProjectDir)Kernels\%%~na.cu"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_21 -m 32 -o "$(ProjectDir)bin\Debug\%%~na.ptx" "$(ProjectDir)Kernels\%%~na.cu"

to the post-build events in your C# project properties. This compiles each kernel to a *.ptx file in your C# project's output directory.

Then you simply need to create a new context, load the module from the file, load the function, and work with the device.

// Create a new context
CudaContext cntxt = new CudaContext();

// Load the module from the precompiled .ptx in the project output folder
CUmodule cumodule = cntxt.LoadModule("kernel.ptx");

// _Z9addKernelPf is the mangled function name; it can be found in the *.ptx file
CudaKernel addWithCuda = new CudaKernel("_Z9addKernelPf", cumodule, cntxt);

// Create a device array for the data
// (num and the cData2 type are defined elsewhere and not shown here)
CudaDeviceVariable<cData2> vec1_device = new CudaDeviceVariable<cData2>(num);

// Create the host array with the data
cData2[] vec1 = new cData2[num];

// Copy the data to the device
vec1_device.CopyToDevice(vec1);

// Set grid and block dimensions
addWithCuda.GridDimensions = new dim3(8, 1, 1);
addWithCuda.BlockDimensions = new dim3(512, 1, 1);

// Run the kernel
// (vec2_device and vec3_device are not shown; presumably they are created like vec1_device)
addWithCuda.Run(
    vec1_device.DevicePointer,
    vec2_device.DevicePointer,
    vec3_device.DevicePointer);

// Copy the data back from the device
vec1_device.CopyToHost(vec1);

C# ML.Net Image classification: Does GPU acceleration help improve the performance of predictions and how can I tell if it is?

It's likely a version mismatch.

TensorFlow supports CUDA® 10.1 (TensorFlow >= 2.1.0)

https://www.tensorflow.org/install/gpu

You can check your output window for reasons why it would not be connecting to your GPU.
