How to Use More Than One Processor Group for My Threads in a C# App

Unable to use more than one processor group for my threads in a C# app

The bug has been fixed by a new HP BIOS (still unpublished at the time of writing).

The new BIOS (targeting the HP ProLiant DL360 and DL380 Gen9) introduces a new setting, "NUMA Group Size Optimization", with a choice of [Clustered - default] or [Flat]. HP says to set it to Flat.

The screenshot part of this answer was produced on a DL380 instead of a DL360 because of server availability, but I expect the same behavior on a DL360. With the setting at Flat the problem disappeared: we had only one group.

As far as I know, the OS communicates with the BIOS to learn the CPU configuration. The BIOS plays an important role in how the OS presents the available logical processors to applications (processor groups, affinity, etc.).

About the Microsoft documentation "Supporting Systems That Have More Than 64 Processors" and "Processor Groups": it clearly states that more than one processor group is only created when the logical processor (LC) count is greater than 64. On our server (56 LC) with the NUMA architecture set to "Clustered" we had 2 processor groups. A hardware engineer on the HP BIOS development team explained to me that when set to "Clustered", the BIOS fools Windows by padding the real number of logical processors to 72 (the maximum for the E5 v3 family). The real number of LC in our DL360 is 56. That is why we had 2 groups instead of 1. The Microsoft documentation seems accurate. I personally think it would be better to create 1 group per NUMA node whenever possible, but in our case there is a bug. It is hard to know whether HP or Microsoft is at fault when the HP BIOS setting is "Clustered" (the default), but Microsoft does not seem to support that option, which seems to cause our problem.

On the HP BIOS for the DL360 and DL380, setting "Numa Configuration" to "Clustered" (the default) creates 2 groups although there are only 56 logical processors (with hyperthreading). The result is that only one processor group is visible at a time to any application. This is probably also due to HP fooling Windows by padding in a fake number of logical processors. It sounds like Microsoft does not expect that. Our C# app can't run on the 2 groups. It's hard to blame Microsoft for that behavior when HP does something they can't anticipate. Perhaps one day we will see Windows supporting multiple groups when LC <= 64.
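To see what Windows actually reports on a given machine, the group layout can be queried directly. The following is a minimal sketch, assuming a Windows host; it calls the kernel32 functions GetActiveProcessorGroupCount and GetActiveProcessorCount through P/Invoke. The class and method names (ProcessorGroupInfo, GetProcessorsPerGroup) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Runtime.InteropServices;

static class ProcessorGroupInfo
{
    // kernel32: number of active processor groups in the system.
    [DllImport("kernel32.dll")]
    static extern ushort GetActiveProcessorGroupCount();

    // kernel32: number of active logical processors in the given group.
    [DllImport("kernel32.dll")]
    static extern uint GetActiveProcessorCount(ushort groupNumber);

    // Hypothetical helper: logical processor count per group.
    // Returns an empty array on non-Windows platforms, where these APIs do not exist.
    public static uint[] GetProcessorsPerGroup()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return Array.Empty<uint>();

        ushort groupCount = GetActiveProcessorGroupCount();
        var counts = new uint[groupCount];
        for (ushort g = 0; g < groupCount; g++)
            counts[g] = GetActiveProcessorCount(g);
        return counts;
    }

    static void Main()
    {
        uint[] counts = GetProcessorsPerGroup();
        Console.WriteLine($"Processor groups: {counts.Length}");
        for (int g = 0; g < counts.Length; g++)
            Console.WriteLine($"  Group {g}: {counts[g]} logical processors");
        Console.WriteLine($"Environment.ProcessorCount: {Environment.ProcessorCount}");
    }
}
```

On the server described above, one would expect this to list 2 groups with the BIOS set to Clustered and 1 group with it set to Flat.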

About Prime95: this CPU stress-test software has good documentation on Wikipedia, which clearly states (in the Limits section) that it will load only one processor group.

Running in Numa Architecture set to Flat

processor affinity group C#

The posted example only sets the processor group of the current thread. But you want to set it for all threads of a process, so you need to call SetProcessAffinityMask for your process.

There is no need to P/Invoke SetProcessAffinityMask, because the Process class already has a ProcessorAffinity property that lets you set it directly.

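using System;
using System.Diagnostics;

class Program
{
    // Each bit of the mask selects one logical processor within the process's
    // current processor group (bit 0 = CPU 0). The bits do not select groups.
    static void SetProcessAffinity(ulong processorMask)
    {
        Process.GetCurrentProcess().ProcessorAffinity = new IntPtr((long)processorMask);
    }
    static void Main(string[] args)
    {
        SetProcessAffinity(1);    // logical processor 0 only
        // Binary literals are a C# 7 feature for which you need VS 2017 or later.
        SetProcessAffinity(0b11); // logical processors 0 and 1
        // All 64 bits enabled. Windows rejects bits for processors that do not
        // exist, so this call fails unless the group really has 64 logical processors.
        SetProcessAffinity(0b1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111);
    }
}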

Are threads executed on multiple processors?

- It appears that the Task class gives us the ability to use multiple processors in the system.
- If threads will/could be executed on multiple cores, then what is so special about task parallelism?

The Task class is just a small, but important, part of the TPL (Task Parallel Library). The TPL is a high-level abstraction, so you don't have to work with threads directly. It encapsulates and hides most of the churn you'd have to implement for any decent multi-threaded application.

Tasks don't, per se, introduce any functionality that you couldn't implement on your own (which, I believe, is the core of your question). They can be synchronous or asynchronous; when they are asynchronous, they use either the Thread class internally or I/O completion ports (IOCP).

Some of the points addressed by TPL are:

- Rethrow exceptions from a child thread on the calling thread.
- Asynchronous code (launch thread -> run arbitrary code while waiting for child thread -> resume when child thread is over) looks as if it were synchronous, greatly improving readability and maintainability.
- Simpler thread cancellation (using CancellationTokenSource).
- Parallel queries/data manipulation using PLINQ or the Parallel class.
- Asynchronous workflows using TPL Dataflow.
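A few of these points can be sketched in a handful of lines. This is only an illustration, not the one way to use the TPL: the class and method names (TplSketch, CatchFromTaskAsync, CancelLongRunningAsync, SumOfSquares) are made up for the example, and the async Main assumes C# 7.1 or later.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class TplSketch
{
    // An exception thrown on a worker thread is rethrown at the await on the caller.
    public static async Task<string> CatchFromTaskAsync()
    {
        try
        {
            await Task.Run(() => int.Parse("not a number"));
            return "no exception";
        }
        catch (FormatException ex)
        {
            return ex.GetType().Name;
        }
    }

    // Cooperative cancellation through a CancellationTokenSource.
    public static async Task<bool> CancelLongRunningAsync()
    {
        using (var cts = new CancellationTokenSource())
        {
            CancellationToken token = cts.Token;
            Task worker = Task.Run(() =>
            {
                while (!token.IsCancellationRequested)
                    Thread.Sleep(10);               // stand-in for real work
                token.ThrowIfCancellationRequested();
            }, token);

            cts.CancelAfter(50);
            try
            {
                await worker;
                return false;
            }
            catch (OperationCanceledException)
            {
                return true;                        // the worker observed the cancellation
            }
        }
    }

    // A PLINQ query: the runtime may spread the work over several cores.
    public static long SumOfSquares(int n) =>
        Enumerable.Range(1, n).AsParallel().Sum(i => (long)i * i);

    static async Task Main()
    {
        Console.WriteLine(await CatchFromTaskAsync());
        Console.WriteLine(await CancelLongRunningAsync());
        Console.WriteLine(SumOfSquares(10));        // 385
    }
}
```

Note that the calling code reads top to bottom like synchronous code, even though the work runs on thread-pool threads.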

C# Core 2.1 Utilize 2 CPUs

It looks like your multiprocessor architecture is NUMA, which means CPUs have different access times to different memory regions. Systems with such an architecture balance the load between CPUs so as to load only those processors that are "closer" (have the lowest access time) to the memory area being operated on. For an ordinary .NET application, the standard memory layout resides in a single memory area, which on NUMA architectures leads to utilizing only the CPUs closest to that area (in your case it may be 2 NUMA nodes with 1 CPU each, of which only 1 is used because it serves the memory area your application uses).

Applications that want to benefit from the NUMA architecture are supposed to use specific APIs which expose, among other calls, functions to indicate in which NUMA node to allocate memory (here is an example of the API functions provided by Windows). The .NET CLR, starting from version 4.5, is able to use this API indirectly through specific configuration settings. On the .NET Framework CLR you need to set the following options in the application configuration file:

<configuration>
   <runtime>
      <gcServer enabled="true"/>
      <GCCpuGroup enabled="true"/>
      <Thread_UseAllCpuGroups enabled="true"/>
   </runtime>
</configuration>

Here gcServer mode controls the NUMA awareness of memory allocations, so that the runtime can support multiple heaps for different NUMA nodes, and Thread_UseAllCpuGroups controls the NUMA awareness of the thread pool. GCCpuGroup is included because, per Microsoft's documentation, Thread_UseAllCpuGroups only takes effect when GCCpuGroup is enabled as well.

For .NET Core runtimes, however, you need to turn on gcServer mode through the runtime options, while Thread_UseAllCpuGroups, which is part of the ThreadPool settings, can be passed via an environment variable with the COMPlus_ prefix (i.e. set COMPlus_Thread_UseAllCpuGroups=1).
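Whether these settings actually reached the process can be checked from inside the application. The sketch below is an assumption about what is worth printing, not an official diagnostic; RuntimeSettingsCheck and Describe are hypothetical names. It relies only on GCSettings.IsServerGC, Environment.ProcessorCount and the environment variable mentioned above.

```csharp
using System;
using System.Runtime;

static class RuntimeSettingsCheck
{
    // Hypothetical helper: summarizes what the running process actually got.
    public static string Describe()
    {
        string useAllGroups =
            Environment.GetEnvironmentVariable("COMPlus_Thread_UseAllCpuGroups") ?? "(not set)";

        return $"Server GC: {GCSettings.IsServerGC}; " +
               $"Environment.ProcessorCount: {Environment.ProcessorCount}; " +
               $"COMPlus_Thread_UseAllCpuGroups: {useAllGroups}";
    }

    static void Main() => Console.WriteLine(Describe());
}
```

If "Server GC" prints False, the gcServer setting was not picked up, and the remaining options are unlikely to have the intended effect.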

How to make my code run on multiple cores?

I'd generalize that writing a highly optimized multi-threaded process is a lot harder than just throwing some threads in the mix.

I recommend starting with the following steps:

- Split up your workloads into discrete, parallel executable units.
- Measure and characterize workload types (network intensive, I/O intensive, CPU intensive, etc.). These become the basis for your worker pooling strategies, e.g. you can have pretty large pools of workers for network intensive applications, but it doesn't make sense to have more workers than hardware threads for CPU intensive tasks.
- Think about queuing/array or ThreadWorkerPool to manage pools of threads. The former gives more fine-grained control than the latter.
- Learn to prefer async I/O patterns over sync patterns if you can; this frees more CPU time to perform other tasks.
- Work to eliminate or at least reduce serialization around contended resources such as disk.
- Minimize I/O, and acquire and hold the minimum level of locks for the minimum period possible. (Reader/writer locks are your friend.)
- Comb through the code to ensure that resources are locked in a consistent sequence to minimize deadly embrace (deadlock).
- Test like crazy. Race conditions and bugs in multithreaded applications are hellish to troubleshoot; often you only see the forensic aftermath of the massacre.
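The first two steps can be sketched for a CPU-bound workload: split the input into independent chunks and cap the number of workers at the number of hardware threads. This is one possible shape using Parallel.For with thread-local state; the names (BoundedWorkers, CountPrimes, IsPrime) and the prime-counting workload are invented for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedWorkers
{
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (long i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    // Counts primes below 'limit' by splitting the range into independent chunks.
    public static int CountPrimes(int limit, int chunkSize = 10_000)
    {
        // CPU-bound work: no more workers than hardware threads.
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
        int chunks = (limit + chunkSize - 1) / chunkSize;
        int total = 0;

        Parallel.For(0, chunks, options,
            () => 0,                                  // per-worker subtotal, no lock needed
            (chunk, state, local) =>
            {
                int start = chunk * chunkSize;
                int end = Math.Min(start + chunkSize, limit);
                for (int n = start; n < end; n++)
                    if (IsPrime(n)) local++;
                return local;
            },
            local => Interlocked.Add(ref total, local)); // one synchronized add per worker

        return total;
    }

    static void Main()
    {
        Console.WriteLine(CountPrimes(100));          // 25
    }
}
```

Keeping a subtotal per worker and merging once at the end also follows the advice about holding locks for the minimum period possible.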

Bear in mind that it is entirely possible for a multi-threaded version to perform worse than a single-threaded version of the same app. There is no substitute for good engineering measurement.
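A measurement does not need to be elaborate to be useful. The following sketch times a sequential and a parallel version of the same small computation with Stopwatch; MeasureFirst and Time are hypothetical names, and the workload is deliberately tiny to show that the parallel version is not guaranteed to win.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class MeasureFirst
{
    // Hypothetical helper: wall-clock time of a single run of 'action'.
    public static TimeSpan Time(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        int[] data = Enumerable.Range(0, 1_000).ToArray();
        long sequential = 0, parallel = 0;

        TimeSpan tSeq = Time(() => sequential = data.Sum(x => (long)x));
        TimeSpan tPar = Time(() => parallel = data.AsParallel().Sum(x => (long)x));

        Console.WriteLine($"Sequential: {sequential} in {tSeq.TotalMilliseconds} ms");
        Console.WriteLine($"Parallel:   {parallel} in {tPar.TotalMilliseconds} ms");
    }
}
```

A single run like this is noisy; repeating the measurement and using realistic input sizes gives a fairer picture.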

Changing the Processor Group of an active process on Windows 2012

I'd like to quote this related MSDN forum answer, "Process Affinity on a System with 128 Processors":

In theory you could create a small driver that uses
KeSetSystemGroupAffinityThread to change the affinity. I say in
theory because since this is a new call and limited documentation it
may not work. Of course once you do it there is the question if the
application will work, I assume you have read
http://download.microsoft.com/download/a/d/f/adf1347d-08dc-41a4-9084-623b1194d4b2/MoreThan64proc.docx
with its warnings about multiple groups and applications that were not
written to take advantage of them.

Also have a look at: Example usage of SetProcessAffinityMask in C++?

How to compile C# for multiple processor machines? (With VS 2010 or csc.exe)

Since it is a cluster, you have to rely on some form of message-passing parallelism; no compiler will transform your code automatically. At least good old MPI is supported: http://osl.iu.edu/research/mpi.net/