How to Spawn Threads on Different CPU Cores

How do I spawn threads on different CPU cores?

Don't bother doing that.

Instead, use the thread pool. The thread pool is a mechanism (actually a class) of the framework that you can ask for a new thread.

When you ask for a new thread, it will either give you a new one or enqueue the work until a thread gets freed. That way, the framework is in charge of deciding whether or not it should create more threads, depending on the number of CPUs present.

Edit: In addition, as it has been already mentioned, the OS is in charge of distributing the threads among the different CPUs.
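The queue-then-reuse behaviour described above is the same across frameworks; here is a minimal sketch with Python's concurrent.futures.ThreadPoolExecutor, used purely to illustrate the pool concept (not the .NET API itself):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def worker(n):
    # Report which pool thread handled this task.
    return n, threading.current_thread().name

# A pool with 2 workers: submitting 6 tasks queues the extras
# until a worker thread becomes free again.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(worker, i) for i in range(6)]
    results = [f.result() for f in futures]

# All 6 tasks were served by at most 2 distinct threads.
thread_names = {name for _, name in results}
print(sorted(n for n, _ in results))   # [0, 1, 2, 3, 4, 5]
```

No task ever picks its own core; the pool hands work to threads, and the OS places those threads on CPUs.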

In what situations does single-core threading beat threads across cores?

If the logic is CPU-bound, you only want a single thread per core, because multiple CPU-bound threads on the same core lead to waste due to context switching, cold caches, etc. If the thread isn't CPU-bound but I/O-bound, it can be beneficial to use multiple threads per core. But this depends on the architecture; e.g. in thread-per-core architectures like Seastar/Scylla, you still want a single thread per core.
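The sizing rule above can be sketched in Python (the oversubscription factor of 4 for the I/O-bound pool is an arbitrary illustration, not a tuned value):

```python
import os
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 1

# CPU-bound work: one worker per core, so runnable threads don't
# fight over the same core (context switches, cold caches).
with ThreadPoolExecutor(max_workers=cores) as cpu_pool:
    squares = list(cpu_pool.map(lambda n: n * n, range(8)))

# I/O-bound work: threads mostly sit blocked waiting on I/O, so it
# can pay to oversubscribe the cores.
with ThreadPoolExecutor(max_workers=cores * 4) as io_pool:
    echoed = list(io_pool.map(str, range(8)))
```

(In CPython specifically, the GIL means CPU-bound *threads* don't scale across cores anyway; the sizing logic carries over directly to runtimes without a GIL.)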

Java doesn't do anything to determine how threads get mapped to cores. That is a task for the OS, since Java threads are tied to native threads.

In Java, there is no out-of-the-box solution to pin threads to cores. But you can use taskset for that, or use one of Peter Lawrey's libraries:

https://github.com/OpenHFT/Java-Thread-Affinity
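To see what pinning actually does, here is a minimal Python sketch (Linux-only; os.sched_setaffinity does not exist on macOS or Windows). Restricting the whole process to one core like this is roughly what running the JVM under `taskset -c 0` does from the shell:

```python
import os

# Linux-only sketch: pin the process to a single core, then restore.
if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)   # current allowed CPU set
    first = min(original)                # pick one allowed core
    os.sched_setaffinity(0, {first})     # pin the process to it
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, original)    # restore the old mask
    print(pinned)                        # a set with just that one core
```

The Java-Thread-Affinity library linked above is the idiomatic choice for doing this per-thread from inside a JVM.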

How to assign threads to different cores in C?

You don't need to, don't want to, and mustn't (I don't even know whether you somehow can) manage hardware resources at such a low level. That's a job for your OS and, in part, for the standard libraries: they have been tested, optimized, and standardized properly.

I doubt you can do better. If you could do what you are describing, either you are an expert hardware/OS programmer or you are destroying decades of work :) .

Also consider this fact: your code would no longer be portable if you indexed the cores manually, since it would depend on the number of cores of your machine.

On the other hand, multithreaded programs should work (and sometimes even work better) even with only one core. An example is the case where one of the threads doesn't do anything until an event happens: you can make that thread go to "sleep" so that only the other threads use the CPU; when the event happens, it will execute. In a non-multithreaded program, polling is generally used instead, which burns CPU time doing nothing.

Also, as @yano said, your multithreaded program is not really parallel in this case, since you are creating a thread and then waiting for it to finish with pthread_join before starting the next one.
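The fix is to start all threads first and only then join them; a Python sketch of the same pattern (illustrative only — the question itself is about pthreads, where the structure is identical):

```python
import threading
import time

def work(i, out):
    time.sleep(0.1)   # stand-in for real work that releases the CPU
    out.append(i)

out = []
threads = [threading.Thread(target=work, args=(i, out)) for i in range(4)]

start = time.time()
# Start ALL threads first, THEN join them: their work overlaps.
# Starting one thread and joining it immediately before starting
# the next would serialize them.
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start   # roughly 0.1 s here, not 4 * 0.1 s
```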

Threads can run on different processors or cores for both Task.Factory.StartNew and Parallel.Invoke

It's actually pretty easy to answer.

Task.Run()

Queues the specified work to run on the ThreadPool ....

Task Parallel Library

... In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, ....

Using the same ThreadPool, how is it possible for the ThreadPool to determine the type of task in order to limit the CPU? Either they both run on all processors, or they both run on a single processor.

Extra Credit:

This raises the question: is the ThreadPool multi-core aware?

The answer is, surprisingly, that it doesn't care. The ThreadPool asks the operating system for a thread (just like any C# application that uses new Thread()); it is actually the responsibility of the OS. I think it should be pretty clear by now that, with all this abstraction, even suggesting that C# can by default limit how threads are used is a pretty ridiculous assertion. (Yes, you can run a thread on whatever core you want, etc., but that is not how the ThreadPool works by default.)

I highly recommend reading StartNew is Dangerous... TLDR? Use Task.Run().

How to run different threads on different cores?

See the sched_setaffinity function: http://manpages.courier-mta.org/htmlman2/sched_setaffinity.2.html
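The same syscall is exposed in Python as os.sched_setaffinity, and on Linux it also accepts a native thread id, so each thread can pin itself to its own core. A minimal sketch (Linux-only; threading.get_native_id() needs Python 3.8+):

```python
import os
import threading

def pin_and_work(core, report):
    tid = threading.get_native_id()       # kernel id of this thread
    os.sched_setaffinity(tid, {core})     # pin this thread to `core`
    report[core] = os.sched_getaffinity(tid)

report = {}
if hasattr(os, "sched_getaffinity"):
    # Use up to 4 of the cores this process is allowed to run on.
    cores = sorted(os.sched_getaffinity(0))[:4]
    threads = [threading.Thread(target=pin_and_work, args=(c, report))
               for c in cores]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(report)   # each thread now sees only its own core
```

In C you would use pthread_setaffinity_np or the raw sched_setaffinity call from the man page above; the structure is the same.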

How can I run 4 threads each on a different core (parallelism)?

You're done, no need to schedule anything. As long as there are multiple processors available, your threads will run simultaneously on available cores.

If there are fewer than 4 processors available, say 2, your threads will run in an interleaved manner, with up to 2 running at any given time.


P.S. It's also easy to experience this for yourself: just write 4 infinite loops and run them in 4 different threads. You will see 4 CPUs being used.


DISCLAIMER: Of course, "under the hood", scheduling is being done for you by the OS. So you depend on the quality of the scheduler built into the OS for concurrency. The fairness of the scheduler built into the OS on which a C++ application runs is outside the C++ standard, and so is not guaranteed. In reality though, especially when learning to write concurrent applications, most modern OSes will provide adequate fairness in the scheduling of threads.

System.Threading and number of threads in a CPU core

Don't create threads just to wait for I/O to complete. The async/await pattern will help you, by waiting for the I/O to complete, whilst freeing up those threads to do other useful work.
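The same idea in a minimal Python/asyncio sketch (call_service is a made-up stand-in for a real I/O call; the .NET async/await pattern works analogously):

```python
import asyncio
import time

async def call_service(i):
    # Simulated I/O wait: while this "request" is pending, the event
    # loop runs the other requests; no extra threads are created.
    await asyncio.sleep(0.1)
    return i

async def main():
    # 10 concurrent calls complete together, not one after another.
    return await asyncio.gather(*(call_service(i) for i in range(10)))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start   # roughly 0.1 s, not 10 * 0.1 s
```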

The other thing to bear in mind is that if you call your other service 10 times (or 100 or 1000) at the same time, if that service is also waiting for I/O, then it's possible that you could still take 16 minutes to complete your tasks.

How can a multithreaded Python program run on different CPU cores simultaneously despite the GIL?

https://docs.python.org/3/library/math.html

The math module consists mostly of thin wrappers around the platform C math library functions.

While Python itself can only execute a single instruction at a time, a low-level C function called by Python does not have this limitation.

So it's not Python that is using multiple cores, but your system's well-optimized math library that is wrapped by Python's math module.

That basically answers both your questions.

Regarding the usefulness of multiprocessing: it is still useful in those cases where you're trying to parallelize pure Python code, or code that does not call libraries that already use multiple cores.
However, it comes with inter-process communication (IPC) overhead that may or may not be larger than the performance gain you get from using multiple cores. Tuning IPC is therefore often crucial for multiprocessing in Python.


