How to Scale Threads According to CPU Cores

How to scale threads according to CPU cores?

You can determine the number of processors available to the Java Virtual Machine by calling the static Runtime method availableProcessors. Once you know how many processors are available, create that many threads and split up your work accordingly.

Update: To further clarify, a Thread is just an Object in Java, so you can create it just like any other object. So, let's say that you call the method above and find that it returns 2 processors. Awesome. Now you can write a loop that creates a new Thread, hands it its share of the work, and starts it. Here's some pseudocode to demonstrate what I mean:

int processors = Runtime.getRuntime().availableProcessors();
for (int i = 0; i < processors; i++) {
    Thread yourThread = new AThreadYouCreated();
    // You may need to pass in parameters depending on the work and how you set up your thread.
    yourThread.start();
}

For more information on creating your own thread, head to this tutorial. Also, you may want to look at thread pooling (e.g. ExecutorService) for creating the threads.
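A minimal sketch of the pooled approach: one fixed-size pool sized to the processor count, one task per processor. The class name, the summing task, and the way the work is partitioned are placeholders for whatever your real workload is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolPerCore {
    // Splits the sum 0..(limit-1) across one task per available processor.
    static long parallelSum(long limit) throws Exception {
        int processors = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(processors);
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < processors; i++) {
            final long start = i;
            parts.add(pool.submit(() -> {
                long sum = 0;
                // Each task takes every processors-th number, so together
                // the tasks cover 0..(limit-1) exactly once.
                for (long n = start; n < limit; n += processors) {
                    sum += n; // stand-in for your real work
                }
                return sum;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) {
            total += f.get(); // blocks until that task finishes
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(1_000_000)); // 499999500000
    }
}
```

The pool handles the thread lifecycle for you, which is why it is usually preferred over starting raw Threads in a loop.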

Optimal number of threads per core

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but past some point they cause performance degradation.

Not long ago, I was doing performance testing on a machine with two quad-core CPUs running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
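That kind of sweep is easy to script. Below is a rough sketch that times a fixed batch of CPU-bound tasks at several pool sizes; the class name, the busy-work task, and the chosen pool sizes are all placeholders, and the absolute numbers will vary wildly by machine.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadSweep {
    // Times how long `tasks` busy-work jobs take with a pool of `threads`.
    static long runBatchMillis(int threads, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(tasks);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                double x = 0;
                for (int n = 0; n < 200_000; n++) x += Math.sqrt(n); // placeholder work
                if (x < 0) System.out.print(""); // keep the JIT from dropping the loop
                done.countDown();
            });
        }
        done.await(); // wait for the whole batch to drain
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int threads : new int[] {1, cores, cores * 2, cores * 4}) {
            System.out.println(threads + " threads: " + runBatchMillis(threads, 64) + " ms");
        }
    }
}
```

Run it a few times and look for the pool size where throughput stops improving; that is roughly where you start paying for context switches instead of gaining parallelism.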

One thing for sure: 4k threads will take longer. That's a lot of context switches.

Threads configuration based on no. of CPU-cores

The optimal number of threads to use depends on several factors, but mostly the number of available processors and how CPU-intensive your tasks are. Java Concurrency in Practice proposes the following formula to estimate the optimal number of threads:

N_threads = N_cpu * U_cpu * (1 + W / C)

Where:

  • N_threads is the optimal number of threads
  • N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
  • U_cpu is the target CPU utilization (1 if you want to use the full available resources)
  • W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)

So for example, in a CPU-bound scenario, you would have as many threads as CPUs (some advocate using that number + 1, but I have never seen it make a significant difference).

For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
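The formula translates directly into code. Here is a small sketch; the class and method names are made up for illustration, and note that with N_cpu = 10 and W/C = 10 the formula actually gives 110 threads, which the "around 100" figure above rounds.

```java
public class PoolSizer {
    // N_threads = N_cpu * U_cpu * (1 + W/C), from Java Concurrency in Practice.
    static int optimalThreads(int nCpu, double targetUtilization, double waitToComputeRatio) {
        // Clamp to at least 1 so a tiny utilization target still yields a worker.
        return (int) Math.max(1, nCpu * targetUtilization * (1 + waitToComputeRatio));
    }

    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors();
        // CPU-bound: W/C = 0 -> one thread per core.
        System.out.println(optimalThreads(nCpu, 1.0, 0.0));
        // Slow I/O, e.g. a crawler where downloading is 10x slower than parsing.
        System.out.println(optimalThreads(10, 1.0, 10.0)); // 110
    }
}
```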

Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).

This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.

Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.

Multithreading & parallelism with regards to the number of CPU cores

Yes, it essentially means that at most 2 threads can run in parallel. Intel processors implement HyperThreading, where a single physical core presents itself as two logical cores and can run two threads (mostly) in parallel, so in that case you would end up with at most 2*2=4 concurrent threads; most OSes recognize HT cores and treat a dual-core HT processor as if it had 4 cores. But all this meditation is of limited use. For performance, there is much more to consider than the sheer number of cores. For parallelism as a correctness problem, multitasking operating systems already simulate a large number of parallel threads even on one core, and these simulations are faithful in the sense that any problem or benefit of real parallelism can still be observed under them.
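Note that on the JVM, Runtime.availableProcessors() reports logical processors, so on a HyperThreaded dual-core it would typically return 4, not 2. A one-liner to check on your own machine (the class name is arbitrary):

```java
public class CoreCount {
    public static void main(String[] args) {
        // Counts logical processors: physical cores x hardware threads per core
        // (e.g. 4 on a dual-core with HyperThreading enabled).
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("Logical processors visible to the JVM: " + logical);
    }
}
```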

Threads vs cores when threads are asleep

If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.

There are a few factors that could cause the behavior to deviate from the "ideal".

  1. If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.

  2. There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:

    • Managing the file system.
    • Managing physical / virtual memory system.
    • Dealing with network traffic.
    • Scheduling processes and threads.

    That will reduce the CPU available to your application.

  3. The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.

  4. There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.


So:

1) If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.

Probably not all of the time, due to other user processes and OS overheads.

2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.

Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.

3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.

Approximately: see above.

4) A thread that is asleep costs only memory, not CPU time, while it remains asleep. For example, 10,000 sleeping threads use the same amount of CPU as 1 sleeping thread.

There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.

Another one is that the memory used by the threads may also come at a cost. For instance if the sum of the memory used for all process (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.

5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you get to a state where all the cores are busy 100% of the time.

Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.

This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:

  1. Save the old thread's registers.
  2. Flush the processor's memory cache.
  3. Invalidate the VM mapping registers, etcetera. This includes the TLBs that @bazza mentioned.
  4. Load the new thread's registers.
  5. Take performance hits due to having to do more main-memory reads and VM page translations because of the previous cache invalidations.

These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
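A rough way to feel this cost yourself is a two-thread ping-pong over a shared lock, which forces at least one context switch per handoff. This is a sketch, not a rigorous benchmark: the class name is made up, the measured time also includes wait/notify overhead, and results vary wildly by machine and OS.

```java
public class PingPong {
    static final Object lock = new Object();
    static boolean ping = true;

    // Bounces `rounds` handoffs between two threads; returns average
    // nanoseconds per handoff (each handoff implies a context switch).
    static long averageHandoffNanos(int rounds) throws InterruptedException {
        Thread other = new Thread(() -> {
            synchronized (lock) {
                for (int i = 0; i < rounds; i++) {
                    while (ping) {
                        try { lock.wait(); } catch (InterruptedException e) { return; }
                    }
                    ping = true;       // hand the ball back
                    lock.notify();
                }
            }
        });
        other.start();
        long start = System.nanoTime();
        synchronized (lock) {
            for (int i = 0; i < rounds; i++) {
                ping = false;          // hand the ball over
                lock.notify();
                while (!ping) lock.wait();
            }
        }
        other.join();
        // Two switches per round (over and back), so divide by 2 * rounds.
        return (System.nanoTime() - start) / (2L * rounds);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(averageHandoffNanos(100_000) + " ns per handoff");
    }
}
```

On typical hardware you should see figures in the microsecond neighborhood, consistent with the ~1.2 µs cited above.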

Is having multiple cores in a CPU for running multiple threads/processes at once, or for instruction-level parallelism?

ILP is purely within each physical core separately.

cross-site duplicate: How does a single thread run on multiple cores?

(It doesn't - each core has multiple execution units and a wide front-end. Read my linked answer for details that I'm not going to duplicate. See also Modern Microprocessors: A 90-Minute Guide!)

Also, real CPU cores can pretend to be multiple "logical cores" (i.e. each has its own register context and can enter the kernel's __schedule() independently). Generically, this is SMT; the most widely known brand-name implementation of the concept is Intel's HyperThreading. Letting multiple software threads share a core truly simultaneously (not via software context-switching) gives the core two instruction streams to find parallelism between as well as within, generally increasing overall throughput (at some cost to single-thread performance, depending on how busy a single thread of your workload could keep a core).


In some contexts, "CPU" is a synonym for a single core, e.g. perf stat output saying "7.5 CPUs utilized".

But more often, a CPU refers to the whole physical package, e.g. my CPU is a quad-core, an i7-6700k. Server motherboards are often dual-socket, allowing you to plug in two separate multi-core CPUs.

Perhaps that's what created some terminology confusion?


