Threads Configuration Based on No. of Cpu-Cores

Threads configuration based on no. of CPU-cores

The optimal number of threads to use depends on several factors, but mostly the number of available processors and how cpu-intensive your tasks are. Java Concurrency in Practice proposes the following formal formula to estimate the optimal number of threads:

N_threads = N_cpu * U_cpu * (1 + W / C)

Where:

N_threads is the optimal number of threads
N_cpu is the number of prcessors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)

So for example, in a CPU-bound scenario, you would have as many threads as CPU (some advocate to use that number + 1 but I have never seen that it made a significant difference).

For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.

Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).

This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.

Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.

Optimal number of threads per core

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However that very likely not the case. Adding more threads usually helps, but after some point, they cause some performance degradation.

Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different number of threads until you find the right number for your application.

One thing for sure: 4k threads will take longer. That's a lot of context switches.

How to scale threads according to CPU cores?

You can determine the number of processes available to the Java Virtual Machine by using the static Runtime method, availableProcessors. Once you have determined the number of processors available, create that number of threads and split up your work accordingly.

Update: To further clarify, a Thread is just an Object in Java, so you can create it just like you would create any other object. So, let's say that you call the above method and find that it returns 2 processors. Awesome. Now, you can create a loop that generates a new Thread, and splits the work off for that thread, and fires off the thread. Here's some pseudocode to demonstrate what I mean:

int processors = Runtime.getRuntime().availableProcessors();
for(int i=0; i < processors; i++) {
  Thread yourThread = new AThreadYouCreated();
  // You may need to pass in parameters depending on what work you are doing and how you setup your thread.
  yourThread.start();
}

For more information on creating your own thread, head to this tutorial. Also, you may want to look at Thread Pooling for the creation of the threads.

Number of processor core vs the size of a thread pool

Many times I've heard that it is better to maintain the number of threads in a thread pool below the number of cores in that system. Having twice or more threads than the number of cores is not only a waste, but also could cause performance degradation.

The claims are not true as a general statement. That is to say, sometimes they are true (or true-ish) and other times they are patently false.

A couple things are indisputably true:

More threads means more memory usage. Each thread requires a thread stack. For recent HotSpot JVMs, the minimum thread stack size is 64Kb, and the default can be as much as 1Mb. That can be significant. In addition, any thread that is alive is likely to own or share objects in the heap whether or not it is currently runnable. Therefore is is reasonable to expect that more threads means a larger memory working set.
A JVM cannot have more threads actually running than there are cores (or hyperthread cores or whatever) on the execution hardware. A car won't run without an engine, and a thread won't run without a core.

Beyond that, things get less clear cut. The "problem" is that a live thread can in a variety of "states". For instance:

A live thread can be running; i.e. actively executing instructions.
A live thread can be runnable; i.e. waiting for a core so that it can be run.
A live thread can by synchronizing; i.e. waiting for a signal from another thread, or waiting for a lock to be released.
A live thread can be waiting on an external event; e.g. waiting for some external server / service to respond to a request.

The "one thread per core" heuristic assumes that threads are either running or runnable (according to the above). But for a lot of multi-threaded applications, the heuristic is wrong ... because it doesn't take account of threads in the other states.

Now "too many" threads clearly can cause significant performance degradation, simple by using too much memory. (Imagine that you have 4Gb of physical memory and you create 8,000 threads with 1Mb stacks. That is a recipe for virtual memory thrashing.)

But what about other things? Can having too many threads cause excessive context switching?

I don't think so. If you have lots of threads, and your application's use of those threads can result in excessive context switches, and that is bad for performance. However, I posit that the root cause of the context switched is not the actual number of threads. The root of the performance problems are more likely that the application is:

synchronizing in a particularly wasteful way; e.g. using Object.notifyAll() when Object.notify() would be better, OR
synchronizing on a highly contended data structure, OR
doing too much synchronization relative to the amount of useful work that each thread is doing, OR
trying to do too much I/O in parallel.

(In the last case, the bottleneck is likely to be the I/O system rather than context switches ... unless the I/O is IPC with services / programs on the same machine.)

The other point is that in the absence of the confounding factors above, having more threads is not going to increase context switches. If your application has N runnable threads competing for M processors, and the threads are purely computational and contention free, then the OS'es thread scheduler is going to attempt to time-slice between them. But the length of a timeslice is likely to be measured in tenths of a second (or more), so that the context switch overhead is negligible compared with the work that a CPU-bound thread actually performs during its slice. And if we assume that the length of a time slice is constant, then the context switch overhead will be constant too. Adding more runnable threads (increasing N) won't change the ratio of work to overhead significantly.

In summary, it is true that "too many threads" is harmful for performance. However, there is no reliable universal "rule of thumb" for how many is "too many". And (fortunately) you generally have considerable leeway before the performance problems of "too many" become significant.

Java threads and number of cores

Processes vs Threads

In days of old, each process had precisely one thread of execution, so processes were scheduled onto cores directly (and in these old days, there was almost only one core to schedule onto). However, in operating systems that support threading (which is almost all moderns OS's), it is threads, not processes that are scheduled. So for the rest of this discussion we will talk exclusively about threads, and you should understand that each running process has one or more threads of execution.

Parallelism vs Concurrency

When two threads are running in parallel, they are both running at the same time. For example, if we have two threads, A and B, then their parallel execution would look like this:

CPU 1: A ------------------------->

CPU 2: B ------------------------->

When two threads are running concurrently, their execution overlaps. Overlapping can happen in one of two ways: either the threads are executing at the same time (i.e. in parallel, as above), or their executions are being interleaved on the processor, like so:

CPU 1: A -----------> B ----------> A -----------> B ---------->

So, for our purposes, parallelism can be thought of as a special case of concurrency*

Scheduling

But we are able to produce a thread pool(lets say 30) with a larger number than the number of cores that we posses(lets say 4) and have them run concurrently. How is this possible if we are only have 4 cores?

In this case, they can run concurrently because the CPU scheduler is giving each one of those 30 threads some share of CPU time. Some threads will be running in parallel (if you have 4 cores, then 4 threads will be running in parallel at any one time), but all 30 threads will be running concurrently. The reason you can then go play games or browse the web is that these new threads are added to the thread pool/queue and also given a share of CPU time.

Logical vs Physical Cores

According to my current understanding, a core can only perform 1 process at a time

This is not quite true. Due to very clever hardware design and pipelining that would be much too long to go into here (plus I don't understand it), it is possible for one physical core to actually be executing two completely different threads of execution at the same time. Chew over that sentence a bit if you need to -- it still blows my mind.

This amazing feat is called simultaneous multi-threading (or popularly Hyper-Threading, although that is a proprietary name for a specific instance of such technology). Thus, we have physical cores, which are the actual hardware CPU cores, and logical cores, which is the number of cores the operating system tells software is available for use. Logical cores are essentially an abstraction. In typical modern Intel CPUs, each physical core acts as two logical cores.

can someone explain how this works and also recommend some good reading on this?

I would recommend Operating System Concepts if you really want to understand how processes, threads, and scheduling all work together.

The precise meanings of the terms parallel and concurrent are hotly debated, even here in our very own stack overflow. What one means by these terms depends a lot on the application domain.

Tomcat thread pool size and number of cores

the number of cpu cores rarely exceeds 32.
As others have mentioned, threads provide concurrency - another thread can be executing while a different thread is in a blocked state. If all threads were in an active state, then it's true that some threads will context switch to another thread assigned to its core. But anyway, this also seems like a configurable parameter - https://www.baeldung.com/java-web-thread-pool-config

How to get the number of threads per cores in python

You can use psutil module in python3.

python3 -m pip install psutil

The psutil.cpu_count method returns the number of logical CPUs in the system (same as os.cpu_count in Python 3.4) or None if undetermined. logical cores means the number of physical cores multiplied by the number of threads that can run on each core (this is known as Hyper Threading). If logical is False return the number of physical cores only (Hyper Thread CPUs are excluded) or None if undetermined.

import psutil

threads_count = psutil.cpu_count() / psutil.cpu_count(logical=False)

Threads Configuration Based on No. of Cpu-Cores