How Much Overhead Is There When Creating a Thread?

How much overhead is there when creating a thread?

...sends messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?

This is highly system specific. For example, the last time I used VMS, threading was nightmarishly slow (it's been years, but from memory one thread could create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core dump), whereas on Linux you can probably create thousands per second. If you want to know exactly, benchmark it on your system. But that number alone isn't much use without knowing more about the messages: whether they average 5 bytes or 100 KB, whether they arrive back to back or the line idles in between, and what the app's latency requirements are. All of these are as relevant to the appropriateness of the code's thread use as any absolute measurement of thread-creation overhead. And performance may not have needed to be the dominant design consideration anyway.

Why is creating a Thread said to be expensive?

Why is creating a Thread said to be expensive?

Because it *is* expensive.

Java thread creation is expensive because there is a fair bit of work involved:

  • A large block of memory has to be allocated and initialized for the thread stack.
  • System calls need to be made to create / register the native thread with the host OS.
  • Descriptors need to be created, initialized and added to JVM-internal data structures.

It is also expensive in the sense that the thread ties down resources as long as it is alive; e.g. the thread stack, any objects reachable from the stack, the JVM thread descriptors, the OS native thread descriptors.

The costs of all of these things are platform specific, but they are not cheap on any Java platform I've ever come across.


A Google search found me an old benchmark that reports a thread-creation rate of roughly 4000 per second with Sun Java 1.4.1 on a 2002-vintage dual-processor Xeon running 2002-vintage Linux. A more modern platform will give better numbers ... and I can't comment on the methodology ... but at least it gives a ballpark for how expensive thread creation is likely to be.

Peter Lawrey's benchmarking indicates that thread creation is significantly faster these days in absolute terms, but it is unclear how much of this is due to improvements in Java and/or the OS, or simply to higher processor speeds. Either way, his numbers still indicate a 150-plus-fold improvement if you use a thread pool versus creating and starting a new thread each time. (And he makes the point that this is all relative ...)
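
To make the pool-versus-spawn difference concrete, here is a minimal sketch (not Peter Lawrey's benchmark; the task count, pool size, and the deliberately trivial task body are arbitrary choices) that times creating a fresh thread per task against reusing a small fixed pool:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolVsSpawn {
    static final int TASKS = 10_000;         // arbitrary task count
    static final Runnable TASK = () -> { };  // trivial on purpose: we want to measure overhead only

    public static void main(String[] args) throws InterruptedException {
        // One fresh thread per task: pays the creation cost TASKS times.
        long t0 = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            Thread t = new Thread(TASK);
            t.start();
            t.join();
        }
        System.out.printf("new Thread per task: %d ms%n", (System.nanoTime() - t0) / 1_000_000);

        // A small fixed pool reuses its threads: the creation cost is paid once.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        t0 = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            pool.execute(TASK);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.printf("thread pool:         %d ms%n", (System.nanoTime() - t0) / 1_000_000);
    }
}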


The above assumes native threads rather than green threads, but modern JVMs all use native threads for performance reasons. Green threads are possibly cheaper to create, but you pay for it in other areas.

Update: The OpenJDK Loom project aims to provide a light-weight alternative to standard Java threads, among other things. They are proposing virtual threads, which are a hybrid of native threads and green threads. In simple terms, a virtual thread is rather like a green-thread implementation that uses native threads underneath when parallel execution is required.

As of now (Jan 2021) the Project Loom work is still at the prototyping stage, with (AFAIK) no Java version yet targeted for its release.
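
For a feel of what this looks like in code, the API shape from the Loom prototype builds is roughly as in the sketch below; given the prototyping status noted above, treat the exact names and signatures as subject to change:

import java.util.concurrent.Executors;

public class VirtualThreadSketch {
    public static void main(String[] args) throws InterruptedException {
        // One virtual thread, built via the Thread.Builder API.
        Thread vt = Thread.ofVirtual()
                .name("message-handler")
                .start(() -> System.out.println("hello from a virtual thread"));
        vt.join();

        // Or let an executor create one cheap virtual thread per task.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> { /* per-message work goes here */ });
            }
        } // close() implicitly waits for the submitted tasks to finish
    }
}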


I've done a bit of digging to see how a Java thread's stack really gets allocated. In the case of OpenJDK 6 on Linux, the thread stack is allocated by the call to pthread_create that creates the native thread. (The JVM does not pass pthread_create a preallocated stack.)

Then, within pthread_create the stack is allocated by a call to mmap as follows:

mmap(0, attr.__stacksize,
     PROT_READ|PROT_WRITE|PROT_EXEC,
     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)

According to man mmap, the MAP_ANONYMOUS flag causes the memory to be initialized to zero.

Thus, even though it might not be essential that new Java thread stacks are zeroed (per the JVM spec), in practice (at least with OpenJDK 6 on Linux) they are zeroed.
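
Since the stack is the dominant per-thread allocation, its size is also the main tunable: globally via the -Xss JVM option, or per thread through a Thread constructor overload. Here is a minimal sketch of the latter (the 64 KB figure is an arbitrary illustration; per the javadoc, the value is only a hint and some JVMs ignore it entirely):

public class StackSizeSketch {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println("running with a reduced stack request");

        // Thread(ThreadGroup, Runnable, String, long stackSize): the last
        // argument is a suggested stack size in bytes, not a guarantee.
        Thread small = new Thread(null, task, "small-stack", 64 * 1024);
        small.start();
        small.join();
    }
}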

Java thread creation overhead

Here is an example microbenchmark:

public class ThreadSpawningPerformanceTest {
    static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
        Thread[] tt = new Thread[threadCount];
        final int[] aa = new int[tt.length];
        System.out.print("Creating " + tt.length + " Thread objects... ");
        long t0 = System.nanoTime(), t00 = t0;
        for (int i = 0; i < tt.length; i++) {
            final int j = i;
            tt[i] = new Thread() {
                public void run() {
                    int k = j;
                    for (int l = 0; l < workAmountPerThread; l++) {
                        k += k * k + l;
                    }
                    aa[j] = k;
                }
            };
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Starting " + tt.length + " threads with " + workAmountPerThread + " steps of work per thread... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].start();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Joining " + tt.length + " threads... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].join();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        long totalTime = System.nanoTime() - t00;
        int checkSum = 0; // display the checksum to give the JVM no chance to optimize out the contents of run(), and possibly even the thread creation
        for (int a : aa) {
            checkSum += a;
        }
        System.out.println("Checksum: " + checkSum);
        System.out.println("Total time: " + totalTime * 1E-6 + " ms");
        System.out.println();
        return totalTime;
    }

    public static void main(String[] kr) throws InterruptedException {
        int workAmount = 100000000;
        int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
        int trialCount = 2;
        long[][] time = new long[threadCount.length][trialCount];
        for (int j = 0; j < trialCount; j++) {
            for (int i = 0; i < threadCount.length; i++) {
                time[i][j] = test(threadCount[i], workAmount / threadCount[i]);
            }
        }
        System.out.print("Number of threads ");
        for (long t : threadCount) {
            System.out.print("\t" + t);
        }
        System.out.println();
        for (int j = 0; j < trialCount; j++) {
            System.out.print((j + 1) + ". trial time (ms)");
            for (int i = 0; i < threadCount.length; i++) {
                System.out.print("\t" + Math.round(time[i][j] * 1E-6));
            }
            System.out.println();
        }
    }
}

The results on 64-bit Windows 7 with Sun's 32-bit Java 1.6.0_21 Client VM on an Intel Core2 Duo E6400 @ 2.13 GHz are as follows:

Number of threads      1    2    10   100  1000  10000  100000
1. trial time (ms)   346  181   179   191   286   1229   11308
2. trial time (ms)   346  181   187   189   281   1224   10651

Conclusions: Two threads do the work almost twice as fast as one, as expected, since my computer has two cores. My computer can spawn nearly 10000 threads per second, i.e. the thread-creation overhead is about 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).

What is the overhead of a waiting thread?

  1. What is the overhead of a thread which is in wait() mode?

None. A waiting thread doesn't consume any CPU cycles at all; it just waits to be awakened. So don't worry about it.


  1. I assume that since a waiting thread will be continuously polling on a monitor/lock internally to wake up, it might consume a considerable amount of CPU cycles to maintain a waiting thread. Correct me if I am wrong.

That's not true. A waiting thread doesn't do any polling on a monitor, a lock, or anything else.

The only situation where a large number of threads can hurt performance is when there are many active threads (far more than the number of CPUs/cores) that are frequently switched back and forth, because CPU context switching also comes at a cost. Waiting threads consume only memory, not CPU.
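
Here is a minimal sketch of that behaviour: the waiter below blocks inside wait() and is simply descheduled until another thread calls notify(); while it is parked you can observe (e.g. in a process monitor) that it uses no CPU. The one-second sleep exists only to make that window observable:

public class WaitSketch {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (LOCK) {
                try {
                    // Parked here: no polling, no CPU use, until notify().
                    // (Real code should wait in a loop guarded by a condition,
                    // to cope with spurious wakeups.)
                    LOCK.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            System.out.println("woken up");
        });
        waiter.start();

        Thread.sleep(1000);   // the waiter costs nothing during this second
        synchronized (LOCK) {
            LOCK.notify();    // makes the waiter runnable again
        }
        waiter.join();
    }
}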

If you want to look at the internal implementation of threads, I have to disappoint you. Methods like wait()/notify() are native, which means that their implementation depends on the JVM. In the case of the HotSpot JVM you can take a look at its source code (written in C++, with a bit of assembler).

But do you really need to? Why not simply trust the JVM documentation?

Is it always good for performance to create a thread?

  1. Creating a thread can be expensive. If you have only a very small amount of work to do, it might not be worth it. This article's measurements show that creating a thread may take on the order of milliseconds.

  2. Threads are an abstraction over CPU cores, and while you can basically create as many threads as you want, the number of available cores is fixed. After a certain point, you will get no additional speedup because the hardware has no more to offer, and you may actually slow things down by introducing more bookkeeping and communication overhead.

  3. Even if you were not limited by hardware concurrency, most workloads are not completely parallelizable, and you'll be limited by the non-parallel parts of your problem (see https://en.wikipedia.org/wiki/Amdahl%27s_law and the worked example after the figure below).

[Figure: speedup versus number of processors under Amdahl's law]
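
For a feel of point 3, here is a small sketch that evaluates Amdahl's formula, speedup(n) = 1 / ((1 - p) + p/n), for an assumed parallelizable fraction p = 0.9 (the fraction is illustrative only):

public class AmdahlSketch {
    // Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction of the work
    // and n is the number of threads/processors.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.9;  // assume 90% of the work parallelizes
        for (int n : new int[]{1, 2, 4, 8, 16, 1024}) {
            System.out.printf("%5d threads -> %.2fx%n", n, speedup(p, n));
        }
        // Even with unlimited threads, speedup is capped at 1 / (1 - p) = 10x.
    }
}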

What kind of overhead to c++11 threads introduce?

C++ offers a fairly thin wrapper on top of the underlying implementation, adding no significant overhead of its own. In fact, you can even get a handle to the underlying OS thread via std::thread::native_handle(): a pthread handle under g++ (where it is a __gthread_t) and a WINAPI thread handle under Visual C++.

However, threads do have intrinsic overhead, because they need to be scheduled by the OS, contain a stack and so forth.

An analysis by Mark Russinovich goes through the limits on thread creation under Windows. These limits are, of course, a direct consequence of per-thread overhead:

  • A thread requires about 1 MB of virtual address space (default linker setting)
  • 4-16 KB of initial commit size
  • 12-48 KB of nonpageable memory

How much memory does a thread consume when first created?

I have a server application that is heavy in thread usage. It uses a thread pool whose size is configured by the customer, and at least one site runs it with 1000+ threads, yet at startup it uses only 50 MB. The reason is that Windows reserves 1 MB of address space for each stack but does not necessarily back it with physical memory; only a smaller part is committed. If the stack grows beyond that, a page fault is generated and more physical memory is allocated. I don't know what the initial commit is, but I would assume it matches the allocation granularity of the system (usually 64 KB). Of course, a thread also uses a little more memory for other things when created (TLS, TSS, etc.), but my guess for the total would be about 200 KB. And bear in mind that any memory that is not frequently used will be paged out by the virtual memory manager.

Overhead of threads on performance

General Guidelines:

It can be quite fiddly to increase application performance by using threads; with too few you're not doing as well as you might, and with too many the overhead of the threads erodes the benefit gained from your program's concurrency.

To get it just right you have to set up as many threads as there are cores, and make sure your program breaks down nicely into that many threads. That can be tricky to get right if you're allowing for the program being run on a wide range of hardware.

Thread pools were invented to help here. Broadly speaking, they have just the right number of threads for the hardware the program happens to be running on, and they let you submit lots of small concurrent tasks for execution without the overhead of setting up a new thread for each one. Thus your program runs well on a wide range of different hardware without you having to worry too much about it.
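
A minimal sketch of that pattern, sizing a fixed pool from the reported core count (the process(int) method is a placeholder for the application's unit of work):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizingSketch {
    public static void main(String[] args) {
        // Match the pool to the hardware instead of hard-coding a thread count.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Lots of small tasks, with no per-task thread-creation cost.
        for (int i = 0; i < 1_000; i++) {
            final int id = i;
            pool.execute(() -> process(id));
        }
        pool.shutdown();
    }

    // Placeholder for the application's unit of work.
    static void process(int id) { /* ... */ }
}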

What is the overhead for thread creation in the .NET Micro Framework?

In my experience there's roughly a 1 KB memory cost for each thread under NETMF. As for the time required to allocate a thread: if you're contemplating questions like that, it's probably time to do a bit of reading on embedded-systems best practice. I'm not mocking you; there's quite a bit of hard-won lore that can save you heartache and hassle. Case in point, the thread thing: if you want reliability, you have to guarantee maximum resource demand. If you're going to say "no more than 5 threads", then you may as well start all five as part of your initialisation process and allocate all the resources they're going to want. If you can't do that, then you can't guarantee the stability of your system under load. A side effect of this is that the time required to start the threads is irrelevant to the responsiveness of your system, although it does affect boot time slightly.

There is overhead for context switching. I can't give you quantified figures because I've never needed to benchmark it. NETMF is implemented right on the metal; more than likely you can get some insight from the SoC documentation which you can download from ATMEL. Or if you ask on the netduino forums there's a fair chance Chris can tell you off the cuff.

If this is a homework question then take Hans' advice and look at the source code. If you're looking to build something and assessing the platform's suitability for an application, then it may be of interest that I have never suffered from switching lag when doing timing-sensitive things on different threads; but I never use more than three or four threads, and one of them services a number of logical processes (all the timing-insensitive stuff) in round-robin fashion.

Once again, the key to long term stability is to avoid dynamic allocation of anything.

An advantage of explicitly coded round robin is that you have control of sequence for the logical processes.


