Multithreading Program Stuck in Optimized Mode But Runs Normally in -O0

Visual Studio C++ Runtime Issue with Multithreading on the Release Configuration

My guess would be that you forgot to declare your shared global variables volatile (nPC_Current specifically). Since the thread function itself never modifies nPC_Current, in the release build the compiler optimized your progress-bar loop into an infinite loop over a never-changing value of nPC_Current.

This is why your progress bar never moves past 0% in the release build, and why your progress-bar thread never terminates.

P.S. Also, it appears that you originally intended to pass your nPC_Current counter to the thread function as a thread parameter (judging by your CreateThread call). However, in the thread function you ignore the parameter and access nPC_Current directly as a global variable. It might be better to stick to the original plan of passing and accessing it as a thread parameter.

GDB debug output for multi-thread program

We know that the program is segfaulting on this line:

current_node->children.insert(std::pair<string, ComponentTrieNode*>(comps[j], temp_node));

From the stack trace, we know that the segfault happens deep inside the red-black tree implementation underlying std::map:

#0  std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126 __x->_M_right = __y->_M_left;

This implies that:

  1. The segfault could be caused by:
    1. evaluating __x->_M_right
    2. evaluating __y->_M_left
    3. storing the value of __y->_M_left into __x->_M_right
  2. The fact that std::map::insert() was reached implies that the segfault was NOT caused while building the arguments to the call. In particular, comps[j] is not out of bounds.

This leads me to think that your heap was already corrupted by earlier memory errors by this point, and that the crash in std::map::insert() is a symptom, not the cause.

Run your program under the Valgrind memcheck tool:

$ valgrind --tool=memcheck /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt

and carefully read Valgrind's output afterwards to find the first memory error in your program.

Valgrind is implemented as a virtual CPU, so your program will slow down by a factor of roughly 30. This is time-consuming but should let you make progress in troubleshooting the problem.

In addition to Valgrind, you might also want to try enabling debug mode for the libstdc++ containers:

To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG. Note that this flag changes the sizes and behavior of standard class templates such as std::vector, and therefore you can only link code compiled with debug mode and code compiled without debug mode if no instantiation of a container is passed between the two translation units.

If your program uses no external libraries then rebuilding the whole thing with -D_GLIBCXX_DEBUG added to CXXFLAGS in the Makefile should work. Otherwise you'd need to know whether C++ containers are passed between components compiled with and without the debug flag.

Valgrind Log Review

I'm surprised that you're using strtok() in a multi-threaded program. Is ComponentTrie::add_prefix() never called from two threads concurrently? While fixing the invalid read at ComponentTrie_david.cpp:99, where strtok() is used, you might want to replace strtok() with strtok_r() as well.

Concurrent Access to STL Containers

The standard C++ containers are explicitly documented to not do thread synchronization:

The user code must guard against concurrent function calls which access any particular library object's state when one or more of those accesses modifies the state. An object will be modified by invoking a non-const member function on it or passing it as a non-const argument to a library function. An object will not be modified by invoking a const member function on it or passing it to a function as a pointer- or reference-to-const. Typically, the application programmer may infer what object locks must be held based on the objects referenced in a function call and whether the objects are accessed as const or non-const.

(That's from the GNU libstdc++ documentation, but the C++11 standard specifies essentially the same behavior.) Concurrent modification of std::map and other containers is a serious error and is likely the culprit behind the crash. Guard each container with its own pthread_mutex_t, or use the OpenMP synchronization mechanisms.

gcc optimisation effect on loops with apparently constant variable

This code has Undefined Behavior. You're modifying hit from one thread and reading it from another, without synchronization.

Optimizing hit to false is a valid outcome of Undefined Behavior. You can fix this by making hit a std::atomic<bool>: that makes the access well-defined and blocks the optimization.

Write to multiple text files from multiple threads

You have at least two problems:

  • you make far too many syscalls; for each loop iteration of each thread you do an open(), a write(), (maybe) a flush(), and finally a close(); that's at least 300k syscalls!
  • you create 100k FileWriter objects and 100k File objects; the GC needs to handle all of them, and since the GC runs in its own thread and is scheduled like any other thread, it will run more or less often.

The problem is therefore more with your program than anything OS-related... The JIT can't do anything for you here.

Also, since you use Java 7, you should consider using Files.newBufferedWriter() -- only once per thread, of course, not 10000 times!


Further note about the "syscall problem": at least on Unix systems (but other OSes likely work the same), each time you make a syscall your process has to enter kernel mode for the duration of the syscall; this is not free. Even if the cost is modest on modern systems, it is still significantly higher than avoiding the user->kernel->user round trip entirely.


Well, OK, I lied a little; the JIT does kick in but it will only optimize the user side of things. The JIT will start to optimize after 10k executions of a method, here your run(), and optimize more as time passes.

Which Java thread is hogging the CPU?

Try looking at the Hot Thread Detector plugin for VisualVM -- it uses the ThreadMXBean API to take multiple CPU-consumption samples and find the most active threads. It's based on a command-line equivalent by Bruce Chapman, which might also be useful.

Is thread time spent in synchronization too high?

I'm not sure what this means.

It means that the threads were on average spending 75% of their time waiting for another thread to finish some work.

Does this mean that the application is suffering from a live-lock condition?

Maybe!

To clarify for readers unfamiliar with the term: a 'deadlock' is when two threads are both waiting for each other to finish, and therefore they wait forever. A 'live lock' is a situation where two threads are trying to avoid a deadlock, but due to their poor choices, spend most of their time waiting anyway.

Imagine for example a table with two people, a fork and a knife. Both people wish to pick up both utensils, use them, and then put them down. Suppose I pick up the knife and you pick up the fork. If we both decide to wait for the other to put the utensil down, we are deadlocked. If we both realize that we're about to deadlock, and I put down the knife, and you put down the fork, and then I pick up the fork and you pick up the knife, we are live-locked. We can repeat this process indefinitely; we're both working to resolve the situation, but we're not communicating effectively enough to actually resolve it quickly.

However, my guess is that you're not in a live-lock situation. My guess is rather that you simply have enormous contention on a small number of critical resources that can only be accessed by one thread at a time. Occam's Razor would indicate that you should assume the simple hypothesis -- lots of threads taking turns using a scarce resource -- rather than the complicated hypothesis -- a whole bunch of threads all trying to tell each other "no, you go first".

There are ~30+ long-running threads bound to a single AppDomain (if that matters) and some of the threads are very busy (Ex. while(true) { _waitEvent.WaitOne(0); //do stuff }).

Sounds awful.

I realize this is a fairly vague question.

Yes, it is.

How much is too much, and why?

Well, suppose you were trying to drive across town, and you and every other driver in the city spent 75% of your time stopped at traffic lights waiting for other drivers. You tell me: is that too much, and why? Spending an hour in traffic to drive for 15 minutes distance might be perfectly acceptable to some people and utterly unacceptable to other people. Every time I take SR 520 at rush hour I spend an hour in traffic to move a distance that should take 15 minutes; that wasn't acceptable to me so now I take the bus.

Whether this lousy performance is acceptable to you and your customers or not is your call. Fixing performance problems is expensive. The question you should be asking is how much profit you'll gain by taking on the expense of diagnosing and fixing the problem.

Is ~75% really bad?

Your threads are taking four times longer than they need to. Doesn't seem too good to me.

Do I have too many threads?

You almost certainly do, yes. 30 is a lot.

But that is completely the wrong technical question to ask in your situation. Asking "do I have too many threads?" is like trying to fix traffic congestion by asking "does this city have too many cars?" The right question to ask is "why are there so many traffic lights in this city where there could be freeways?" The problem isn't the threads; the problem is that they are waiting on each other instead of driving on through to their destinations without stopping.

should I just start looking in other areas?

How on earth should we know?


