What Is a Busy Loop

What is a busy loop in C?

A busy loop is a loop which purposely wastes time waiting for something to happen. Normally, you would want to avoid busy loops at all costs, as they consume CPU time doing nothing and therefore are a waste of resources, but there are rare cases in which they might be needed.

One of those cases is indeed when you need to sleep for a long amount of time and you have things like signal handlers installed that could interrupt sleeping. However, a "sleep busy loop" is hardly a busy loop at all, since almost all the time is spent sleeping.

You can build a busy loop with any loop construct you prefer; after all, for, while, do ... while and goto are all interchangeable constructs in C, given the appropriate control code.

Here's an example using clock_nanosleep:

// I want to sleep for 10 seconds, but I cannot do it with just a single
// syscall as it might get interrupted; I need to keep requesting to sleep
// until the entire 10 seconds have elapsed. Needs <time.h> and <errno.h>.

struct timespec requested = { .tv_sec = 10, .tv_nsec = 0 };
struct timespec remaining;
int err;

for (;;) {
    err = clock_nanosleep(CLOCK_MONOTONIC, 0, &requested, &remaining);

    if (err == 0) {
        // We're done sleeping
        break;
    }

    if (err != EINTR) {
        // Some other error occurred, check the value of err
        // Handle err somehow
        break;
    }

    // err == EINTR: we did not finish sleeping for all the requested time.
    // Just keep going...
    requested = remaining;
}

An actual busy loop would look something like the following, where var is supposedly some sort of atomic variable set by somebody else (e.g. another thread):

while (var != 1);

// or equivalent

while (1) {
    if (var == 1)
        break;
}

Needless to say, this is the kind of loop that you want to avoid, as it continuously checks a condition and wastes CPU. A better implementation would be to use signals, pthread condition variables, semaphores, etc. There are usually plenty of different ways to avoid busy looping.
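For instance, here is a minimal sketch of the same "wait until var becomes 1" done with a pthread condition variable instead of spinning (my own illustration, not code from the original answer; the function names are made up):

#include <pthread.h>

// Shared state protected by a mutex instead of being spun on.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int var = 0;

// Waiter: sleeps inside pthread_cond_wait() instead of burning CPU.
void wait_for_var(void)
{
    pthread_mutex_lock(&lock);
    while (var != 1)                      // loop guards against spurious wakeups
        pthread_cond_wait(&cond, &lock);  // atomically unlocks and sleeps
    pthread_mutex_unlock(&lock);
}

// Setter (e.g. another thread): updates var and wakes the waiter.
void set_var(void)
{
    pthread_mutex_lock(&lock);
    var = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}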

Finally, note that in the while (var != 1) busy loop above, as @einpoklum says in the comments, the compiler may "optimize" the entire loop away by dropping the check for var, unless it has some idea that var might change. A volatile qualifier can help, but it really depends on the scenario; don't take that code as anything other than a silly example.
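If you really do need to spin, a hedged sketch of a version the compiler cannot optimize away (mine, using C11 atomics rather than plain volatile) would be:

#include <stdatomic.h>

// The flag must be something the compiler knows can change behind its back.
// _Atomic also gives well-defined inter-thread semantics, which a plain
// volatile int does not.
static _Atomic int var = 0;

void spin_until_set(void)
{
    // Still a busy loop: correct, but it burns a core until var becomes 1.
    while (atomic_load_explicit(&var, memory_order_acquire) != 1)
        ;
}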

Best way to implement busy loop?

To wait indefinitely for a signal (what else?), there is the pause() system call. You'll have to put it in a loop, as it returns (always with -1 and errno set to EINTR) every time a signal is delivered:

while (1)
    pause();

As a curious note, this is, AFAIK, the only POSIX function documented to always fail.

UPDATE: Thanks to dmckee in the comments below: sigsuspend() also always fails. It does just the same as pause(), but it is more signal-friendly. The difference is that it takes parameters, so it can fail with EFAULT in addition to EINTR.
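Here is a minimal sketch of the sigsuspend() variant (my own, with SIGUSR1 picked arbitrarily): the signal stays blocked except while waiting, which closes the race between checking the flag and going to sleep:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig) { (void)sig; got_signal = 1; }

int main(void)
{
    struct sigaction sa = { .sa_handler = handler };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    // Block SIGUSR1 so it can only be delivered inside sigsuspend().
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old);

    while (!got_signal)
        sigsuspend(&old);   // atomically unblocks SIGUSR1 and waits; always returns -1/EINTR

    sigprocmask(SIG_SETMASK, &old, NULL);
    puts("got SIGUSR1");
    return 0;
}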

SpinLock doesn't really do busy-loop waiting?

Regarding your edited question, yes busy spinning will give you the lowest latency of communication between the threads but you will pay for it by burning CPU time.

On the other hand, you can use SpinWait, which is less aggressive, or you can code something in between (see the sketch below).

It comes down to a trade-off on how much you value latency and how many cores you have to spare.
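The "something in between" could look roughly like this in C (a sketch under my own assumptions; the thresholds are arbitrary): spin briefly to catch the low-latency case, then progressively back off so you stop burning a core:

#include <sched.h>       // sched_yield
#include <stdatomic.h>
#include <time.h>        // nanosleep

void wait_for_flag(atomic_int *flag)
{
    int spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (spins < 1000) {
            spins++;                            // pure spin: lowest latency, burns CPU
        } else if (spins < 2000) {
            spins++;
            sched_yield();                      // let other runnable threads have the core
        } else {
            struct timespec ts = { 0, 100000 }; // 100 microseconds
            nanosleep(&ts, NULL);               // back off: cheap on CPU, higher latency
        }
    }
}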

Why does a busy loop take 100% of the CPU?

A complex algorithm can certainly use 100% of the CPU. However, many loops that implement complex algorithms either explicitly yield the thread periodically, or contain code that calls down into the OS at some point, where the thread is either yielded or made to wait for something (such as a call to a co-processor).

Busy loop and the barrier

TL;DR: The problem here is that you're only thinking about it as std::atomic_thread_fence(std::memory_order_seq_cst), but that's not the only thing GNU C volatile asm statements do.


Yes, obviously the barrier is there to make a nasty busy-wait delay loop. Remember that a volatile asm statement can't be reordered with any other C statements, not just memory operations.

From the Godbolt compiler explorer:

void loop_nomemclobber(int loops) {
    do {                // loop rearranged for simpler asm
        asm volatile ("" : : : /* "memory" */ );
    } while (--loops > 0);
}

loop_nomemclobber:
.L3:
    sub     edi, 1
    test    edi, edi
    jg      .L3
    ret

We still get a loop even without forcing all reachable memory to be up-to-date and treated as clobbered. So the reason the asm volatile statement does this has nothing to do with the "memory" clobber.

int loops is a local with automatic storage. The compiler can prove that nothing (including the asm statement) has any way to determine where it might be in memory, so it doesn't have to be in memory at all.
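For contrast, here is my own sketch of the same loop without the asm volatile statement: since nothing in the body has an observable side effect, a typical optimizing compiler (e.g. gcc -O2) is free to collapse the whole function down to a single ret:

void loop_no_asm(int loops)
{
    do {
        // empty body: no observable side effects, so the entire
        // delay loop can legally be optimized away
    } while (--loops > 0);
}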


How deeply is the CPU able to predict that there is a chance to apply StoreLoad reordering?

The CPU doesn't go looking for chances to reorder memory for no reason! Reordering happens naturally (unless prevented with MFENCE) because the CPU needs to buffer stores until it's certain they're not speculative, and because stores can miss in cache. So it puts stores in the store buffer, and they eventually commit to cache.

There isn't a little demon inside the CPU saying "aha, here's another chance to make things difficult for Gilgamesz, maybe I'll really trick him this time with this reordering!"


There is a real question here, and that's how far apart two instructions need to be (in time, or in number of insns, or number of intervening loads/stores) before a specific microarchitecture doesn't have enough out-of-order resources for that store to ever be buffered until after that load.

I don't know, but since StoreStore reordering isn't allowed, a cache-miss store to a highly-contended cache line can't sit there waiting to gain access to the cache line while millions of other instructions run. Unless none of those instructions were a store.

I don't know the answer, but I think it's plausible that a store could theoretically be delayed for millions of cycles on Intel Haswell, maybe bounded only by the fairness algorithms of the hardware arbitration mechanisms that handle the case where multiple cores contend for access to the same cache line.

I forget what I've read about whether modern Intel hardware works this way or not, but I think maybe a store can retire from the out-of-order core but still not have committed to L1 cache. Instead, it's only in the store queue as a store that will definitely happen. That would let cache-miss stores avoid blocking new instructions from entering the ROB. (Loads need to probe the store buffer to preserve correct execution within the single core, but doing that doesn't require the stores to also be tracked by the ROB).
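To make the StoreLoad discussion concrete, here is a hedged sketch (mine, written with C11 atomics and fences rather than a raw MFENCE) of the classic store-then-load pattern where the store buffer matters:

#include <stdatomic.h>

atomic_int x = 0, y = 0;

int thread1(void)
{
    atomic_store_explicit(&x, 1, memory_order_release);
    atomic_thread_fence(memory_order_seq_cst);   // compiles to MFENCE (or a locked op) on x86
    return atomic_load_explicit(&y, memory_order_acquire);
}

int thread2(void)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_acquire);
}

With the fences, at least one of the two threads must observe the other's store (i.e. return 1). Remove them and both may return 0, because each load can be satisfied while the corresponding store is still sitting in the store buffer: that is exactly the StoreLoad reordering being discussed.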

Busy waiting and context switch

why busy waiting avoids the context switch

Busy waiting doesn't avoid context switches! A context switch is the process in which the scheduler decides to give the CPU to another process to run. Busy waiting is a technique used to keep a process looping while waiting for something, e.g. to prevent two processes from modifying some shared data at the same time: one process will loop until the other process "tells" it that it has finished. This is a bad synchronization method, as useful CPU utilization drops: the process spends its CPU burst doing nothing.

And why is a context switch less expensive than wasting CPU time?

It depends! If the code in the critical section is short, busy waiting might be faster than a context switch. A process can ignore signals and in this way keep running on the CPU until it gives it up voluntarily. NOTE: not all signals can be ignored.
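As a concrete example of the short-critical-section case (my own sketch, not from the answer), here is a trivial test-and-set spinlock built on C11 atomic_flag. If the lock is only held for a handful of instructions, spinning like this is often cheaper than the pair of context switches a blocking mutex can cost:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    // Busy-wait until we observe the flag clear and manage to set it ourselves.
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}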

What changes if there are threads instead of processes?

I am afraid this is too broad to answer, as it might depend on the OS and on whether the OS is aware of threads.


