Why Does This Delay-Loop Start to Run Faster After Several Iterations with No Sleep


After 26 iterations, Linux ramps the CPU up to its maximum clock speed because your process has used its full time slice a couple of times in a row.

If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (dynamic voltage and frequency scaling, which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).

If you tested on a Skylake with kernel support for the new power-management mode (hardware P-states, where the hardware takes full control of the clock speed), ramp-up would happen much faster.

If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)


Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
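
The ramp-up is easy to observe from ordinary code. Here is a minimal sketch of mine (the OP's program was C, but the effect is language-agnostic; the class name and loop counts are arbitrary choices) that times the same busy loop repeatedly. On a machine governed by DVFS, later repetitions typically report less wall-clock time. Note that on the JVM the first few repetitions are also sped up by JIT warm-up, which is a separate effect; performance counters are the way to tell the two apart.

public class DvfsRampDemo {
    public static void main(String[] args) {
        long sink = 0;
        for (int rep = 0; rep < 40; rep++) {
            long start = System.nanoTime();
            for (int i = 0; i < 100_000; i++) {
                sink += i;  // trivial busy work, standing in for the OP's empty loop
            }
            System.out.println("rep " + rep + ": " + (System.nanoTime() - start) + " ns");
        }
        System.out.println(sink);  // keep the JIT from removing the loop entirely
    }
}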



comments on other theories:

re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.

Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!

In more detail for x86:

The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.

The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.

gcc -O0 will compile the OP's code with the loop counter kept in memory on the stack (see the generated asm on the Godbolt compiler explorer).

The empty loop keeps the loop counter in stack memory, so the loop runs at about one iteration per 6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple of cache misses (~200 cycles each for code-fetch misses, which prevent further instructions from issuing until they're resolved).

Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).

Even if the loop counter were kept in a register, 100k cycles is a lot.

Why does sleeping between iterations cause operations in a loop to take longer than when it does not sleep

When you sleep for even a short period of time (you may find that 10 ms is long enough) you give up the CPU, and the data, instruction and branch-prediction caches are disturbed or even cleared. Even making a system call like System.currentTimeMillis() or the much more accurate System.nanoTime() can do this to a small degree.

AFAIK, the only way to avoid giving up the core is to busy-wait and use thread affinity to lock your thread to a core. This minimises such disturbance and means your program can run 2-5x faster in low-latency situations, i.e. when sub-millisecond tasks matter.
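
As a rough illustration (a sketch of mine, not from the answer; the numbers vary wildly by OS and hardware), you can measure how far past a 10 ms target each strategy wakes up:

public class SleepVsBusyWait {
    public static void main(String[] args) throws InterruptedException {
        final long targetNs = 10_000_000L;  // 10 ms

        long start = System.nanoTime();
        Thread.sleep(10);  // gives up the core to the OS scheduler
        long sleepOvershoot = System.nanoTime() - start - targetNs;

        start = System.nanoTime();
        while (System.nanoTime() - start < targetNs) {
            // busy-wait: never gives up the core
        }
        long busyOvershoot = System.nanoTime() - start - targetNs;

        System.out.println("Thread.sleep overshoot: " + sleepOvershoot + " ns");
        System.out.println("busy-wait overshoot:    " + busyOvershoot + " ns");
    }
}

The busy-wait typically overshoots by nanoseconds to a few microseconds, while Thread.sleep is at the mercy of the scheduler's timer resolution.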

For your interest

http://vanillajava.blogspot.co.uk/2012/01/java-thread-affinity-support-for-hyper.html

http://vanillajava.blogspot.co.uk/2012/02/how-much-difference-can-thread-affinity.html

How can I delay a loop iteration accurately to hit frequencies like 1M per second, with System.nanoTime()?

@TobiasGeiselmann makes a good point: your delay calculation doesn't take into account the time spent between calls to busySleep.

You should be calculating a deadline relative to the last deadline, not the current time after logging. Don't use the result from the previous System.nanoTime() either; that will be some time >= the actual deadline (because nanoTime itself takes time, at least a few nanoseconds, so it unavoidably over-sleeps). You'd accumulate error that way.

Before the first iteration, find the current time and set long deadline = System.nanoTime();. At the end of every iteration, do deadline += 1000; and use your busy-wait loop to spin until now >= deadline.


If deadline - now is large enough, use something that yields the CPU to other threads until close to the wakeup deadline. According to comments, LockSupport.parkNanos(…) is a good choice for modern Java, and may actually busy-wait for short enough sleeps(?). I don't really know Java. If so, you should just check the current time, calculate time till deadline, and call it once.

(For future CPUs like Intel Tremont (next-gen Goldmont), LockSupport.parkNanos could portably expose functionality like tpause to idle the CPU core until a given TSC deadline. Not via the OS, just a hyperthreading-friendly deadline pause, good for short sleeps on SMT CPUs.)

Busy-waiting is generally bad but is appropriate for high-precision very short delays. 1 microsecond is not long enough to usefully let the OS context switch to something else and back, on current hardware with current OSes. But longer sleep intervals (when you've chosen a lower frequency) should sleep to let the OS do something useful on this core, instead of just busy waiting for so long.

Ideally when you are spinning on a time-check, you'd be executing an instruction like x86's pause in the delay loop, to be more friendly to the other logical core sharing the same physical core (hyperthreading / SMT). Java 9's Thread.onSpinWait() should be called in spin-wait loops (especially when waiting on memory), which lets the JVM expose this concept in a portable way. (I assume that's what it's for.)
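
Putting those pieces together, here is a minimal sketch of the whole pacing loop (my own, not from the question; doWork(), PERIOD_NS and PARK_THRESHOLD_NS are hypothetical names and thresholds):

import java.util.concurrent.locks.LockSupport;

public class PacedLoop {
    static final long PERIOD_NS = 1_000;           // 1 us -> 1M iterations per second
    static final long PARK_THRESHOLD_NS = 100_000; // only yield the core for long waits

    public static void main(String[] args) {
        long deadline = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            doWork(i);
            deadline += PERIOD_NS;                 // next deadline relative to the last, not to "now"
            long remaining;
            while ((remaining = deadline - System.nanoTime()) > 0) {
                if (remaining > PARK_THRESHOLD_NS) {
                    LockSupport.parkNanos(remaining);  // let the OS run something else
                } else {
                    Thread.onSpinWait();           // SMT-friendly busy-wait (Java 9+)
                }
            }
        }
    }

    private static void doWork(int i) {
        // per-iteration task goes here
    }
}

At a 1 us period the parkNanos branch never fires, because the remaining time is always below the threshold; it only becomes useful when you choose a lower frequency, exactly as described above.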


This will work if your system is fast enough to keep up while running that time-getting function once per iteration. If not, then you could maybe check a deadline every 4 iterations (loop unrolling), to amortize the cost of nanoTime() so you log in bursts of 4 or something.
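
A hedged sketch of that unrolling idea, written as a drop-in replacement for the main loop in the pacing sketch above (same hypothetical names): do the work in bursts of four and pay for one nanoTime() check per burst.

long deadline = System.nanoTime();
for (int i = 0; i < 1_000_000; i += 4) {
    doWork(i); doWork(i + 1); doWork(i + 2); doWork(i + 3);  // burst of 4
    deadline += 4 * PERIOD_NS;                  // advance four periods at a time
    while (System.nanoTime() - deadline < 0) {  // overflow-safe "now < deadline"
        Thread.onSpinWait();
    }
}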

Of course if your system isn't fast enough even with no delay call at all, you'll need to optimize something to fix that. You can't delay for a negative amount of time, and checking the clock itself takes time.

How could one delay a function without the use of sleep / suspending the code?

If you want to execute some specific code at a certain time interval and don't want to use threads (to be able to suspend), then you have to keep track of time and execute the specific code when the delay time has been exceeded.

Example (Java, using System.currentTimeMillis() as the clock):

long timestamp = System.currentTimeMillis();

while (true) {
    if (System.currentTimeMillis() - timestamp > delay) {
        // main functionality

        // reset timer
        timestamp = System.currentTimeMillis();
    }

    // the other functionality you mentioned
}

With this approach, you invoke a specific function every time interval specified by delay. The other functions will be invoked at each iteration of the loop.

In other words, it makes no difference if you delay a function or execute it at specific time intervals.

Time delay loop in Python

case of 1x repetition: doStuff
case of 2x repetition: doStuff, sleep, doStuff
case of 3x repetition: doStuff, sleep, doStuff, sleep, doStuff
...

What you seem to want is to do a bunch of things, then sleep between each of them.

Furthermore, the delay is independent of the nature of the tasks.


Solution 1 -- Sleep if it isn't the last

The easiest way is to special-case the for loop:

for i in range(numTimesToRepeat):
    if i > 0:
        time.sleep(lengthToSleep)
    doStuff()

You could also do the following, but it's much less elegant because you repeat doStuff():

doStuff()
for i in range(numTimesToRepeat - 1):
    time.sleep(lengthToSleep)
    doStuff()

Solution 2 -- While loop with break

One of the most straightforward ways to solve these issues is to break out of a while loop. It's dirty and somewhat non-functional-programming, but it's very much how a human might naturally think of the problem:

numTimesToRepeat = ...
while True:
    doStuff()                     # we did stuff
    numTimesToRepeat -= 1         # repeat one less time because we just did it
    if numTimesToRepeat == 0:     # do we repeat again?
        break                     # no -> done!
    else:
        time.sleep(lengthToSleep) # yes -> sleep before repeating

While loops are annoying because if you make a programming error, your program can loop forever.


Solution 3 -- Detect if last

The above methods do something if it's not the first iteration of the loop. However, "detecting the last iteration of the loop" is how you tried to approach the problem. It is a bit more annoying, but you can do it like this: https://stackoverflow.com/a/6090673/711085 and use it like this:

for i, task, isLast in enumerateLast(tasks):
    task()
    if not isLast:
        time.sleep(lengthToSleep)

Less elegant is to reverse the list and do enumerate():

for untilLast, task in reversed(list(enumerate(reversed(tasks)))):
    task()
    if untilLast > 0:
        time.sleep(lengthToSleep)

(very small irrelevant minutiae: do note that in the reversed(list(enumerate(reversed(iterable)))) case, if tasks were a generator, it would generate every single task to do, enumerate them in reverse order, then do them: this can cause lag, and it also prevents you from using an infinite task generator)


Task abstraction

No matter which way you do it, you may wish to abstract the concept of having tasks by working with callbacks, inserting a delay between each one, sort of like a ''.join.

def callAndWaitBetween(tasks):
    for i, task in enumerate(tasks):
        if i > 0:  # delay between tasks, not before the first
            time.sleep(lengthToSleep)
        task()

For reference, if the delay depended on the tasks, then your tasks should return how long to wait.

PHP sleep delay

Use PHP's sleep() function: http://php.net/manual/en/function.sleep.php
It pauses execution for the given number of seconds before the next loop iteration continues. So something like this:

for ($i = 0; $i <= 10; $i++) {
    $file_exists = file_exists($location.$filename);
    if ($file_exists) {
        break;
    }
    sleep(3); // halt for 3 seconds on every iteration
}

Kotlin Coroutines - How can a delay within a loop complete so fast

how can I debug or print the execution process of each suspended block that runs delaying execution, just like we do for threads viz. Thread.currentThread().name

All you have to do is enable the coroutine debug mode. As per the documentation, the easiest way to do it is with the -ea JVM switch.

If you run the following with -ea:

import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlin.system.measureTimeMillis

fun main() {
    measureTimeMillis {
        runBlocking {
            (1..4).forEach {
                launch {
                    println("$it, delaying in ${Thread.currentThread().name}")
                    delay(1000)
                    println("$it, done in ${Thread.currentThread().name}")
                }
            }
        }
    }.also {
        println("Took $it ms")
    }
}

you should see output like this:

1, delaying in main @coroutine#2
2, delaying in main @coroutine#3
3, delaying in main @coroutine#4
4, delaying in main @coroutine#5
1, done in main @coroutine#2
2, done in main @coroutine#3
3, done in main @coroutine#4
4, done in main @coroutine#5
Took 1099 ms

what has happened to multiple delay blocks

You started 100,000 concurrent coroutines, all of them concurrently spent their 1 second of time sleeping, and then all of them woke up after 1 second, completing your test. This is exactly the same behavior you should expect from starting 100,000 threads and making them sleep.

Just like a thread doesn't occupy a CPU core while sleeping, so a coroutine doesn't occupy a thread while sleeping.


