How to Calculate Time For an Asm Delay Loop on X86 Linux

How can we calculate the delay given by the following for loop in embedded C?

With your solution the controller spends most of its time in an unproductive loop, just wasting CPU cycles and energy.
A better solution would be to drive the LCD from a timer interrupt with a period of t/2 (for example 5 ms), put the data to be written in a ring buffer or similar, and send it out on each cycle.
Just to be sure: if the circuit does not signal ready, leave it alone and write in the next cycle.
With this approach the CPU can be used for calculations, and if nothing is to be done it can simply idle.
Btw: this kind of empty delay loop often gets optimized away by the compiler.
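
A minimal sketch of the ring-buffer idea, not a drop-in driver. All names here (ring_t, ring_put, lcd_next_byte, RING_SIZE) are made up for illustration, and it assumes a single-core controller where only the main code writes and only the timer interrupt reads:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 64u   /* hypothetical size; a power of two so the mask works */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0, "RING_SIZE must be a power of two");

typedef struct {
    uint8_t data[RING_SIZE];
    volatile size_t head;   /* next slot to write (main code) */
    volatile size_t tail;   /* next slot to read (timer interrupt) */
} ring_t;

/* Queue one byte from the main program; returns false if the buffer is full. */
bool ring_put(ring_t *r, uint8_t byte)
{
    size_t next = (r->head + 1u) & (RING_SIZE - 1u);
    if (next == r->tail)
        return false;               /* full: caller may retry later */
    r->data[r->head] = byte;
    r->head = next;
    return true;
}

/* Call once per timer tick. Returns the byte to send now, or -1 if the
 * queue is empty or the LCD is not ready (try again on the next tick). */
int lcd_next_byte(ring_t *r, bool lcd_ready)
{
    if (r->tail == r->head)
        return -1;                  /* nothing queued */
    if (!lcd_ready)
        return -1;                  /* leave the LCD alone this cycle */
    uint8_t byte = r->data[r->tail];
    r->tail = (r->tail + 1u) & (RING_SIZE - 1u);
    return byte;
}
```

In the timer interrupt handler you would pass in whatever your hardware reports as "ready" and, if the result is not -1, write that byte to the LCD. Both of those hardware accesses are left out because they depend on the board.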


How to set 1 second time delay at assembly language 8086

What I finally ended up using was the nop loop:

; start delay

mov bp, 43690
mov si, 43690
delay2:         ; loop target for both jnz instructions
dec bp
nop             ; does not touch the flags set by dec
jnz delay2
dec si
cmp si,0
jnz delay2
; end delay

I used two registers, both set to a high value, and the code keeps looping until both have counted down to zero.

The value I used here was AAAAh (43690 decimal) for both SI and BP; I ended up with roughly 1 second for each delay loop.
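
To get a feel for how long such a nested loop runs, it helps to count the inner iterations. The sketch below is my own illustration (the function names are hypothetical). It assumes the loop shape shown above: BP is not reloaded, so after the first pass it wraps from 0 to FFFFh and every later pass runs 65536 times.

```c
#include <assert.h>
#include <stdint.h>

/* Closed form: the first pass runs bp0 times, every later pass 65536 times.
 * A start value of 0 wraps on the first dec and therefore counts as 65536. */
uint64_t delay_inner_iterations(uint16_t bp0, uint16_t si0)
{
    uint64_t first = bp0 ? bp0 : 65536u;
    uint64_t outer = si0 ? si0 : 65536u;
    return first + (outer - 1u) * 65536u;
}

/* Direct simulation with wrapping 16-bit counters (slow for large values). */
uint64_t simulate_inner_iterations(uint16_t bp, uint16_t si)
{
    uint64_t n = 0;
    do {
        do {
            n++;
        } while (--bp != 0);        /* dec bp / jnz delay2 */
    } while (--si != 0);            /* dec si / cmp si,0 / jnz delay2 */
    return n;
}
```

For 43690 in both registers the closed form gives 2863245994 inner iterations. Multiplying that by the cycles per iteration on your CPU and dividing by the clock rate gives the delay, which is why the same values produce very different delays on different machines or emulators.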

Thanks for the help guys, and yes, we still use MS DOS for this assembly language course :(

Delay Loop in 68k Assembly

There is no 68k instruction that would execute in exactly one cycle. Even a simple NOP already takes four cycles - so you will need to adjust your expectations a bit.

The most simple delay loop one can imagine is

        move.w  #delay-1,d0
loop:   dbf     d0,loop         ; 10 cycles per loop + 14 cycles for last
                                ; (branch not taken)

This will delay for roughly delay * 10 cycles (10 * (delay - 1) + 14 for the dbf loop itself, going by the counts above). Note that delay is word-sized, so the construct is limited to delays between 14 and 655354 cycles. If you want a wider range, you need a different construct that uses a long word counter:

        move.l  #delay,d0
        moveq.l #1,d1
loop:   sub.l   d1,d0           ; 6 cycles for Dn.l->Dn.l
        bne.s   loop            ; 10 cycles for branch

This eats 16 cycles per iteration. It does, however, accept a long word loop counter.
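
As a rough aid for picking the delay constant, the cycle counts quoted above can be put into two small helpers. This is only a sketch with hypothetical function names; it takes the per-instruction figures from this answer at face value, ignores the setup instructions before the loop, and treats the final not-taken bne like any other iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent in the dbf loop for a word-sized delay in 1..65535:
 * 10 per taken branch plus 14 for the final, not-taken pass. */
uint32_t dbf_loop_cycles(uint32_t delay)
{
    assert(delay >= 1 && delay <= 65535);
    return 10u * (delay - 1u) + 14u;
}

/* Approximate cycles for the long-word sub.l/bne.s loop at the quoted
 * 16 cycles per iteration. */
uint64_t sub_bne_loop_cycles(uint32_t delay)
{
    assert(delay >= 1);
    return 16u * (uint64_t)delay;
}
```

The first helper reproduces the 14 and 655354 cycle limits mentioned above for delay values of 1 and 65535.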

If you want to increase the achievable delay, you may think about nested delay loops or more complex instructions and addressing modes inside the loop. These two are, however, the shortest possible delay loops.

delays and measurement of specific instructions

You have 4 main options:

  • delay the 2nd operation by giving it a data dependency on (the result of) the first.
  • lfence, fixed delay sequence, lfence. Both of these can only give a minimum delay; could be much longer depending on CPU frequency scaling and/or interrupts.
  • spin on rdtsc until a deadline (which you calculate somehow, e.g. based on an earlier rdtsc), or do a longer sleep based on a TSC deadline e.g. using the local APIC.
  • Give up and use a different design, or use an in-order microcontroller where you can get reliable cycle-accurate timing at a fixed clock frequency.

This may be an X-Y problem, or at least isn't solvable without getting into the specific details of the two things you want to separate with a delay. (e.g. create a data dependency between a load and a store-address, and lengthen that dep chain with some instructions). There is no general-case answer that works between arbitrary code for very short delays.

If you need accurate delays of only a few clock cycles, you're mostly screwed; superscalar out-of-order execution, interrupts, and variable clock frequency make that essentially impossible in the general case. As @Brendan explained:

For "extremely small and accurate" delays the only option is to give up then reassess the reason why you made the mistake of thinking you wanted it.

For kernel code; for longer delays with slightly less accuracy you could look into using local APIC timer in "TSC deadline mode" (possibly with some adjustment for IRQ exit timing) and/or similar with performance monitoring counters.

For delays of several dozen clock cycles, spin-wait for RDTSC to reach the value you're looking for (see How to calculate time for an asm delay loop on x86 linux?). But that has some minimum overhead to execute RDTSC twice, or RDTSC plus TPAUSE if you have the "waitpkg" ISA extension. (You don't on i9-9900k.) You also need lfence if you want to stop out-of-order exec across the whole thing.
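
A minimal sketch of such a spin-wait in GNU C, assuming GCC or Clang on x86-64 (where x86intrin.h provides __rdtsc and _mm_lfence). The names ticks_now and spin_until are mine, and the CLOCK_MONOTONIC branch is only an assumed fallback so the example still builds elsewhere; it is not part of the technique.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>          /* __rdtsc, _mm_lfence (GCC/Clang) */

/* Read the TSC only after earlier instructions have completed. */
static inline uint64_t ticks_now(void)
{
    _mm_lfence();
    return __rdtsc();
}
#else
#include <time.h>

/* Assumed fallback for non-x86 builds: nanoseconds from CLOCK_MONOTONIC. */
static inline uint64_t ticks_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Spin until the counter reaches `deadline`; returns the first value seen
 * at or past it. The signed difference keeps the comparison wrap-safe. */
uint64_t spin_until(uint64_t deadline)
{
    uint64_t now;
    do {
        now = ticks_now();
    } while ((int64_t)(now - deadline) < 0);
#if defined(__x86_64__)
    _mm_lfence();               /* later code must not start before the last read */
#endif
    return now;
}
```

Typical use would be spin_until(ticks_now() + n) for a delay of at least n reference ticks. The result is still only a minimum: an interrupt during the loop makes the actual delay longer.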

If you need to do something "every 20 ns" or similar, then increment a deadline instead of trying to do a fixed delay between other work, so variation in the other work won't accumulate error. But one interrupt will put you far behind and lead to running your other work back-to-back until you catch up. So as well as checking for the deadline, you'd also want to check for being far behind the deadline and take a new TSC sample.
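
The deadline bookkeeping described above might look something like this. It is a sketch only: next_deadline and the max_lag threshold are hypothetical names, and all values are in whatever tick unit your time source uses (e.g. TSC ticks).

```c
#include <assert.h>
#include <stdint.h>

/* Advance a periodic deadline by one period. If `now` is already more than
 * `max_lag` ticks past the new deadline (e.g. after an interrupt), start
 * over from `now` instead of running the work back-to-back to catch up. */
uint64_t next_deadline(uint64_t deadline, uint64_t now,
                       uint64_t period, uint64_t max_lag)
{
    assert(period > 0);
    uint64_t next = deadline + period;
    /* Signed difference: positive when `now` is already past `next`. */
    int64_t behind = (int64_t)(now - next);
    if (behind > 0 && (uint64_t)behind > max_lag)
        return now + period;        /* far behind: resynchronise */
    return next;                    /* normal case: no accumulated drift */
}
```

Because the normal case adds the period to the previous deadline rather than to the current time, small variations in the work between deadlines do not add up. How large max_lag should be depends on how much catching up your application can tolerate.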

(The TSC ticks at constant frequency on modern x86, but the core clock doesn't: see How to get the CPU cycle count in x86_64 from C++? for more details)

Maybe you can use a data dependency between your real work?

Small delays of a few clock cycles, smaller than the out-of-order scheduler size (see footnote 1), are not really possible without taking the surrounding code into consideration and knowing the exact microarchitecture you're executing on.

Footnote 1: 97-entry RS on Skylake-derived uarches, although there's some evidence that it's not truly a unified scheduler: some entries can only hold some kinds of uops.

If you can create a data dependency between the two things you're trying to separate, you might be able to create a minimum delay between their execution that way. There are ways to couple a dependency chain into another register without affecting its value, e.g. and eax, 0 / or ecx, eax makes ECX depend on the instruction that wrote EAX without affecting the value of ECX. (Make a register depend on another one without changing its value).

e.g. between two loads, you could create a data dependency from the load result of one into the load address of the later load, or into a store address. Coupling two store addresses together with a dependency chain is less good; the first store could take a bunch of extra time (e.g. for a dTLB miss) after the address is known, so two stores end up committing back-to-back after all. You might need mfence then lfence between two stores if you want to put a delay before the 2nd store. See also Are loads and stores the only instructions that gets reordered? for more about OoO exec across lfence (and mfence on Skylake).

This may require writing your "real work" in asm, too, unless you can come up with a way to "launder" the data dependency from the compiler with a small inline asm statement.

CMC is one of the few single-byte instructions available in 64-bit mode that you can just repeat to create a latency bottleneck (1 cycle per instruction on most CPUs) without also accessing memory (like lodsb which bottlenecks on merging into the low byte of RAX). xchg eax, reg would also work, but that's 3 uops on Intel.

Instead of lfence, you could couple that dep chain into a specific instruction using adc reg, 0, if you start with a known CF state and use an odd or even number of CMC instructions such that CF=0 at that point. Or cmovc same,same would make a register value depend on CF without modifying it, regardless of whether CF was set or cleared.

However, single-byte instructions can create weird front-end effects when you have too many in a row for the uop cache to handle. That's what slows down CDQ if you repeat it indefinitely; apparently Skylake can only decode it at 1/clock in the legacy decoders (see Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?). That may be ok and/or what you want. 3 cycles per 3-byte instruction would let this code be cached by the uop cache, e.g. imul eax, eax or imul eax, 0. But maybe it's better to avoid polluting the uop cache with code that's supposed to run slowly.

Between LFENCE instructions, cld is 3 uops and has a 4c throughput on Skylake, so if you're using lfence at the start/end of your delay that could be usable.

Also of course, any dead-reckoning delay in terms of a certain number of some instructions (not rdtsc) will depend on the core clock frequency, not the reference frequency. And at best it's a minimum delay; if an interrupt comes in during your delay loop, the total delay will be close to the total of interrupt handling time plus whatever your delay-loop took.

Or if the CPU happens to be running at idle speed (often 800MHz), the delay in nanoseconds will be much longer than if the CPU is at max turbo.
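
To put numbers on that, here is a small conversion helper. The function name is hypothetical, and the 5000 MHz figure used below is just an assumed turbo clock for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds taken by `cycles` core clock cycles at `core_mhz` MHz. */
double cycles_to_ns(uint64_t cycles, double core_mhz)
{
    assert(core_mhz > 0.0);
    return (double)cycles * 1000.0 / core_mhz;
}
```

A delay sequence that takes 1000 core cycles lasts 1250 ns at 800 MHz but only 200 ns at 5000 MHz, so the same instruction sequence can differ by more than a factor of six in wall-clock time.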

Re: your 2nd experiment with CMC between lfence OoO exec barriers

Yes, you can pretty accurately control the core clock cycles between two lfence instructions, or between lfence and rdtscp, with a simple dependency chain, pause instruction, or a throughput bottleneck on some execution unit(s), possibly the integer or FP divider. But I assume your real use case cares about the total delay between stuff before the first lfence and stuff after the 2nd lfence.

The first lfence has to wait for whatever instructions were previously in flight to retire from the out-of-order back-end (ROB = reorder buffer, 224 fused-domain uops on Skylake-family). If those included any loads that might miss in cache, your wait time can vary tremendously, and be much longer than you probably want.

Is it because CMC instructions back to back have no dependency on each other but CDQ instructions do have a dependency in between them?

You have that backwards: CMC has a true dependency on the previous CMC because it reads and writes the carry flag. Just like not eax has a true dependency on the previous EAX value.

CDQ does not: it reads EAX and writes EDX. Register renaming makes it possible for RDX to be written more than once in the same clock cycle. e.g. Zen can run 4 cdq instructions per clock. Your Coffee Lake can run 2 CDQ per clock (0.5c throughput), bottlenecked on the back-end ports it can run on (p0 and p6).

Agner Fog's numbers were based on testing a huge block of repeated instruction, apparently bottlenecking on legacy-decode throughput of 1/clock. (Again, see Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions? ). numbers are closer to accurate for small repeat counts for Coffee Lake, showing it as 0.6 c throughput. (But if you look at the detailed breakdown, with an unroll count of 500 confirms that Coffee Lake still has that front-end bottleneck).

But increasing the repeat count up past about 20 (if aligned) will lead to the same legacy-decode bottleneck that Agner saw. However, if you don't use lfence, decode could be far ahead of execution so this is not good.

CDQ is a poor choice because of the weird front-end effects, and/or being a back-end throughput bottleneck instead of latency. But OoO exec can still see around it once the front-end gets past the repeated CDQs. 1-byte NOP could create a front-end bottleneck which might be more usable depending on what two things you were trying to separate.

BTW, if you don't fully understand dependency chains and their implications for out-of-order execution, and probably a bunch of other cpu-architecture details about the exact CPU you're using (e.g. store buffers if you want to separate any stores), you're going to have a bad time trying to do anything meaningful.

If you can do what you need with just a data dependency between two things, that might reduce the amount of stuff you need to understand to achieve anything like what you described as your goal.

Otherwise you probably need to understand basically all of this answer (and Agner Fog's microarchitecture guide) to figure out how your real problem translates into something you can actually make a CPU do. Or realize that it can't, and you'll need something else. (Like maybe a very fast in-order CPU, perhaps ARM, where you can somewhat control timing between independent instructions with delay sequences / loops.)
