What Is Iaca and How to Use It

What is IACA and how do I use it?

2019-04: Reached EOL. Suggested alternative: LLVM-MCA
2017-11: Version 3.0 released (latest as of 2019-05-18)
2017-03: Version 2.3 released

What it is:

IACA (the Intel Architecture Code Analyzer) is a (2019: end-of-life) freeware, closed-source static analysis tool made by Intel to statically analyze the scheduling of instructions when executed by modern Intel processors. This allows it to compute, for a given snippet,

In Throughput mode, the maximum throughput (the snippet is assumed to be the body of an innermost loop)
In Latency mode, the minimum latency from the first instruction to the last.
In Trace mode, prints the progress of instructions through their pipeline stages.

when assuming optimal execution conditions (All memory accesses hit L1 cache and there are no page faults).

IACA supports computing schedulings for Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake processors as of version 2.3 and Haswell, Broadwell and Skylake as of version 3.0.

IACA is a command-line tool that produces ASCII text reports and Graphviz diagrams. Versions 2.1 and below supported 32- and 64-bit Linux, Mac OS X and Windows and analysis of 32-bit and 64-bit code; Version 2.2 and up only support 64-bit OSes and analysis of 64-bit code.

How to use it:

IACA's input is a compiled binary of your code, into which have been injected two markers: a start marker and an end marker. The markers make the code unrunnable, but allow the tool to find quickly the relevant pieces of code and analyze them.

You do not need the ability to run the binary on your system; In fact, the binary supplied to IACA can't run anyways because of the presence of the injected markers in the code. IACA only requires the ability to read the binary to be analyzed. Thus it is possible, using IACA, to analyze a Haswell binary employing FMA instructions on a Pentium III machine.

C/C++

In C and C++, one gains access to marker-injecting macros with #include "iacaMarks.h", where iacaMarks.h is a header that ships with the tool in the include/ subdirectory.

One then inserts the markers around the innermost loop of interest, or the straight-line chunk of interest, as follows:

/* C or C++ usage of IACA */

while(cond){
    IACA_START
    /* Loop body */
    /* ... */
}
IACA_END

The application is then rebuilt as it otherwise would with optimizations enabled (In Release mode for users of IDEs such as Visual Studio). The output is a binary that is identical in all respects to the Release build except with the presence of the marks, which make the application non-runnable.

IACA relies on the compiler not reordering the marks excessively; As such, for such analysis builds certain powerful optimizations may need to be disabled if they reorder the marks to include extraneous code not within the innermost loop, or exclude code within it.

Assembly (x86)

IACA's markers are magic byte patterns injected at the correct location within the code. When using iacaMarks.h in C or C++, the compiler handles inserting the magic bytes specified by the header at the correct location. In assembly, however, you must manually insert these marks. Thus, one must do the following:

    ; NASM usage of IACA
    
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
    
.innermostlooplabel:
    ; Loop body
    ; ...
    jne .innermostlooplabel ; Conditional branch backwards to top of loop

    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes

It is critical for C/C++ programmers that the compiler achieve this same pattern.

What it outputs:

As an example, let us analyze the following assembler example on the Haswell architecture:

.L2:
    vmovaps         ymm1, [rdi+rax] ;L2
    vfmadd231ps     ymm1, ymm2, [rsi+rax] ;L2
    vmovaps         [rdx+rax], ymm1 ; S1
    add             rax, 32         ; ADD
    jne             .L2             ; JMP

We add immediately before the .L2 label the start marker and immediately after jne the end marker. We then rebuild the software, and invoke IACA thus (On Linux, assumes the bin/ directory to be in the path, and foo to be an ELF64 object containing the IACA marks):

iaca.sh -64 -arch HSW -graph insndeps.dot foo

, thus producing an analysis report of the 64-bit binary foo when run on a Haswell processor, and a graph of the instruction dependencies viewable with Graphviz.

The report is printed to standard output (though it may be directed to a file with a -o switch). The report given for the above snippet is:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

The tool helpfully points out that currently, the bottleneck is the Haswell frontend and Port 2 and 3's AGU. This example allows us to diagnose the problem as the store not being processed by Port 7, and take remedial action.

Limitations:

IACA does not support a certain few instructions, which are ignored in the analysis. It does not support processors older than Nehalem and does not support non-innermost loops in throughput mode (having no ability to guess which branch is taken how often and in what pattern).

Using IACA with non-assembly routine

I don't use IACA so I can't test this idea, and I will delete the answer if it does not work, but can you not just do something like this:

procedure TForm10.Button1Click(Sender: TObject);
begin
  asm
    //RCX = self
    //edx = a
    //r8d = b

    mov ebx, 111      // Start IACA marker bytes
    db $64, $67, $90  // Start IACA marker bytes
  end;

  fRotate( fLine - Point(0,1), 23 );

  asm
    mov ebx, 222      // End IACA marker bytes
    db $64, $67, $90  // End IACA marker bytes

  end;
end;

This was just a sample routine from something else to check that it compiles, which it does.

Sadly this only works for 32 bit - as Johan points out it is not allowed for 64 bit.

For 64 bit the following may work, but again I cannot test it.

procedure TForm10.Button1Click(Sender: TObject);
  procedure Test1;
  asm
    //RCX = self
    //edx = a
    //r8d = b

    mov ebx, 111      // Start IACA marker bytes
    db $64, $67, $90  // Start IACA marker bytes
  end;
  procedure Test2;
  begin
    fRotate( fLine - Point(0,1), 23 );
  end;
  procedure Test3;
  asm
    mov ebx, 222      // End IACA marker bytes
    db $64, $67, $90  // End IACA marker bytes

  end;
begin
  Test1;
  Test2;
  Test3;
end;

Intel IACA analyzer alters assembly?

IACA mark macros are just inline asm (or for 64-bit MSVC: start = __writegsbyte(111, 111); and stop = 222). They can potentially disturb the optimizer, or end up in the wrong place (e.g. not the last instruction before falling into a loop, so the block includes some loop setup).

If that happens, like in your case, your best bet is to ask the compiler to produce asm (not machine code) output, and manually insert the markers into the asm you want to analyze.

In NASM syntax, I use this %if / %else block so I can build with nasm -DIACA_MARKS or not. I know this isn't the right syntax for MASM, but the IACA start/end markers are pretty simple: mov to EBX and fs addr32 nop.

%ifdef IACA_MARKS

%macro  IACA_start 0             ; NASM macro with 0 args, defines IACA_start
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro

%else
%define IACA_start
%define IACA_end
%endif

Micro fusion and addressing modes

In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).

But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.

Intel's optimization manual describes un-lamination for Sandybridge in Section E.2.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.

UPDATE: Now Intel manual has a detailed section to describe un-lamination for Haswell. See section E.1.5 Unlamination. And a brief description for SandyBridge is in section E.2.2.4.

The rules, as best I can tell from experiments on SnB, HSW, and SKL:

SnB (and I assume also IvB): indexed addressing modes are always un-laminated, others stay micro-fused. IACA is (mostly?) correct.
HSW, SKL: These only keep an indexed ALU instruction micro-fused if it has 2 operands and treats the dst register as read-modify-write. Here "operands" includes flags, meaning that adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse since they generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0, [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction where the first operand is write only, such as pabsb xmm0, [rax + rbx] also do not fuse. IACA is wrong, applying the SnB rules.

Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but src with dst+(initial_src-initial_dst). Then you only have to increment the dst register inside a loop.)

Note that some instructions never micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem], imm8, or vinsertf128 ymm, ymm, [mem], imm8, are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm,[mem] (vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.

See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: Lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands: one-register addressing modes and immediates. We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.

Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.

# store
mov        [rax], edi  SnB/HSW/SKL: 1 fused-domain, 2 unfused.  The store-address uop can run on port7.
mov    [rax+rsi], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.  (The store-address can't use port7, though).
mov [buf +rax*4], edi  SnB: unlaminated.  HSW/SKL: stays micro-fused.

# normal ALU stuff
add    edx, [rsp+rsi]  SnB: unlaminated.  HSW/SKL: stays micro-fused.  
# I assume the majority of traditional/normal ALU insns are like add

Three-input instructions that HSW/SKL may have to un-laminate

vfmadd213ps xmm0,xmm0,[rel buf] HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi]     HSW/SKL: stays micro-fused
vfmadd213ps xmm0,xmm0,[0+rdi*4] HSW/SKL: un-laminated: 2 uops in fused & unfused-domains.
     (So indexed addressing mode is still the condition for HSW/SKL, same as documented by Intel for SnB)

# no idea why this one-source BMI2 instruction is unlaminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi   edi, [rdi]       HSW/SKL: 1 fused-domain, 2 unfused.
blsi   edi, [rdi+rsi]   HSW/SKL: 2 fused & unfused-domain.


adc         eax, [rdi] same as cmov r, [rdi]
cmove       ebx, [rdi]   Stays micro-fused.  (SnB?)/HSW: 2 fused-domain, 3 unfused domain.  
                         SKL: 1 fused-domain, 2 unfused.

# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does since a one-register addressing mode does.

adc   eax, [rdi+rsi] same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax]  SnB: untested, probably 3 fused&unfused-domain.
                      HSW: un-laminated to 3 fused&unfused-domain.  
                      SKL: un-laminated to 2 fused&unfused-domain.

I assume that Broadwell behaves like Skylake for adc/cmov.

It's strange that HSW un-laminates memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB before they hit the deadline for shipping Haswell.

Agner's insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. The cycle counts I'm measuring match up with the the fused-domain uop issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and correct the tables.

Memory-dest integer ALU:

add        [rdi], eax  SnB: untested (Agner says 2 fused-domain, 4 unfused-domain (load + ALU  + store-address + store-data)
                       HSW/SKL: 2 fused-domain, 4 unfused.
add    [rdi+rsi], eax  SnB: untested, probably 4 fused & unfused-domain
                       HSW/SKL: 3 fused-domain, 4 unfused.  (I don't know which uop stays fused).
                  HSW: About 0.95 cycles extra store-forwarding latency vs. [rdi] for the same address used repeatedly.  (6.98c per iter, up from 6.04c for [rdi])
                  SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again in a tiny loop with dec ecx/jnz


adc     [rdi], eax      SnB: untested
                        HSW: 4 fused-domain, 6 unfused-domain.  (same-address throughput 7.23c with dec, 7.19c with sub ecx,1)
                        SKL: 4 fused-domain, 6 unfused-domain.  (same-address throughput ~5.25c with dec, 5.28c with sub)
adc     [rdi+rsi], eax  SnB: untested
                        HSW: 5 fused-domain, 6 unfused-domain.  (same-address throughput = 7.03c)
                        SKL: 5 fused-domain, 6 unfused-domain.  (same-address throughput = ~5.4c with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop branch.)

Yes, that's right, adc [rdi],eax / dec ecx / jnz runs faster than the same loop with add instead of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected. See also this post about repeated store/reload to the same address being slower than expected on SKL.

Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem-case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).

Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg.

Why SnB-family un-laminates:

Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input / output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions

Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).

So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "un-lamination" happens before the 4-fused-domain-uops-per-cycle issue/retire width out-of-order core. The fused-domain performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after un-lamination.

Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue / loop-buffer (aka the IDQ).

TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.

Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4.)

(See the a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)

Experimental testing on SnB

The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop NFS-mounted my SKL desktop's /home.)

The SnB numbers were measured as described below, on an i5-2500k which is no longer working.

Confirmed by testing with performance counters for uops and cycles.

I found a table of PMU events for Intel Sandybridge, for use with Linux's perf command. (Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events, like uops.) I made use of it for a recent answer.

ocperf.py provides symbolic names for these uarch-specific PMU events, so you don't have to look up tables. Also, the same symbolic name works across multiple uarches. I wasn't aware of it when I first wrote this answer.

To test for uop micro-fusion, I constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs. To avoid any execution-port contention, many of these uops are nops, which still sit in the uop cache and go through the pipeline the same as any other uop, except they don't get dispatched to an execution port. (An xor x, same, or an eliminated move, would be the same.)

Test program: yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test

GLOBAL _start
_start:
    xor eax, eax
    xor ebx, ebx
    xor edx, edx
    xor edi, edi
    lea rsi, [rel mydata]   ; load pointer
    mov ecx, 10000000
    cmp dword [rsp], 2      ; argc >= 2
    jge .loop_2reg

ALIGN 32
.loop_1reg:
    or eax, [rsi + 0]
    or ebx, [rsi + 4]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_1reg
;   xchg r8, r9     ; no effect on flags; decided to use NOPs instead

    jmp .out

ALIGN 32
.loop_2reg:
    or eax, [rsi + 0 + rdi]
    or ebx, [rsi + 4 + rdi]
    dec ecx
    nop
    nop
    nop
    nop
    jg .loop_2reg

.out:
    xor edi, edi
    mov eax, 231    ;  exit(0)
    syscall

SECTION .rodata
mydata:
db 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure and don't still have working SnB hardware.

I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between the dec and the branch. I used 4 nops, so with micro-fusion, the loop would be 8 uops, and fill the pipeline with at 2 cycles per 1 iteration.

In the other version of the loop, using 2-operand addressing modes that don't micro-fuse, the loop will be 10 fused-domain uops, and run in 3 cycles.

Results from my 3.3GHz Intel Sandybridge (i5 2500k). I didn't do anything to get the cpufreq governor to ramp up clock speed before testing, because cycles are cycles when you aren't interacting with memory. I've added annotations for the performance counter events that I had to enter in hex.

testing the 1-reg addressing mode: no cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test

Performance counter stats for './uop-test':

     11.489620      task-clock (msec)         #    0.961 CPUs utilized
    20,288,530      cycles                    #    1.766 GHz
    80,082,993      instructions              #    3.95  insns per cycle
                                              #    0.00  stalled cycles per insn
    60,190,182      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
    80,203,853      r10e  ; UOPS_ISSUED: fused-domain
    80,118,315      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
   100,136,097      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
       220,440      stalled-cycles-frontend   #    1.09% frontend cycles idle
       193,887      stalled-cycles-backend    #    0.96% backend  cycles idle

   0.011949917 seconds time elapsed

testing the 2-reg addressing mode: with a cmdline arg

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test x

 Performance counter stats for './uop-test x':

         18.756134      task-clock (msec)         #    0.981 CPUs utilized
        30,377,306      cycles                    #    1.620 GHz
        80,105,553      instructions              #    2.64  insns per cycle
                                                  #    0.01  stalled cycles per insn
        60,218,693      r1b1  ; UOPS_DISPATCHED: (unfused-domain.  1->umask 02 -> uops sent to execution ports from this thread)
       100,224,654      r10e  ; UOPS_ISSUED: fused-domain
       100,148,591      r2c2  ; UOPS_RETIRED: retirement slots used (fused-domain)
       100,172,151      r1c2  ; UOPS_RETIRED: ALL (unfused-domain)
           307,712      stalled-cycles-frontend   #    1.01% frontend cycles idle
         1,100,168      stalled-cycles-backend    #    3.62% backend  cycles idle

       0.019114911 seconds time elapsed

So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (or with a memory source dispatches to an ALU for the or, and a load port for the load, regardless of whether it was micro-fused or not in the rest of the pipeline. nop doesn't dispatch to an execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the 40M nops count here.

The difference is in the counters for the fused-domain.

The 1-register address version only issues and retires 80M fused-domain uops. This is the same as the number of instructions. Each insn turns into one fused-domain uop.
The 2-register address version issues 100M fused-domain uops. This is the same as the number of unfused-domain uops, indicating that no micro-fusion happened.

I suspect that you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used) if branch mispredicts led to uops being cancelled after issue, but before retirement.

And finally, the performance impact is real. The non-fused version took 1.5x as many clock cycles. This exaggerates the performance difference compared to most real cases. The loop has to run in a whole number of cycles (on Sandybridge where the LSD is less sophisticated), and the 2 extra uops push it from 2 to 3. Often, an extra 2 fused-domain uops will make less difference. And potentially no difference, if the code is bottlecked by something other than 4-fused-domain-uops-per-cycle.

Still, code that makes a lot of memory references in a loop might be faster if implemented with a moderate amount of unrolling and incrementing multiple pointers which are used with simple [base + immediate offset] addressing, instead of the using [base + index] addressing modes.

Further stuff

Bottleneck when using indexed addressing modes - un-lamination may slow down the front-end more than an extra 1 uop normally would.

RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case even in the decoders / uop-cache, so they never fuse in the first place (rather than being un-laminated).

IACA gets this wrong, and claims that both of these micro-fuse:

cmp dword  [abs mydata], 0x1b   ; fused counters != unfused counters (micro-fusion happened, and wasn't un-laminated).  Uses 2 entries in the uop-cache, according to Agner Fog's testing
cmp dword  [rel mydata], 0x1b   ; fused counters ~= unfused counters (micro-fusion didn't happen)

(There are some more limits for micro+macro fusion to both happen for a cmp/jcc. TODO: write that up for testing a memory location.)

RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:

or  eax, dword  [rel mydata]    ; fused counters != unfused counters, i.e. micro-fusion happens

Micro-fusion doesn't increase the latency of an instruction. The load can issue before the other input is ready.

ALIGN 32
.dep_fuse:
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    or eax, [rsi + 0]
    dec ecx
    jg .dep_fuse

This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a sequence of or eax, [rsi + 0 + rdi], or mov ebx, [rsi + 0 + rdi] / or eax, ebx. (The unfused and the mov versions both run the same number of uops.) Scheduling / dep checking happens in the unfused-domain. Newly issued uops go into the scheduler (aka Reservation Station (RS)) as well as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit), but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at least the scheduler size (54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake).

Micro-fusion doesn't have a shortcut for the base and offset being the same register. A loop with or eax, [mydata + rdi+4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of odd-sized structs starting at a fixed address. This is probably never used in most programs, so it's no surprise that Intel didn't spend transistors on allowing this special-case of 2-register modes to micro-fuse.