What Is Intel Microcode

What is Intel microcode?

In older times, microcode was heavily used in CPUs: every single instruction was executed as a sequence of micro-instructions. This enabled relatively complex instruction sets on modest transistor budgets (consider that a Motorola 68000, with its many operand modes and eight 32-bit registers, fits in roughly 68,000 transistors, whereas a single core of a modern x86 uses well over a hundred million). This is no longer true. For performance reasons, most instructions are now "hardwired": their interpretation is performed by inflexible circuitry, outside of any microcode.

On a recent x86 CPU, it is plausible that some complex instructions such as fsin (which computes the sine of a floating-point value) are implemented with microcode, while simple instructions (including integer multiplication with imul) are not. This limits what could be achieved with custom microcode.

That being said, the microcode format is not only highly specific to the processor model (e.g. microcode for a Pentium III and a Pentium 4 cannot be exchanged with each other -- and, of course, using Intel microcode on an AMD processor is out of the question), it is also a closely guarded secret. Intel has published the method by which an operating system or a motherboard BIOS may update the microcode (it must be done after each hard reset; the update is kept in volatile RAM), but the microcode contents are undocumented. The Intel® 64 and IA-32 Architectures Software Developer's Manual (volume 3A) describes the update procedure (section 9.11, "Microcode Update Facilities") but states that the actual microcode is "encrypted" and chock-full of checksums. The wording is vague enough that just about any kind of cryptographic protection may be hiding there, but the bottom line is that it is not currently possible for anyone other than Intel to write and try custom microcode.

If the "encryption" does not include a digital (asymmetric) signature and/or if the people at Intel botched the protection system somehow, then it may be conceivable that some remarkable reverse-engineering effort could potentially enable one to produce such microcode, but, given the probably limited applicability (since most instructions are hardwired), chances are that this would not buy much, as far as programming power is concerned.

What does the Intel microcode update revision number mean?

Each microcode update has a revision. The revision can only be increased, but that rule is enforced by the operating system; the processor itself does not enforce it.

(Non-ancient Intel processors do enforce some no-downgrade barriers, but these are not based on the microcode update revision number; they look at some undocumented fields inside the microcode update itself.)

The microcode revision number is a 32-bit signed number. There is no processor-enforced structure to it, although some Intel microcode teams working on particular microarchitectures appear to have structured it a bit to ease their work.

For future reference:
https://manpages.debian.org/unstable/iucode-tool/iucode_tool.8.en.html
https://github.com/platomav/MCExtractor/wiki/Intel-Microcode-Extra-Undocumented-Header

What is a microcoded instruction?

A CPU reads machine code and decodes it into internal control signals that send the right data to the right execution units.

Most instructions map to one internal operation, and can be decoded directly. (e.g. on x86, add eax, edx just sends eax and edx to the integer ALU for an ADD operation, and puts the result in eax.)

Some other single instructions do much more work. e.g. x86's rep movs implements memcpy(edi, esi, ecx), and requires the CPU to loop.

When the instruction decoders see an instruction like that, instead of producing internal control signals directly, they read micro-code out of the microcode ROM.

A micro-coded instruction is one that decodes to many internal operations.


Modern x86 CPUs always decode x86 instructions to internal micro-operations. In this terminology, it still doesn't count as "micro-coded" even when add [mem], eax decodes to a load from [mem], an ALU ADD operation, and a store back into [mem]. Another example is xchg eax, edx, which decodes to 3 uops on Intel Haswell. And interestingly, not exactly the same kind of uops you'd get from using 3 MOV instructions to do the exchange with a scratch register, because they aren't zero-latency.

On Intel / AMD CPUs, "micro-coded" means the decoders turn on the micro-code sequencer to feed uops from the ROM into the pipeline, instead of producing multiple uops directly.

(You could call any multi-uop x86 instruction "microcoded" if you were thinking in pure RISC terms, but it's useful to use the term "microcoded" to make a different distinction, IMO. This meaning is I think widespread in x86 optimization circles, like Intel's optimization manual. Other people may use different meanings for terminology, especially if talking about other architectures or about computer architecture in general when comparing x86 to a RISC.)

In current Intel CPUs, the limit on what the decoders can produce directly, without going to micro-code ROM, is 4 uops (fused-domain). AMD similarly has FastPath (aka DirectPath) single or double instructions (1 or 2 "macro-ops", AMD's equivalent of uops), and beyond that it's VectorPath aka Microcode, as explained in David Kanter's in-depth look at AMD Bulldozer, specifically talking about its decoders.

Another example is x86's integer DIV instruction, which is micro-coded even on modern Intel CPUs like Haswell. But not on AMD; AMD just has one or two uops that activate everything inside the integer divider unit. This is not fundamental to DIV, just an implementation choice. See my answer on C++ code for testing the Collatz conjecture faster than hand-written assembly - why? for the numbers.

FP division is also slow, but is decoded to a single uop so it doesn't bottleneck the front-end. If FP division is rare and not part of a latency bottleneck, it can be as cheap as multiplication. (But if execution does have to wait for its result, or bottlenecks on its throughput, it's much slower.) More in this answer.

Integer division and other micro-coded instructions can give the CPU a hard time, and create effects that make code alignment matter where it wouldn't otherwise.


To learn more about x86 CPU internals, see the x86 tag wiki, and especially Agner Fog's microarch guide.

Also David Kanter's deep dives into x86 microarchitectures are useful to understand the pipeline that uops go through: Core 2 and Sandy Bridge being major ones, also AMD K8 and Bulldozer articles are interesting for comparison.

RISC vs. CISC Still Matters (Feb 2000) by Paul DeMone looks at how PPro breaks down instructions into uops, vs. RISCs, where most instructions are simple enough to go through the pipeline in one step, with only rare ones like ARM's push/pop of multiple registers needing to send multiple things down the pipeline (aka microcoded, in RISC terms).

And for good measure, Modern Microprocessors: A 90-Minute Guide! is always worth recommending for the basics of pipelining and OoO exec.



Other uses of the term in very different contexts than modern x86

In some older / simpler CPUs, every instruction was effectively micro-coded. For example, the 6502 executed 6502 instructions by running a sequence of internal instructions from a PLA decode ROM. This works well for a non-pipelined CPU, where the order of using the different parts of the CPU can vary from instruction to instruction.


Historically, there was a different technical meaning for "microcode": something like the internal control signals decoded from the instruction word, especially in a CPU like MIPS where the instruction word mapped directly to those control signals without complicated decoding. (I may have this partly wrong; I read something like this, other than in the deleted answer on this question, but couldn't find it again later.)

This meaning may still actually get used in some circles, like when designing a simple pipelined CPU, like a hobby MIPS.

Can we read and program the microcode of AMD processors?

This article provides information on the microcode of AMD's Opteron (K8) family. It claims that the microcode is not encrypted, and documents the microcode format and the update procedure.

How are microcodes executed during an instruction cycle?

div is not simple, it's one of the hardest integer operations to compute! It's microcoded on Intel CPUs, unlike mov, or add/sub or even imul which are all single-uop on modern Intel. See https://agner.org/optimize/ for instruction tables and microarch guides. (Fun fact: AMD Ryzen doesn't microcode div; it's only 2 uops because it has to write 2 output registers. Piledriver and later also make 32 and 64-bit division 2 uops.)

All instructions decode to 1 or more uops (with most instructions in most programs being 1 uop on current CPUs). Instructions which decode to 4 or fewer uops on Intel CPUs are described as "not microcoded", because they don't use the special MSROM mechanism for many-uop instructions.


No CPUs that decode x86 instructions to uops use a simple 3-phase fetch/decode/exec cycle, so that part of the premise of your question makes no sense. Again, see Agner Fog's microarch guide.

Are you sure you wanted to ask about modern Intel CPUs? Some older CPUs are internally microcoded, especially non-pipelined CPUs where the process of executing different instructions can activate different internal logic blocks in a different order. The logic that controls this is also called microcode, but it's a different kind of microcode from the modern meaning of the term in the context of a pipelined out-of-order CPU.

If that's what you're looking for, see How was microcode implemented in retro processors? on retrocomputing.SE for non-pipelined CPUs like 6502 and Z80, where some of the microcode internal timing cycles are documented.


How do microcoded instructions execute on modern Intel CPUs?

When a microcoded "indirect uop" reaches the head of the IDQ in a Sandybridge-family CPU, it takes over the issue/rename stage and feeds it uops from the microcode-sequencer MS-ROM until the instruction has issued all its uops, then the front-end can resume issuing other uops into the out-of-order back-end.

The IDQ is the Instruction Decode Queue that feeds the issue/rename stage (which sends uops from the front-end into the out-of-order back-end). It buffers uops that come from the uop cache + legacy decoders, to absorb bubbles and bursts. It's the 56 uop queue in David Kanter's Haswell block diagram. (But that shows microcode only being read before the queue, which doesn't match Intel's description of some perf events1, or what has to happen for microcoded instructions that run a data-dependent number of uops).

(This might not be 100% accurate, but at least works as a mental model for most of the performance implications2. There might be other explanations for the performance effects we've observed so far.)

This only happens for instructions that need more than 4 uops; instructions that need 4 or fewer decode to separate uops in the normal decoders and can issue normally. e.g. xchg eax, ecx is 3 uops on modern Intel: Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? goes into detail about what we can figure out about what those uops actually are.

The special "indirect" uop for a microcoded instruction takes a whole line to itself in the decoded-uop cache, the DSB (potentially causing code-alignment performance issue). I'm not sure if they only take 1 entry in the queue that feeds the issue stage from the uop cache and/or legacy decoders, the IDQ. Anyway, I made up the term "indirect uop" to describe it. It's really more like a not-yet-decoded instruction or a pointer into the MS-ROM. (Possibly some microcoded instructions might be a couple "normal" uops and one microcode pointer; that could explain it taking a whole uop-cache line to itself.)

I'm pretty sure they don't fully expand until they reach the head of the queue, because some microcoded instructions are a variable number of uops depending on data in registers. Notably rep movs, which basically implements memcpy. In fact this is tricky: with different strategies depending on alignment and size, rep movs actually needs to do some conditional branching. But it's jumping to different MS-ROM locations, not to different x86 machine-code locations (RIP values). See Conditional jump instructions in MSROM procedures?.

Intel's fast-strings patent also sheds some light on the original implementation in P6: the first n copy iterations are predicated in the back-end, which gives the back-end time to send the value of ECX to the MS. From that, the microcode sequencer can send exactly the right number of copy uops if more are needed, with no branching needed in the back-end. Maybe the mechanism for handling nearly-overlapping src and dst or other special cases isn't based on branching after all, but Andy Glew did mention lack of microcode branch prediction as an issue for the implementation. So we know they are special. And that was back in P6 days; rep movsb is more complicated now.

Depending on the instruction, it might or might not drain the out-of-order back end's reservation station aka scheduler while sorting out what to do. rep movs does that for copies > 96 bytes on Skylake, unfortunately (according to my testing with perf counters, putting rep movs between independent chains of imul). This might be due to mispredicted microcode branches, which aren't like regular branches. Maybe branch-miss fast-recovery doesn't work on them, so they aren't detected / handled until they reach retirement? (See the microcode branch Q&A for more about this).


rep movs is very different from mov. Normal mov like mov eax, [rdi + rcx*4] is a single uop even with a complex addressing mode. A mov store is 1 micro-fused uop, including both a store-address and store-data uop that can execute in either order, writing the data and physical address into the store buffer so the store can commit to L1d after the instruction retires from the out-of-order back-end and becomes non-speculative. The microcode for rep movs will include many load and store uops.


Footnote 1:

We know there are perf events like idq.ms_dsb_cycles on Skylake:

[Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser[sic] (MS) is busy]

That would make no sense if microcode were just a 3rd possible source of uops to feed into the front of the IDQ. But then there's an event whose description sounds like that:

idq.ms_switches
[Number of switches from DSB (Decode Stream Buffer) or MITE (legacy
decode pipeline) to the Microcode Sequencer]

I think this actually means it counts when the issue/rename stage switches to taking uops from the microcode sequencer instead of the IDQ (which holds uops from DSB and/or MITE). Not that the IDQ switches its source of incoming uops.

Footnote 2:

To test this theory, we could construct a test case with lots of easily-predicted jumps to cold i-cache lines after a microcoded instruction, and see how far the front-end gets in following cache misses and queueing up uops into the IDQ and other internal buffers during the execution of a big rep scasb.

SCASB doesn't have fast-strings support, so it's very slow and doesn't touch a huge amount of memory per cycle. We want it to hit in L1d so timing is highly predictable. A couple of 4k pages probably gives the front-end enough time to follow a lot of i-cache misses. We can even map contiguous virtual pages to the same physical page (e.g. from user-space with mmap on a file).

If the IDQ space behind the microcoded instruction can be filled up with later instructions while it's executing, that leaves more room for the front-end to fetch from more i-cache lines ahead of when they're needed. We can then hopefully detect the difference with total cycles and/or other perf counters, for running rep scasb plus a sequence of jumps. Before each test, use clflushopt on the lines holding the jump instructions.

To test rep movs this way, we could maybe play tricks with virtual memory to get contiguous pages mapped to the same physical page, again giving us L1d hits for loads + stores, but dTLB delays would be hard to control. Or even boot with the CPU in no-fill mode, but that's very hard to use and would need a custom "kernel" to put the result somewhere visible.

I'm pretty confident we would find uops entering the IDQ while a microcoded instruction has taken over the front-end (if it wasn't already full). There is a perf event

idq.ms_uops
[Uops delivered to Instruction Decode Queue (IDQ) while Microcode
Sequenser (MS) is busy]

and 2 other events like that which count only uops coming from MITE (legacy decode) or uops coming from DSB (uop cache). Intel's description of those events is compatible with my description of how a microcoded instruction ("indirect uop") takes over the issue stage to read uops from the microcode sequencer / ROM while the rest of the front-end continues doing its thing delivering uops to the other end of the IDQ until it fills up.

Conditional jump instructions in MSROM procedures?

Microcode branches are apparently special.

Intel's P6 and SnB families do not support dynamic prediction for microcode branches, according to Andy Glew's description of original P6 (What setup does REP do?). Given the similar performance of SnB-family rep-string instructions, I assume this PPro fact applies to even the most recent Skylake / CoffeeLake CPUs1.

But there is a penalty for microcode branch misprediction, so they are statically(?) predicted. (This is why rep movsb startup cost goes in increments of 5 cycles for low/medium/high counts in ECX, and aligned vs. misaligned.)


A microcoded instruction takes a full line to itself in the uop cache. When it reaches the front of the IDQ, it takes over the issue/rename stage until it's done issuing microcode uops. (See also How are microcodes executed during an instruction cycle? for more detail, and some evidence from perf event descriptions like idq.dsb_uops that show the IDQ can be accepting new uops from the uop cache while the issue/rename stage is reading from the microcode-sequencer.)

For rep-string instructions, I think each iteration of the loop has to actually issue through the front-end, not just loop inside the back-end and reuse those uops. So this involves feedback from the OoO back-end to find out when the instruction is finished executing.

I don't know the details of what happens when issue/rename switches over to reading uops from the MS-ROM instead of the IDQ.

Even though each uop doesn't have its own RIP (being part of a single microcoded instruction), I'd guess that the branch mispredict detection mechanism works similarly to normal branches.

rep movs setup times on some CPUs seem to go in steps of 5 cycles depending on which case it is (small vs. large, alignment, etc). If these steps come from microcode branch mispredicts, that would appear to mean the mispredict penalty is a fixed number of cycles, unless that's just a special case of rep movs. Maybe the OoO back-end can keep up with the front-end? And reading from the MS-ROM shortens the front-end path even more than reading from the uop cache, making the miss penalty that low.

It would be interesting to run some experiments into how much OoO exec is possible around rep movsb, e.g. with two chains of dependent imul instructions, to see if it (partially) serializes them like lfence. We hope not, but to achieve ILP the later imul uops would have to issue without waiting for the back-end to drain.

I did some experiments here on Skylake (i7-6700k). Preliminary result: copy sizes of 95 bytes and less are cheap and hidden by the latency of the IMUL chains, but they do basically fully overlap. Copy sizes of 96 bytes or more drain the RS, serializing the two IMUL chains. It doesn't matter whether it's rep movsb with RCX=95 vs. 96 or rep movsd with RCX=23 vs. 24. See discussion in comments for some more summary of my findings; if I find time I'll post more details.

The "drains the RS" behaviour was measured with the rs_events.empty_end:u even becoming 1 per rep movsb instead of ~0.003. other_assists.any:u was zero, so it's not an "assist", or at least not counted as one.

Perhaps whatever uop is involved only detects a mispredict when reaching retirement, if microcode branches don't support fast recovery via the BoB? The 96 byte threshold is probably the cutoff for some alternate strategy. RCX=0 also drains the RS, presumably because it's also a special case.

Would be interesting to test with rep scas (which doesn't have fast-strings support, and is just slow and dumb microcode.)

Intel's 1994 Fast Strings patent describes the implementation in P6. It doesn't have an IDQ (so it makes sense that modern CPUs, which do have buffers between stages and a uop cache, will have some changes), but the mechanism they describe for avoiding branches is neat and maybe still used for modern ERMSB: the first n copy iterations are predicated uops for the back-end, so they can be issued unconditionally. There's also a uop that causes the back-end to send its ECX value to the microcode sequencer, which uses that to feed in exactly the right number of extra copy iterations after that. Just the copy uops (and maybe updates of ESI, EDI, and ECX, or maybe only doing that on an interrupt or exception), not microcode-branch uops.

This initial n uops vs. feeding in more after reading RCX could be the 96-byte threshold I was seeing; it came with an extra idq.ms_switches:u per rep movsb (up from 4 to 5).

https://eprint.iacr.org/2016/086.pdf suggests that microcode can trigger an assist in some cases, which might be the modern mechanism for larger copy sizes and would explain draining the RS (and apparently ROB), because it only triggers when the uop is committed (retired), so it's like a branch without fast-recovery.

"The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler."

The difference between this and the P6 patent is that this assist-request can happen after some non-microcode uops from later instructions have already been issued, in anticipation of the microcoded instruction being complete with only the first batch of uops. Or if it's not the last uop in a batch from microcode, it could be used like a branch for picking a different strategy.

But that's why it has to flush the ROB.

My impression of the P6 patent is that the feedback to the MS happens before issuing uops from later instructions, in time for more MS uops to be issued if needed. If I'm wrong, then maybe it's already the same mechanism still described in the 2016 paper.


Regarding the premise "usually, when a branch mispredicts as being taken, then when the instruction retires, ...":

Intel since Nehalem has had "fast recovery", starting recovery when a mispredicted branch executes, not waiting for it to reach retirement like an exception.

This is the point of having a Branch-Order-Buffer on top of the usual ROB retirement state that lets you roll back when any other type of unexpected event becomes non-speculative. (What exactly happens when a skylake CPU mispredicts a branch?)


Footnote 1: IceLake is supposed to have the "fast short rep" feature, which might be a different mechanism for handling rep strings, rather than a change to microcode. e.g. maybe a HW state machine like Andy mentions he wished he'd designed in the first place.

I don't have any info on performance characteristics, but once we know something we might be able to make some guesses about the new implementation.

How can I find the micro-ops which instructions on Intel's x86 CPUs decode to?

Agner Fog's PDF document on x86 instructions (linked off of the main page Hans cites) is the only reference I've found on instruction timings and micro-ops. I've never seen an Intel document on micro-op breakdown.

Intel JCC Erratum - should JCC really be treated separately?

Macro-fused jumps have to be mentioned separately because the whole cmp/jcc (or similar) pair is vulnerable to this slowdown if the cmp touches the boundary even when the jcc itself doesn't: the uop cache holds a single uop for both x86 machine instructions together, with the start address of the non-jump instruction.

If everyone only said "jumps", you'd expect that only the JCC / JMP / CALL / RET itself had to avoid touching a 32B boundary. So it's a good thing to highlight the interaction with macro-fusion.


This slowdown (for all jumps) is the result of a microcode mitigation / workaround for a hardware design flaw. Not being able to uop-cache jumps that touch a 32-byte boundary is not the original erratum; it's a side effect of the cure.

That original erratum description doesn't say anything about affecting only conditional branches. Even if only conditional branches were a real problem, perhaps the best way Intel could find to make it safe with a microcode update unfortunately affected all jumps.

For example, in Skylake-Xeon (SKX), the original erratum is documented as SKX102 in Intel's "spec update" errata list for that uarch:

SKX102. Processor May Behave Unpredictably on Complex Sequence of
Conditions Which Involve Branches That Cross 64 Byte Boundaries

Problem: Under complex micro-architectural conditions involving branch instructions bytes that
span multiple 64 byte boundaries (cross cache line), unpredictable system behavior
may occur.

Implication: When this erratum occurs, the system may behave unpredictably.

Workaround: It is possible for BIOS to contain a workaround for this erratum. [i.e. a microcode update]

Status: No fix.


I suspect the "JCC erratum" name caught on because most branches in "hot" code paths are conditional. Compilers can usually avoid putting unconditional taken branches in the fast path. So it's likely that people noticed the performance problem with JCC instructions first, and that name simply stuck even though it's not accurate.

BTW, 32-byte aligned routine does not fit the uops cache has a screenshot of the relevant diagram from the Intel PDF you linked, and some other links and details about performance effects.


