Operand Generation of Call Instruction on X86-64 Amd

E8 is the opcode for "Call Relative" (call rel32), meaning the destination address is computed by adding the signed 32-bit operand to the address of the next instruction. The operand is 0xFFFFFFAE, which is -0x52 as a signed 32-bit value. 0x8048406 - 0x52 is 0x80483b4.

Most disassemblers helpfully calculate the actual target address rather than just give you the relative address in the operand.
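As a rough sketch of the calculation a disassembler does, the Python below decodes the rel32 operand and adds it to the address of the next instruction. The function name call_rel32_target is made up for illustration, and the call is assumed to sit at 0x8048401, so that the next instruction is at 0x8048406 and the arithmetic lands on 0x80483b4.

```python
import struct

def call_rel32_target(call_addr, insn_bytes, addr_bits=32):
    """Return the target of an E8 (call rel32) instruction located at call_addr.

    Hypothetical helper for illustration only; addr_bits is the address width
    the result wraps to (32 here, 64 for 64-bit mode).
    """
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 rel32 call")
    # The operand is a little-endian signed 32-bit displacement.
    (rel32,) = struct.unpack("<i", insn_bytes[1:5])
    next_insn = call_addr + 5  # displacement is relative to the next instruction
    return (next_insn + rel32) & ((1 << addr_bits) - 1)

# Assumed layout: the call starts at 0x8048401.
target = call_rel32_target(0x8048401, bytes([0xE8, 0xAE, 0xFF, 0xFF, 0xFF]))
print(hex(target))  # 0x80483b4
```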

Complete info for x86 ISA at: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html

Why is default operand size 32 bits in 64 mode?

TL:DR: you have 2 separate questions. 1 about C type sizes, and another about how x86-64 machine code encodes 32 vs. 64-bit operand-size. The encoding choice is fairly arbitrary and could have been made differently. But int is 32-bit because that's what compiler devs chose, nothing to do with machine code.

int is 32-bit because that's still a useful size to use. It uses half the memory bandwidth / cache footprint of int64_t. Most C implementations for 64-bit ISAs have 32-bit int, including both mainstream ABIs for x86-64 (x86-64 System V and Windows). On Windows, even long is a 32-bit type, presumably for source compatibility with code written for 32-bit that made assumptions about type sizes.

Also, AMD's integer multiplier at the time was somewhat faster for 32-bit than 64-bit, and this was the case until Ryzen. (First-gen AMD64 silicon was AMD's K8 microarchitecture; see https://agner.org/optimize/ for instruction tables.)

The advantages of using 32bit registers/instructions in x86-64

x86-64 was designed by AMD in ~2000, as AMD64. Intel was committed to Itanium and not involved; all the design decisions for x86-64 were made by AMD architects.

AMD64 is designed with implicit zero-extension when writing a 32-bit register, so 32-bit operand-size can be used efficiently with none of the partial-register shenanigans you get with 8 and 16-bit operand-size.
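To make the difference concrete, here is a small Python model of what a 64-bit register holds after a write of each width. It's only a sketch: write_reg is a made-up name, and the 8-bit case models the low byte (like AL), not the high-byte registers (AH etc.).

```python
MASK64 = (1 << 64) - 1

def write_reg(old64, value, size_bits):
    """Model the full 64-bit register contents after writing its low size_bits bits.

    Hypothetical helper; 8-bit writes model the low-byte registers only.
    """
    if size_bits == 64:
        return value & MASK64
    if size_bits == 32:
        # A 32-bit write zero-extends into the whole 64-bit register:
        # the result does not depend on the old value at all.
        return value & 0xFFFFFFFF
    if size_bits in (8, 16):
        low_mask = (1 << size_bits) - 1
        # 8/16-bit writes merge into the old value, so the result
        # depends on whatever was in the register before.
        return (old64 & ~low_mask & MASK64) | (value & low_mask)
    raise ValueError("size_bits must be 8, 16, 32 or 64")

old = 0x1122334455667788
print(hex(write_reg(old, 0xAABBCCDD, 32)))  # 0xaabbccdd
print(hex(write_reg(old, 0xAABBCCDD, 16)))  # 0x112233445566ccdd
```

The dependency on the old value in the 8/16-bit case is what leads to the partial-register handling the hardware has to do.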

TL:DR: There's good reason for CPUs to want to make 32-bit operand-size available somehow, and for C type systems to have an easily accessible 32-bit type. Using int for that is natural.

If you want 64-bit operand-size, use it. (And then describe it to a C compiler as long long or [u]int64_t, if you're writing C declarations for your asm globals or function prototypes). Nothing's stopping you (except for somewhat larger code size from needing REX prefixes where you might not have before).

All of that is a totally separate question from how x86-64 machine code encodes 32-bit operand-size.

AMD chose to make 32-bit the default and 64-bit operand-size require a REX prefix.

They could have gone the other way and made 64-bit operand-size the default, requiring REX.W=0 to set it to 32, or 0x66 operand-size to set it to 16. That might have led to smaller machine code for code that mostly manipulates things that have to be 64-bit anyway (usually pointers), if it didn't need r8..r15.

A REX prefix is also required to use r8..r15 at all (even as part of an addressing mode), so code that needs lots of registers often finds itself using a REX prefix on most instructions anyway, even when using the default operand-size.
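A sketch of how those prefix rules play out for one instruction: the Python below encodes the register-register form of add (opcode 01 /r, ADD r/m, r) for 64-bit mode. The function name encode_add_rr and the use of raw register numbers (0 = rax/eax, 1 = rcx/ecx, 8 = r8, ...) are choices made for illustration, not any assembler's API.

```python
def encode_add_rr(dst, src, size_bits=32):
    """Encode `add dst, src` (opcode 01 /r) for register numbers 0..15 in 64-bit mode.

    Hypothetical helper for illustration; only the reg-reg form is handled.
    """
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("register numbers must be 0..15")
    if size_bits not in (16, 32, 64):
        raise ValueError("size_bits must be 16, 32 or 64")
    prefix = bytearray()
    if size_bits == 16:
        prefix.append(0x66)          # operand-size prefix goes before any REX
    w = 1 if size_bits == 64 else 0  # REX.W selects 64-bit operand-size
    r = src >> 3                     # REX.R extends the ModRM.reg field
    b = dst >> 3                     # REX.B extends the ModRM.rm field
    if w or r or b:
        prefix.append(0x40 | (w << 3) | (r << 2) | b)
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)  # mod=11: register direct
    return bytes(prefix) + bytes([0x01, modrm])

print(encode_add_rr(0, 1, 32).hex())  # 01c8    add eax, ecx
print(encode_add_rr(0, 1, 64).hex())  # 4801c8  add rax, rcx
print(encode_add_rr(8, 1, 32).hex())  # 4101c8  add r8d, ecx
```

Note that the 32-bit add with r8d is the same length as the 64-bit add with rax: the REX byte is needed either way.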

A lot of code does use int for a lot of stuff, so 32-bit operand-size is not rare. And as noted above, it's sometimes faster. So it kind of makes sense to make the fastest instructions the most compact (if you avoid r8d..r15d).

It also maybe lets the decoder hardware be simpler if the same opcode decodes the same way with no prefixes in 32 and 64-bit mode. I think this was AMD's real motivation for this design choice. They certainly could have cleaned up a lot of x86 warts but chose not to, probably also to keep decoding more similar to 32-bit mode.

It might be interesting to see if you'd save overall code size for a version of x86-64 with a default operand-size of 64-bit. e.g. tweak a compiler and compile some existing codebases. You'd want to teach its optimizer to favour the legacy registers RAX..RDI for 64-bit operands instead of 32-bit, though, to try to minimize the number of instructions that need REX prefixes.

(Many instructions like add or imul reg,reg can safely be used at 64-bit operand-size even if you only care about the low 32, although the high garbage will affect the FLAGS result.)
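A quick way to convince yourself of that is to model add at both widths in Python. This is only a sketch (add32 and add64 are made-up names, and only ZF is modelled out of all the FLAGS bits): the low 32 bits of the result agree, but the zero flag does not.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def add32(a, b):
    """Model a 32-bit add: returns (result, zero_flag)."""
    res = (a + b) & MASK32
    return res, res == 0

def add64(a, b):
    """Model a 64-bit add: returns (result, zero_flag)."""
    res = (a + b) & MASK64
    return res, res == 0

a = 0xDEADBEEFFFFFFFFF  # high half is garbage we don't care about
b = 0x0000000000000001

res32, zf32 = add32(a & MASK32, b & MASK32)
res64, zf64 = add64(a, b)

print(hex(res32), zf32)           # 0x0 True
print(hex(res64 & MASK32), zf64)  # 0x0 False
```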

Re: misinformation in comments: compat with 32-bit machine code has nothing to do with this. 64-bit mode is not binary compatible with existing 32-bit machine code; that's why x86-64 introduced a new mode. 64-bit kernels run 32-bit binaries in compat mode, where decoding works exactly like 32-bit protected mode.

https://en.wikipedia.org/wiki/X86-64#OPMODES has a useful table of modes, including long mode (and 64-bit vs. 32 and 16-bit compat modes) vs. legacy mode (if you boot a kernel that's not x86-64 aware).

In 64-bit mode some opcodes are different, and operand-size defaults to 64-bit for push/pop and other stack instruction opcodes.

32-bit machine code would decode incorrectly in that mode. e.g. 0x40 is inc eax in compat mode but a REX prefix in 64-bit mode. See "x86-32 / x86-64 polyglot machine-code fragment that detects 64bit mode at run-time?" for an example.
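The 0x40..0x4F range can be sketched as a tiny mode-dependent decoder in Python. The name describe_4x_byte is hypothetical, and the 32-bit case assumes the default 32-bit operand-size (no 0x66 prefix).

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def describe_4x_byte(byte, mode_bits):
    """Describe how a byte in 0x40..0x4F decodes in 32-bit vs. 64-bit mode.

    Hypothetical helper; assumes default operand-size in 32-bit mode.
    """
    if not 0x40 <= byte <= 0x4F:
        raise ValueError("only bytes 0x40..0x4F are handled")
    if mode_bits == 64:
        # In 64-bit mode the low 4 bits are the REX.W/R/X/B flags.
        w, r, x, b = (byte >> 3) & 1, (byte >> 2) & 1, (byte >> 1) & 1, byte & 1
        return f"REX prefix W={w} R={r} X={x} B={b}"
    if mode_bits == 32:
        # In 32-bit mode these are the one-byte inc/dec reg encodings.
        op = "inc" if byte < 0x48 else "dec"
        return f"{op} {REG32[byte & 7]}"
    raise ValueError("mode_bits must be 32 or 64")

print(describe_4x_byte(0x40, 32))  # inc eax
print(describe_4x_byte(0x40, 64))  # REX prefix W=0 R=0 X=0 B=0
print(describe_4x_byte(0x48, 64))  # REX prefix W=1 R=0 X=0 B=0
```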

Also

x86 32 bit opcodes that differ in x86-x64 or entirely removed
Assembly: why some x86 opcodes are invalid in x64?

64-bit mode decoding mostly the same way as 32-bit mode is a matter of sharing transistors in the decoders, not binary compatibility. Presumably it's easier for the decoders to only have 2 mode-dependent default operand sizes (16 or 32-bit) for opcodes like 03 add r, r/m, not 3. Only special-casing for opcodes like push/pop that warrant it. (Also note that REX.W=0 does not let you encode push r32; the operand-size stays at 64-bit.)

AMD's design decisions seem to have been focused on sharing decoder transistors as much as possible, perhaps in case AMD64 didn't catch on and they were stuck supporting it without people using it.

They could have done lots of subtle things that removed annoying legacy quirks of x86, for example made setcc a 32-bit operand-size instruction in 64-bit mode to avoid needing xor-zeroing first. Or CISC annoyances like flags staying unchanged after zero-count shifts (although AMD CPUs handle that more efficiently than Intel, so maybe they intentionally left that in.)

Or maybe they thought that subtle tweaks could hurt asm source porting, or in the short term make it harder to get compiler back-ends to support 64-bit code-gen.

In assembly, how to add integers without destroying either operand?

Only a few specific GPR instructions have VEX encodings, primarily the BMI1/BMI2 instructions that were added after AVX already existed. See the list in Table 2-28, which has ANDN, BEXTR, BLSI, BLSMSK, BLSR, BZHI, MULX, PDEP, PEXT, RORX, SARX, SHLX, SHRX, as well as the same list in 5.1.16.1. For example, andn's manual entry lists only a VEX encoding, and's manual entry doesn't list any.

So Intel (unfortunately) didn't introduce a brand new three-operand alternate encoding for the entire instruction set. They just introduced a few specific instructions that take three operands and use VEX for it. In some cases these have similar or equivalent functionality to an existing instruction, e.g. SHLX for SHL with a variable count, and so effectively provide a three-operand version of the previous two-operand instruction, but only in those special cases. There are not equivalent instructions across the board.

The "old style" two-operand form remains the only version of the add instruction. However, as fuz points out in comments, lea can be a good way to add two registers and write the result to a third, subject to some restrictions on operand size.

See "Using LEA on values that aren't addresses / pointers?" for more general things LEA can do, like copy-and-add a constant to a register, or shift-and-add. Compilers already know this and will use lea where appropriate, any time it saves instructions. (Or with some tune options like -mtune=atom for old in-order Atom, will use lea even when they could have used add.)
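As a rough model of what lea computes (not how you'd write it in asm), the Python below evaluates base + index*scale + disp and truncates to the operand size, leaving both inputs untouched. The function name lea and its keyword arguments are made up for this sketch.

```python
def lea(base=0, index=0, scale=1, disp=0, size_bits=64):
    """Model the value lea writes to its destination: base + index*scale + disp.

    Hypothetical helper; the result is truncated to size_bits and the
    inputs are not modified, which is the point of using lea as an add.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & ((1 << size_bits) - 1)

a, b = 1000, 234
c = lea(base=a, index=b)                 # like lea rcx, [rax + rbx]: c = a + b
print(c, a, b)                           # 1234 1000 234

print(lea(base=a, disp=7))               # copy-and-add a constant: 1007
print(lea(base=a, index=a, scale=4))     # shift-and-add, a*5: 5000
print(lea(base=0xFFFFFFFF, index=1, size_bits=32))  # wraps at 32 bits: 0
```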

If more flexible encodings of common integer instructions other than add existed, like and/xor/sub, gcc -O3 -march=skylake would already be using them in its own asm output, without needing inline asm. Or if alternative instructions could get the job done, like lea for add, it would be doing that, so it makes sense to look at compiler output to see what tricks it knows. Trying it yourself would make more sense as something to play around with in a stand-alone .s file that just makes an exit system call, or just to single-step, removing the complexity of using inline asm. (GAS by default doesn't restrict instruction-sets. gcc -march=skylake doesn't pass that on to the assembler, as.)

In your inline asm, your c operand should be output-only: =r instead of +r. The old value is overwritten, so there's no need to tell the compiler to produce it as an input. (Like you said, you want c = a+b not c += a+b.)

Using a single lea as the asm template means you don't need a =&r early-clobber output, because your asm will read all its inputs before writing that output. In your case, having it as an input/output was probably stopping the compiler from choosing the same register as one of the inputs, which could have broken with mov; add.

CS:APP example uses idivq with two operands?

That's a mistake. Only imul has immediate and 2-register forms.

mul, div, or idiv still only exist in the one-operand form introduced with 8086, using RDX:RAX as the implicit double-width operand for output (and input for division).

Or EDX:EAX, DX:AX, or AH:AL, depending on operand-size of course. Consult an ISA reference like Intel's manual, not this book! https://www.felixcloutier.com/x86/idiv

Also see "When and why do we sign extend and use cdq with mul/div?" and "Why should EDX be 0 before using the DIV instruction?"

x86-64's only hardware division instructions are idiv and div. 64-bit mode removed aam, which does 8-bit division by an immediate. (Dividing in Assembler x86 and Displaying Time in Assembly has an example of using aam in 16-bit mode).

Of course for division by constants idiv and div (and aam) are very inefficient. Use shifts for powers of 2, or a multiplicative inverse otherwise, unless you're optimizing for code-size instead of performance.

CS:APP 3e Global Edition apparently has multiple serious x86-64 instruction-set mistakes like this in practice problems, claiming that GCC emits impossible instructions. Not just typos or subtle mistakes, but misleading nonsense that's very obviously wrong to people familiar with the x86-64 instruction set. It's not just a syntax mistake, it's trying to use instructions that aren't encodeable (no syntax can exist to express them, other than a macro that expands to multiple instructions. Defining idivq as a pseudo-instruction using a macro would be pretty weird).

e.g. "I correctly guessed missing part of a function, but gcc generated assembly code doesn't match the answer" is another one, where it suggests that (%rbx, %rdi, %rsi) and (%rsi, %rsi, 9) are valid addressing modes! The scale factor is actually a 2-bit shift count so these are total garbage and a sign of a serious lack of knowledge by the authors about the ISA they're teaching, not a typo.

Their code won't assemble with any AT&T syntax assembler.
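To see why a scale of 9 can't be encoded, here is a sketch of building the SIB byte in Python. The name sib_byte is made up, register numbers are limited to the low 3 bits (no REX.X / REX.B extension), and special cases such as index number 4 meaning "no index" are ignored.

```python
# The SIB.scale field is 2 bits: a shift count of 0..3, i.e. a factor of 1, 2, 4 or 8.
SCALE_TO_SS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}

def sib_byte(scale, index, base):
    """Build a SIB byte from a scale factor and 3-bit index/base register numbers.

    Hypothetical helper for illustration; special cases are not handled.
    """
    if scale not in SCALE_TO_SS:
        raise ValueError(f"scale {scale} is not encodable (only 1, 2, 4, 8)")
    if not (0 <= index <= 7 and 0 <= base <= 7):
        raise ValueError("register numbers must be 0..7 in this sketch")
    return (SCALE_TO_SS[scale] << 6) | (index << 3) | base

RSI = 6
print(hex(sib_byte(8, RSI, RSI)))  # (%rsi,%rsi,8) -> 0xf6

try:
    sib_byte(9, RSI, RSI)          # (%rsi,%rsi,9) from the practice problem
except ValueError as e:
    print("error:", e)
```

The other addressing mode, (%rbx, %rdi, %rsi), has a register where the scale factor should be, which has no encoding either.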

Also "What does this x86-64 addq instruction mean, which only have one operand? (From CSAPP book 3rd Edition)" is another example, where they have a nonsensical addq %eax instead of inc %rdx, and a mismatched operand-size in a mov store.

It seems that they're just making stuff up and claiming it was emitted by GCC. IDK if they start with real GCC output and edit it into what they think is a better example, or actually write it by hand from scratch without testing it.

GCC's actual output would have used multiplication by a magic constant (fixed-point multiplicative inverse) to divide by 9 (even at -O0, but this is clearly not debug-mode code. They could have used -Os).
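For a feel of what that magic constant looks like, here is a Python sketch of the unsigned version of the trick: pick a multiplier m and shift count so that (x * m) >> shift equals x // d for every 32-bit x. The function names are made up, and this is the simpler unsigned case, not the exact signed sequence GCC emits for a signed division.

```python
def magic_for_unsigned_div(d, bits=32):
    """Find (m, shift) with (x * m) >> shift == x // d for all 0 <= x < 2**bits.

    Hypothetical helper; uses the smallest extra shift s for which
    m = ceil(2**(bits+s) / d) overshoots by at most 2**s.
    """
    if d < 1:
        raise ValueError("divisor must be positive")
    for s in range(bits + 1):
        shift = bits + s
        m = -(-(1 << shift) // d)            # ceil(2**shift / d)
        if m * d - (1 << shift) <= (1 << s):
            return m, shift
    raise AssertionError("unreachable for valid d")

def div_by_const(x, d, bits=32):
    """Divide an unsigned bits-wide x by constant d using multiply + shift."""
    m, shift = magic_for_unsigned_div(d, bits)
    return (x * m) >> shift

m, shift = magic_for_unsigned_div(9)
print(hex(m), shift)                  # 0x38e38e39 33
print(div_by_const(0xFFFFFFFF, 9))    # 477218588
```

For some divisors (7, for example) the multiplier found this way needs 33 bits, so real code-gen needs an extra fixup step that this sketch glosses over by using Python's arbitrary-precision integers.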

Presumably they didn't want to talk about "Why does GCC use multiplication by a strange number in implementing integer division?" and replaced that block of code with their made-up instruction. From context you can probably figure out where they expect the output to go; perhaps they mean rcx /= 9.

These errors are from 3rd-party practice problems in the Global Edition

From the authors' web site (https://csapp.cs.cmu.edu/3e/errata.html)

Note on the Global Edition: Unfortunately, the publisher arranged for the generation of a different set of practice and homework problems in the global edition. The person doing this didn't do a very good job, and so these problems and their solutions have many errors. We have not created an errata for this edition.

So CS:APP 3e is probably a good textbook, as long as you get the North American edition, or ignore the practice / homework problems. This explains the huge disconnect between the textbook's reputation and wide use vs. the serious and obvious (to people familiar with x86-64 asm) errors like this one that go beyond sloppy into don't-know-the-language territory.

How a hypothetical `idiv reg, reg` or `idiv $imm, reg` would be designed

Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.

If Intel or AMD had introduced new convenient forms for div or idiv, they would have designed them to use a single-width dividend because that's how compilers always use it.

Most languages are like C and implicitly promote both operands for + - * / to the same type and produce a result of that width. Of course if the inputs are known to be narrow that can be optimized away. (e.g. using one imul r32 to implement a * (int64_t)b).

But div and idiv fault if the quotient overflows so it's not safe to use a single 32-bit idiv when compiling int32_t q = (int64_t)a / (int32_t)b.

Compilers always use xor edx,edx before DIV or cdq or cqo before IDIV to actually do n / n => n-bit division.
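A Python model of 32-bit idiv may help show both points: the cdq setup for ordinary n / n division, and the fault when the quotient doesn't fit. This is a sketch under simplifying assumptions; the names cdq, idiv32 and to_signed are made up, and the #DE exception is represented by Python exceptions.

```python
MASK32 = (1 << 32) - 1

def to_signed(v, bits):
    """Interpret the low `bits` bits of v as a two's complement integer."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v

def cdq(eax):
    """Return the EDX value cdq produces: the sign bit of EAX in every bit."""
    return MASK32 if eax & 0x80000000 else 0

def idiv32(edx, eax, divisor):
    """Model idiv r/m32: divide EDX:EAX by divisor, return (quotient, remainder).

    Hypothetical model; raises where the real instruction would raise #DE.
    """
    dividend = to_signed(((edx & MASK32) << 32) | (eax & MASK32), 64)
    d = to_signed(divisor, 32)
    if d == 0:
        raise ZeroDivisionError("#DE: divide by zero")
    q = abs(dividend) // abs(d)          # idiv truncates toward zero
    if (dividend < 0) != (d < 0):
        q = -q
    r = dividend - q * d                 # remainder has the sign of the dividend
    if not -(1 << 31) <= q < (1 << 31):
        raise OverflowError("#DE: quotient does not fit in 32 bits")
    return q & MASK32, r & MASK32

eax = -7 & MASK32
q, r = idiv32(cdq(eax), eax, 2)          # -7 / 2 with a sign-extended dividend
print(to_signed(q, 32), to_signed(r, 32))  # -3 -1
```

Calling idiv32(1, 0, 1) models a dividend of 2**32 divided by 1: the quotient needs 33 bits, so the model raises, like the real instruction faulting.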

Real full-width division using a dividend that isn't just zero- or sign-extended is only done by hand with intrinsics or asm (because gcc/clang and other compilers don't know when the optimization is safe), or in gcc helper functions that do e.g. 64-bit / 64-bit division in 32-bit code. (Or 128-bit division in 64-bit code).

So what would be most helpful is a div/idiv that avoids the extra instruction to set up RDX, too, as well as minimizing the number of implicit register operands. (Like imul r32, r/m32 and imul r32, r/m32, imm do: making the common case of non-widening multiplication more convenient with no implicit registers. That's Intel-syntax like the manuals, destination first)

The simplest way would be a 2-operand instruction that did dst /= src. Or maybe replaced both operands with quotient and remainder. Using a VEX encoding for 3 operands like BMI1 andn, you could maybe have

idivx remainder_dst, dividend, divisor. With the 2nd operand also an output for the quotient. Or you could have the remainder written to RDX with a non-destructive destination for the quotient.

Or more likely to optimize for the simple case where only the quotient is needed, idivx quot, dividend, divisor and not store the remainder anywhere. You can always use regular idiv when you want the remainder.

BMI2 mulx uses an implicit rdx input operand because its purpose is to allow multiple dep chains of add-with-carry for extended-precision multiply. So it still has to produce 2 outputs. But this hypothetical new form of idiv would exist to save code-size and uops around normal uses of idiv that aren't widening. So 386 imul reg, reg/mem is the point of comparison, not BMI2 mulx.

IDK if it would make sense to introduce an immediate form of idivx as well; you'd only use it for code-size reasons. Multiplicative inverses are more efficient for division by constants so there's very little real-world use-case for such an instruction.
