X86 Mul Instruction from VS 2008/2010

x86 MUL Instruction from VS 2008/2010

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src. i.e. a 32x32b => 64b full multiply (or 64x64b => 128b).

186 added an imul dest(reg), src(reg/mem), immediate form, and 386 added an imul r32, r/m32 form, both of which which only compute the lower half of the result. (According to NASM's appendix B, see also the x86 tag wiki)

When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)

In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.

As a second step, since C compilers use the multiple-operand form of imul a lot, Intel and AMD invest effort into making it as fast as possible. It only writes one output register, not e/rdx:e/rax, so it was possible for CPUs to optimize it more easily than the one-operand form. This makes imul even more attractive.

The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.

problem in understanding mul & imul instructions of Assembly language

Q1/Q2: The x86 instruction set maintains its 16-bit history. When doing a 16-bit multiply, the answer is stored in DX:AX. That's just the way it is, because that's how it was in 16-bit land.

Q3: The code you showed has a bug if you try to compute the square of a number larger than 2^16, because the code ignores the high 32 bits of the result stored in edx.

Q4: I think you may be misreading the table. 8-bit multiplications are stored in a 16-bit result; 16-bit multiplications are stored in a 32-bit result; 32-bit multiplications are stored in a 64-bit result. Which line are you referring to specifically?

GCC and the Multiply Instruction

The compiler relies on the "as-if" rule: No standard conforming program can detect a difference between what this program does and what the program should do, since the lowest 32 bits of the result are the same for both instructions.

Why are signed and unsigned multiplication different instructions on x86(-64)?

Addition and subtraction are the same, as is the low-half of a multiply. A full multiply, however, is not. Simple example:

In 32-bit twos-complement, -1 has the same representation as the unsigned quantity 2**32 - 1. However:

-1 * -1 = +1
(2**32 - 1) * (2**32 - 1) = (2**64 - 2**33 + 1)

(Note that the low 32-bits of both results are the same; that's what I mean when I say the "low-half of the multiply" is the same).

Why is imul used for multiplying unsigned numbers?

TL:DR: because it's a faster way of getting the correct result when we don't care about the high half (i.e. the output is only as wide as the 2 inputs). And more flexible register-allocation instead of forced use of RAX and RDX.

If it wasn't usable for this, Intel probably would have added two-operand versions of mul as well. But that wasn't necessary, as this answer explains.

WARNING This answer is long!

... and it's full of unneeded explanations - but I have always wanted to write something more lengthy about the multiplication.

A bit of theory

When multiplying two numbers a and b of length n the result is of length 2 n^† and, most importantly, the k-th digit only depends on the lowest k digits (a proof is given in Appendix A).

x86 `imul`'s two forms

The x86 multiplication instruction imul comes in two form: the full form and the partial form.

The first form is of the kind n×n→2 n, meaning that it produces a result twice the size of the operands - we know from the theory why this makes sense.

For example

imul ax         ;16x16->32, Result is dx:ax
imul rax        ;64x64->128, Result is rdx:rax

The second form is of the kind n×n→n, this necessarily cuts out some information.

Particularly, this form takes only the lower n bits of the result.

imul ax, ax          ;16x16->16, Lower WORD of the result is ax
imul rax, rax        ;64x64->64, Lower QWORD of the result is rax

Only the single-operand version is of the first form.

(There's also a 3-operand form, imul r64, r/m64, imm8/32, which allows you to copy-and-multiply by a constant in one instruction. It has no implicit operands and again doesn't write the high half anywhere so we can just treat it as equivalent to the imul r64, r/m64 dst *= src form.)

The two instructions: `imul` vs `mul`

Regardless of the form used, the processor always calculates the result with a size twice the operands' (i.e. like the first form).

In order to be able to do that, the operands are first converted from their size n to size 2 n (e.g. from 64 to 128 bits).

See Appendix B for more on this.

The multiplication is done and the full, or partial, result is stored in the destination.

The difference between imul and mul is in how the operands are converted.

Since the size is extended, this particular type of conversion is called extension.

The mul instruction simply fills the upper part with zeros - it zero extends.

The imul instructions replicate the high-order bit (the first from the left) - this is called sign extension and it has the interesting property of transforming a two's complement signed number of n bits into a signed number of 2 n bits with the same sign and modulus (i.e. it does the right thing, it is left to the reader to find a counter-example for the zero-extension case).

     How mul extends              How imul extends       
       an operand                   an operand

     +----+       +----+          +----+       +----+
     |0...|       |1...|          |0...|       |1...|
     +----+       +----+          +----+       +----+  

+----+----+  +----+----+     +----+----+  +----+----+
|0000|0...|  |0000|1...|     |0000|0...|  |1111|1...|
+----+----+  +----+----+     +----+----+  +----+----+

The thesis

The difference between imul and mul is noticeable only from the (n+1)-th bit onward.

For a 32-bit operand, it means that only the upper 32-bit part of the full result will eventually be different.

This is easy to see as the lower n bits are the same for both instructions and as we know from the theory the first n bits of the result only depend on the first n bits of the operands.

Thus the thesis: The result of the partial form of imul is identical to that of mul.

Then why does imul exist?

Original 8086 only had one-operand versions of mul and imul. Later versions of x86 added more flexible two and three operand versions of imul only, intended for the common use-case where you don't want the double-width result.

They only write one output register, which for modern x86 means they can decode to a single uop: https://agner.org/optimize/. (In modern x86 microarchitectures, each uop can write at most 1 register.) One-operand imul r32 is 3 uops on Intel CPUs: presumably one to multiply, another to split the 64-bit product into 2 halves and write the low half, and another to do the same for the high half. imul r64 is 2 uops; presumably the 128-bit result comes out of the multiplier already split in 64-bit halves.

mul still only exists in the very ancient one-operand form with fixed registers as part of the interface.

imul sets the flags according to a signed multiplication - CF and OF are set if the partial result has discarded any significant information (the technical condition being: the sign extension of the partial result is different from the full result) such as in case of overflow. This is also why the two and three operand forms are not called mul, which otherwise would have been a perfectly fit name.

The practice

To test all this in practice we can ask a compiler^[live] for the assembly of the following program

#include <stdint.h>

uint64_t foo(uint32_t a)
{
    return a*(uint64_t)a;
}

While we know that for 64-bit target the code generated uses imul because a unint64_t fits a register and thus a 64×64→64 multiplication is available as imul <reg64>, <reg64>

foo(unsigned int):
        mov     eax, edi        ;edi = a
        imul    rax, rax        ;64x64->64
        ret

in 32-bit code there is no such multiplication using imul.

An imul <reg32> or imul <reg32>, <reg32>, <reg32> is necessary but that would produce a full result! And a full signed result is not generally equal to a full unsigned result.

In fact, the compiler reverts back to mul:

foo(unsigned int):
        mov     eax, DWORD PTR [esp+4]
        mul     eax
        ret

Appendix A

Without loss of generality, we can assume base 2 and that the numbers are n + 1 bits long (so that the indices run from 0 to n) - then

c = a·b = ∑_i=0..n (a_i·2ⁱ) · ∑_j=0..n(b_j·2^j) =
∑_i=0..n [a_i·∑_j=0..n (b_j·2^i+j)] (by the distributive property)

We see that the k-th digit of the result is the sum of all the addends such that i + j = k plus an eventual carry

c_k = ∑_{i,j=0..n; i+j=k} a_i·b_j·2^i+j + C_k

The term C_k is the carry and, as it propagates towards higher bits, it depends only on the lower bits.

The second term cannot have a a_i or b_j with i or j > k as if the first were true then i = k + e, for a positive, non null, e and thus j = k - i = k - k -e = -e

But j cannot be negative!

The second case is similar and left to the reader.

Appendix B

As BeeOnRope pointed out in the comments the processor probably doesn't compute a full result if only the partial result is needed.

You probably means that this is only a way of thinking about it, conceptually. The processor does not necessarily do a full 128-bit multiplication when you use the 64x64 -> 64 form. Indeed, the truncated form takes only 1 uop on recent Intel, but the full form takes 2 uops, so some extra work is being done
Comment from BeeOnRope

Also, the sign extension is probably conceptually too.

Similarly the sign extension may happens "conceptually", but probably not in hardware. They won't have the extra wires and transistors just to do the sign or zero extension, which would add a lot of bulk to an already huge multiplier, but will use some other tricks to do the multiplication "as if" that had happened.
Comment from BeeOnRope

^† Binary numbers of length n are in the order of magnitude of 2ⁿ, thus the multiplication of two such numbers is in the order of magnitude 2ⁿ · 2ⁿ = 2ⁿ⁺ⁿ = 2^{2 n}. Just like a number of length 2 n.

signed and unsigned arithmetic implementation on x86

If you look at the various multiplication instructions of x86, looking only at 32bit variants and ignoring BMI2, you will find these:

imul r/m32 (32x32->64 signed multiply)
imul r32, r/m32 (32x32->32 multiply) *
imul r32, r/m32, imm (32x32->32 multiply) *
mul r/m32 (32x32->64 unsigned multiply)

Notice that only the "widening" multiply has an unsigned counterpart. The two forms in the middle, marked with an asterisk, are both signed and unsigned multiplication, because for the case where you don't get that extra "upper part", that's the same thing.

The "widening" multiplications have no direct equivalent in C, but compilers can (and often do) use those forms anyway.

For example, if you compile this:

uint32_t test(uint32_t a, uint32_t b)
{
    return a * b;
}

int32_t test(int32_t a, int32_t b)
{
    return a * b;
}

With GCC or some other relatively reasonable compiler, you'd get something like this:

test(unsigned int, unsigned int):
    mov eax, edi
    imul    eax, esi
    ret
test(int, int):
    mov eax, edi
    imul    eax, esi
    ret

(actual GCC output with -O1)

So signedness doesn't matter for multiplication (at least not for the kind of multiplication you use in C) and for some other operations, namely:

addition and subtraction
bitwise AND, OR, XOR, NOT
negation
left shift
comparing for equality

x86 doesn't offer separate signed/unsigned versions for those, because there's no difference anyway.

But for some operations there is a difference, for example:

division (idiv vs div)
remainder (also idiv vs div)
right shift (sar vs shr) (but beware of signed right shift in C)
comparing for bigger than / smaller than

But that last one is special, x86 doesn't have separate versions for signed and unsigned of this either, instead it has one operation (cmp, which is really just a nondestructive sub) that does both at once, and gives several results (multiple bits in "the flags" are affected). Later instructions that actually use those flags (branches, conditional moves, setcc) then choose which flags they care about. So for example,

cmp a, b
jg somewhere

Will go somewhere if a is "signed greater than" b.

cmp a, b
jb somewhere

Would go somewhere if a is "unsigned below" b.

See Assembly - JG/JNLE/JL/JNGE after CMP for more about the flags and branches.

This won't be a formal proof that signed and unsigned multiplication are the same, I'll just try to give you insight into why they should be the same.

Consider 4-bit 2's-complement integers. The weights their individual bits are, from lsb to msb, 1, 2, 4, and -8. When you multiply two of those numbers, you can decompose one of them into 4 parts corresponding to its bits, for example:

0011 (decompose this one to keep it interesting)
0010
---- *
0010 (from the bit with weight 1)
0100 (from the bit with weight 2, so shifted left 1)
---- +
0110

2 * 3 = 6 so everything checks out. That's just regular long multiplication that most people learn in school, only binary, which makes it a lot easier since you don't have to multiply by a decimal digit, you only have to multiply by 0 or 1, and shift.

Anyway, now take a negative number. The weight of the sign bit is -8, so at one point you will make a partial product -8 * something. A multiplication by 8 is shifting left by 3, so the former lsb is now the msb, and all other bits are 0. Now if you negate that (it was -8 after all, not 8), nothing happens. Zero is obviously unchanged, but so is 8, and in general the number with only the msb set:

-1000 = ~1000 + 1 = 0111 + 1 = 1000

So you've done the same thing you would have done if the weight of the msb was 8 (as in the unsigned case) instead of -8.

x86 imul opcode with two 32-bit registers

The x86 tag wiki contains a list of resources that you can use to look up information on various instructions. For example, searching for IMUL on http://www.felixcloutier.com/x86/ leads you to http://www.felixcloutier.com/x86/IMUL.html.

So it depends, how your assembler will compile it, if two operand variant is used, then result is truncated, if single operand (IMUL ebx) is used, then result is in edx:eax.

About flags: probably debug to be sure, but the two operand version, if I'm reading it correctly will set CF/OF when the result is truncated, so in your case CF=1, OF=0? (unsigned trunc, signed not overflow).

After 3h later...
It occurred to me, that I actually should search again for some nice linux debugger, like "TD" from Borland in DOS era, because it may have be handy to have one.

There's of course the gdb always around, but I can't get to like it (especially as it usually goes with AT&T syntax, which I personally dislike).

So I tried "ddd" (Data Display Debugger), directly available in Ubuntu repositories, but it looks a bit old (yeah, MOTIF, if you know, what I mean, if you never heard MOTIF, then you are not old enough). And it's gdb based... Probably worth of mention here, for people who want to go in GNU path, but I searched further.

And I ended up downloading and compiled edb-debugger. After few tries I got it to compile on my older ubuntu, run it .. and... that's IT. Just few clicks in preferences, and "TD" feel is here.

So, I tried your instructions (had to compile first some helloworld.asm from NASM examples to have some test linux binary to run, then you can directly overwrite instructions in debugger for these short questions).

After IMUL eax,ebx the changes to CPU state are:

eax = 0xe0000001
CF = 1, OF = 1, SF = 1, ZF = 0, PF = 0, AF = 0, ...

After IMUL ebx the changes to CPU state are:

eax = 0xe0000001
edx = 0x00ffffff
CF = 1, OF = 1, SF = 1, ZF = 0, PF = 0, AF = 0, ...

(so in this case flags are the same).

I strongly suggest to you to get some nice debugger too, to stop asking questions like these. If it wouldn't be along my own interest to get one debugger running, I wouldn't bother to answer.

Is there any restriction on the maximum limit in usage of stack size?

I think you're hitting the catch in form of the guard page.

To not waste actual memory until it's actually used, Windows initially reserves the full stack space (1MB default; can be changed by editing PE headers) but commits only two pages, and makes the second one a guard page. A guard page is a page (4KB) of memory which triggers a special exception (STATUS_GUARD_PAGE_VIOLATION) on any access to it. When the kernel detects a guard page exception, it commits the touched page and adds another guard page after it. This way, if your functions push small variables on the stack, it keeps growing "by itself".

However, there is a problem if you try to allocate a local variable which is over 4K (4096 bytes) in size. Normally, stack allocation is done by simply subtracting from the ESP. If you subtract more than 4K from it and then try writing to the stack, there is a chance that you shoot over the guard page and will access the reserved memory after it. This one won't be caught by the kernel but will be passed on to your program, and usually results in a crash.

The solution is simple - do stack allocations in chunks of 4K (=4096=0x1000 bytes) and touch the stack after each one to trigger the guard page. The MSVC compiler does it automatically by invoking the __chkstk() function at the start of the functions which use more than 4K of local variables. Here's the listing of the function from CRT sources:

;***
;_chkstk - check stack upon procedure entry
;
;Purpose:
;       Provide stack checking on procedure entry. Method is to simply probe
;       each page of memory required for the stack in descending order. This
;       causes the necessary pages of memory to be allocated via the guard
;       page scheme, if possible. In the event of failure, the OS raises the
;       _XCPT_UNABLE_TO_GROW_STACK exception.
;
;       NOTE:  Currently, the (EAX < _PAGESIZE_) code path falls through
;       to the "lastpage" label of the (EAX >= _PAGESIZE_) code path.  This
;       is small; a minor speed optimization would be to special case
;       this up top.  This would avoid the painful save/restore of
;       ecx and would shorten the code path by 4-6 instructions.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       ESP = new stackframe, if successful
;
;Uses:
;       EAX
;
;Exceptions:
;       _XCPT_GUARD_PAGE_VIOLATION - May be raised on a page probe. NEVER TRAP
;                                    THIS!!!! It is used by the OS to grow the
;                                    stack on demand.
;       _XCPT_UNABLE_TO_GROW_STACK - The stack cannot be grown. More precisely,
;                                    the attempt by the OS memory manager to
;                                    allocate another guard page in response
;                                    to a _XCPT_GUARD_PAGE_VIOLATION has
;                                    failed.
;
;*******************************************************************************

public  _alloca_probe

_chkstk proc

_alloca_probe    =  _chkstk

        push    ecx

; Calculate new TOS.

        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS

; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.

        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; ~0 if TOS did not wrapped around, 0 otherwise
        and     ecx, eax                ; set to 0 if wraparound

        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp     ecx, eax                ; Is new TOS
        jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; get return address
        mov     dword ptr [esp], eax    ; and put it at new TOS
        ret

; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10

_chkstk endp

In your case you probably don't need this complex logic, something like this will do:

    xor eax, eax
    mov ecx, 40     ; alloc 40 pages
l1:
    sub esp, 1000h  ; move esp one page
    mov [esp], eax  ; touch the guard page
    loop l1         ; keep looping
    sub esp, xxxh   ; alloc the remaining variables

See here for more details about stacks and guard pages.

X86 Mul Instruction from VS 2008/2010