Why Does Printf Overwrite the Ecx Register

Why is the value of EDX overwritten when making a call to printf?

According to the x86 ABI, EBX, ESI, EDI, and EBP are callee-save registers and EAX, ECX and EDX are caller-save registers.

This means that functions can freely use and destroy the previous values of EAX, ECX, and EDX.
For that reason, save the values of EAX, ECX, and EDX before calling a function if you don't want them to change. That is what "caller-save" means.

Or better, use other registers for values that you're still going to need after a function call. A push/pop of EBX at the start/end of a function is much better than a push/pop of EDX inside a loop that makes a function call. When possible, use call-clobbered registers for temporaries that aren't needed after the call. Values that are already in memory, so they don't need to be written before being re-read, are also cheaper to spill.

Since EBX, ESI, EDI, and EBP are callee-save registers, a function has to restore the original value of any of them that it modifies before returning.

ESP is also callee-saved, but you can't mess this up unless you copy the return address somewhere.

x86: Why does one stack-allocated array overwrite the other?

In several places there is code like this:

lea rsi, [names_ptr+rbx*8]

or

mov rax, [grades_ptr+rbx*8]

That doesn't indirect through a pointer in memory like you want. What it does is index relative to the address of the variable names_ptr, rather than to the pointer stored in that variable.

To fix this, you have to load the pointer into a register and then do the index operation. So you could replace the first one with something like:

mov rsi, [names_ptr]
lea rsi, [rsi+rbx*8]

Even better would be to take advantage of the registers available. Put names_ptr in r14 and grades_ptr in r15. Then you can use lea rsi, [r14+rbx*8] without an additional load each time.

Be sure to push r14 and r15 (and also rbx) at the beginning of the function. It's not strictly necessary since you don't return from the function, but it's a good habit.
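The difference between indexing from the variable's address and indexing from the pointer stored in it can be sketched in C. This is only an illustration: the array contents and the function names are hypothetical, names_ptr stands in for the variable from the question, and the address arithmetic assumes a flat address space where pointers convert to uintptr_t in the usual way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the question's data. */
static const char *names[3] = { "ann", "bob", "cy" };
static const char **names_ptr = names;   /* a variable that holds a pointer */

/* Correct: read the pointer stored in names_ptr, then index from it.
   Mirrors: mov rsi, [names_ptr]  /  lea rsi, [rsi+rbx*8] */
uintptr_t addr_via_pointer(size_t i)
{
    return (uintptr_t)(names_ptr + i);
}

/* Buggy: index from the address of the variable itself.
   Mirrors: lea rsi, [names_ptr+rbx*8]
   Only the address is computed here; it is never dereferenced. */
uintptr_t addr_via_variable_address(size_t i)
{
    return (uintptr_t)&names_ptr + i * sizeof names_ptr;
}
```

The first function yields the address of names[i]; the second yields an address i pointer-widths past the variable names_ptr, which is unrelated to the array.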

Why does %rdi still have the right value, after I clobber it and call printf?

You didn't specify which x86-64 ABI you're using, but from your use of %rdi / %rsi for arg passing, I'll assume you're targeting the SysV ABI (everything except Windows). See the x86 wiki for links to the docs.

... clobbering of return values from first two sidefx() calls... In order to fix this problem I need to store the return values either somewhere in the stack or maybe in callee-saved registers.

That's correct. gcc prefers using call-preserved regs, because then you don't have to fiddle with the stack alignment when pushing or popping between calls.

Why is the final printf returning one: 1, two: 2147483641, three: 3? Shouldn't the first number printed also be mangled like what happened to the second number due to the succeeding sidefx calls?

It's just a coincidence that %rdi=1 when you call threeargs(). If you single-step your code, you'd probably find it happens to have that value when printf returns. It's not from saving/restoring, since the original value is destroyed by movq $.string1, %rdi before the call to printf. It just happens that 1 is a common thing to find in a register.

Best guess: 1 is the file-descriptor arg to the write(2) system call, which is the last thing printf needed to do before returning (because stdout is line-buffered when it's a terminal).

Your C doesn't match your implementation. In the asm, global_a is 8 bytes, but in C you're treating it as a 4-byte integer (printing with %d, not %ld). Your C doesn't declare it at all. I was going to edit a declaration into the question, but you should resolve the ambiguity yourself (between long global_a = 0; and int global_a = 0;). The AMD64 SysV ABI specifies that long is 8 bytes. Use int64_t whenever you're writing portable C, though. There's no harm in writing int64_t when interoperating with asm, even when you do happen to know the sizes of short, int and long in the ABI you're using.
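As a minimal sketch of the int64_t suggestion: the declaration below assumes the 8-byte interpretation of global_a, and format_global_a is a hypothetical helper added only for illustration. PRId64 from <inttypes.h> expands to the correct printf conversion for int64_t, so the format string stays right whatever the underlying type is.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: global_a is meant to be 8 bytes, matching the asm. */
int64_t global_a = 0;

/* Hypothetical helper: formats global_a into buf and returns the
   number of characters snprintf would have written. */
int format_global_a(char *buf, size_t size)
{
    return snprintf(buf, size, "%" PRId64, global_a);
}
```

Printing the same object with %d would pass an 8-byte argument where printf expects an int, which is the mismatch described above.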

Avoid the enter instruction, unless you only care about code size, and not speed. It's horribly slow. leave is ok, maybe slower than mov %rbp, %rsp / pop %rbp, but usually you only need pop %rbp because you either didn't modify %rsp, or you needed to restore rsp anyway with add $something, %rsp before popping some other registers that you saved after %rbp.

Zeroing 64bit registers with xor %eax,%eax (2 bytes) has many advantages beyond code-size over mov $0, %rax (7 bytes: mov $sign-extended-imm32, r64).

Compare your code with compiler output: gcc -fverbose-asm -O3 -fno-inline will actually generate code from your C; all you need is a declaration of a, and to make main return an int, and it compiles just fine as C11. Of course, it mostly uses 32bit operand size because you used int, but the data movement (which thing goes in which register) is the same.

Also, the order of evaluation of function arguments is unspecified, so threeargs(sidefx(), sidefx(), sidefx()) may make the three calls in any order. (The calls can't interleave, because function calls are indeterminately sequenced, so this is unspecified rather than undefined behaviour, but you still can't rely on which call produces which argument.) I guess this is why you called it pseudo-code, not C, but it's a poor way to express what you mean.
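One way to express the intended order in real C is to give each call its own statement. The sketch below is an assumption about what the pseudo-code meant: sidefx is simplified to only increment and return a (the printf is left out), and struct three_results and call_in_order are hypothetical names.

```c
/* Assumption: a is a plain int global that starts at zero. */
static int a = 0;

/* Simplified stand-in for the question's sidefx(): bump a and return it. */
int sidefx(void)
{
    return ++a;
}

/* Hypothetical container for the three values threeargs() would receive. */
struct three_results { int one, two, three; };

/* Each call is a separate full expression, so the order is well defined. */
struct three_results call_in_order(void)
{
    int first  = sidefx();
    int second = sidefx();
    int third  = sidefx();
    return (struct three_results){ first, second, third };
}
```

The three locals could then be passed as threeargs(first, second, third), which corresponds to what the asm has to do anyway: keep each return value somewhere safe before making the next call.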

Anyway, here's your code on the Godbolt Compiler Explorer from gcc 5.3 -O3.

threeargs uses a jmp to tail-call printf, instead of call/ret.

The significant differences in main are all about correctly saving the return values from sidefx. Note that a=0 in main is not needed, because it's already initialized to zero by being in the BSS, but without -fwhole-program, gcc can't optimize it away. (A constructor could modify a before main runs, or a different definition of a with a different initializer could be linked in.)

The implementation of sidefx is noticeably tighter than yours:

sidefx:
    subq    $8, %rsp        #     aligns the stack for another function call
    movl    a(%rip), %eax   # a, tmp94      # load `a`
    movl    $.LC0, %edi     #,              # the format string
    leal    1(%rax), %esi   #, D.2311       # esi = a+1
    xorl    %eax, %eax      #               # only needed because printf is a varargs function.  Your `main` is doing this unnecessarily.
    movl    %esi, a(%rip)   # D.2311, a     # store back to the global
    call    printf  #
    movl    a(%rip), %eax   # a,            # reload a
    addq    $8, %rsp        #,
    ret

IDK why gcc didn't load into %esi in the first place and then inc %esi, instead of using lea to add one and store in a different dest. Your version moves an immediate 1 into a register, which is silly. Use immediate operands, and lea. The CPU designers already paid the x86 tax (extra design complexity to support the CISC instruction set), so make sure you get your money's worth by taking full advantage of lea and immediate operands.

Note that it doesn't store/reload a before the call to printf. Your version doesn't need to do that.

Also note that none of the functions waste instructions making stack frames.

Why does the compiler make space on the stack

You should compile your (real) code with gcc -S -fverbose-asm -O if you want to look into the generated .s assembler file.

Note that recent ABIs and calling conventions require the stack pointer to be at least 16-byte aligned (in particular, for compatibility with SSE and AVX). Also read about the Red Zone (as suggested by Zang Ming Jie).

But why did compiler put a subq $32, %rsp line here? Why doesn't it appear in the first example, without printf statement?

Probably because without any calls to printf, your main has become a leaf routine.
So the compiler doesn't need to update %rsp to be ABI-compliant (for the call frame of the called printf).

Loop with printf in NASM

Several issues, as mentioned in the comments:

array resb 10 will reserve space for 10 bytes, but you want to store 10 dwords there (40 bytes). Change to array resd 10.
(Pointed out by Sep Roland) In _loop you have an off-by-one bug; since the inc is done before the mov you will access the dwords at [array+4], [array+8], ... [array+40], where the last one is out of range. This is like doing int array[10]; for (i=1; i <= 10; i++) array[i]=i; in C, and is incorrect for exactly the same reason. One fix would be to do mov [array + ecx * 4 - 4], ecx instead.
After _loop2 you have jmp print, which will transfer control to print and never come back. Since you apparently want to call print as a subroutine and continue executing with add ecx, 1 ; cmp ecx, 10, etc, you need to call print instead of jmp. And also uncomment the ret at the end of print so that it will actually return. Subroutines in assembly language don't automatically return unless you actually execute ret; otherwise the CPU will just continue executing whatever garbage happens to be next in memory.
You have a push ecx to save the value of ecx before the call to printf, which is good since printf will overwrite that register, but you need to pop ecx afterwards to get that value back and put the stack back to where it was.
Specifically, the pop ecx should follow the add esp, 8; a stack is a last-in-first-out structure, and the push ecx was before the pushing of the printf arguments, so you need to pop ecx after removing those arguments from the stack.
The mov eax, 0 as a return value at the end of print is unnecessary since you never use it anywhere else.
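The off-by-one fix from the list can be sketched in C as well. COUNT and fill_one_based are hypothetical names; the loop keeps the 1-based counter from the original code and subtracts one when indexing, the same idea as mov [array + ecx * 4 - 4], ecx.

```c
#define COUNT 10

/* Stores 1..COUNT into array[0..COUNT-1] using a 1-based counter. */
void fill_one_based(int array[COUNT])
{
    for (int i = 1; i <= COUNT; i++) {
        array[i - 1] = i;   /* index is counter minus one, so it stays in range */
    }
}
```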

With these changes the code works as it should.
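To see why pop ecx has to come after add esp, 8, here is a toy C model of the stack. It is only a sketch: the array, the index sp, and the placeholder values standing in for the printf arguments are all hypothetical, and each slot models one dword.

```c
#include <stdint.h>

/* Toy model of a downward-growing stack of dwords. */
static uint32_t stack_mem[16];
static int sp = 16;                      /* 16 means the stack is empty */

static void push(uint32_t v) { stack_mem[--sp] = v; }
static uint32_t pop(void)    { return stack_mem[sp++]; }

/* Correct order: discard the two arguments, then pop the saved counter. */
uint32_t restore_after_cleanup(uint32_t ecx)
{
    sp = 16;
    push(ecx);       /* push ecx: save the loop counter */
    push(1234);      /* placeholder for the value argument */
    push(0xF00D);    /* placeholder for the format-string address */
    /* call printf would run here and may clobber ecx */
    sp += 2;         /* add esp, 8: drop two dwords */
    return pop();    /* pop ecx: the saved counter */
}

/* Wrong order: popping first returns the last thing pushed instead. */
uint32_t restore_before_cleanup(uint32_t ecx)
{
    sp = 16;
    push(ecx);
    push(1234);
    push(0xF00D);
    return pop();    /* yields the format-string placeholder, not ecx */
}
```

In this model restore_after_cleanup returns the counter it was given, while restore_before_cleanup returns the placeholder 0xF00D.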