Significance of Address 0X8048080


There is no special significance to address 0x8048080, but there is one for address 0x08048000.

The latter is the default address at which ld starts the first PT_LOAD segment on Linux/x86. On Linux/x86_64 the default is 0x400000. You can change the default by using a custom linker script, and you can change where the .text section starts with the -Wl,-Ttext,0xNNNNNNNN flag.

After ld starts at 0x08048000, it adds space for program headers, and proceeds to link the rest of the executable according to its built-in linker script, which you can see if you pass in -Wl,--verbose to your link line.

For your program, the size of program headers appears to always be 0x80, so your .text section always starts at 0x8048080, but that is by no means universal.

When I link a trivial int main() { return 0; } program, I get &_start == &.text at 0x8048300, 0x8048178 or 0x8048360, depending on which compiler I use.
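As a rough illustration of where an offset like 0x80 can come from, here is a small sketch of the arithmetic. The ELF32 header and program header sizes are fixed by the format; the inputs (2 program headers, 16-byte .text alignment) are my assumptions about the asker's binary, not something stated above, and `text_start` is a made-up helper name.

```python
# ELF32 structure sizes (fixed by the ELF format).
ELF32_EHDR_SIZE = 52   # sizeof(Elf32_Ehdr)
ELF32_PHDR_SIZE = 32   # sizeof(Elf32_Phdr)


def text_start(base, num_phdrs, text_align):
    """Estimate where .text lands when it directly follows the headers.

    Hypothetical helper: assumes nothing else sits between the program
    headers and .text, and that text_align is a power of two.
    """
    headers_end = ELF32_EHDR_SIZE + num_phdrs * ELF32_PHDR_SIZE
    # Round up to the section alignment.
    aligned = (headers_end + text_align - 1) & ~(text_align - 1)
    return base + aligned


# Assumed inputs: 2 program headers, 16-byte .text alignment.
print(hex(text_start(0x08048000, 2, 16)))   # 0x8048080
```

A toolchain that emits more program headers, or places .interp, .hash and similar sections before .text, gets a different result, which would explain the varying addresses for the compiled C program.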

Why do virtual memory addresses for linux binaries start at 0x8048000?

From the Linkers and loaders book:

On 386 systems, the text base address is 0x08048000, which permits a reasonably large stack below the text while still staying above address 0x08000000, permitting most programs to use a single second-level page table. (Recall that on the 386, each second-level table maps 0x00400000 addresses.)
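The arithmetic behind that quote can be checked with a short sketch. It assumes the classic 386 two-level paging layout (4096-byte pages, 1024 entries per second-level table); the names below are mine.

```python
PAGE_SIZE = 4096            # bytes per page on the 386
ENTRIES_PER_TABLE = 1024    # entries in one second-level page table
TABLE_SPAN = PAGE_SIZE * ENTRIES_PER_TABLE   # addresses mapped by one table


def page_directory_index(addr):
    """Which second-level table (page directory slot) covers addr."""
    return addr // TABLE_SPAN


TEXT_BASE = 0x08048000
LOW_BOUND = 0x08000000

print(hex(TABLE_SPAN))                       # 0x400000
print(page_directory_index(LOW_BOUND))       # 32
print(page_directory_index(TEXT_BASE))       # 32
print((TEXT_BASE - LOW_BOUND) // 1024)       # 288 (KiB available below the text)
```

Both addresses fall in slot 32, so a stack placed between 0x08000000 and the text, plus up to roughly 4 MiB minus 288 KiB of text and data above it, can be served by one second-level table.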

Why does my data section appear twice in the compiled binary? Ubuntu, x86, nasm, gdb, readelf

Let's look at the LOAD segments:

Type Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
LOAD 0x000000 0x08048000 0x08048000 0x0009d 0x0009d R E 0x1000
LOAD 0x00009d 0x0804909d 0x0804909d 0x00010 0x00010 RW  0x1000

The first one instructs the loader to mmap 0x9d bytes from file offset 0 into virtual memory at address 0x08048000.

The loader can't do exactly that, because memory mapping only works at one page (4096 bytes) granularity. So it mmaps the .text, and everything that follows it in the file, up to one page, at address 0x08048000.

This means that whatever .data followed .text in the file after offset 0x9d will appear at address 0x0804809d and later, but with wrong permissions (Read and Execute).

The second LOAD segment instructs the loader to mmap file contents, starting at offset 0x9d at virtual address 0x0804909d.

The loader can't do exactly that either for the same "page granularity" reason.

Instead, it will round down the offset and the address, and mmap file contents starting from offset 0 at address 0x08049000.

That means that whatever .text preceded .data in the file will appear at addresses before 0x0804909d, again with the wrong permissions (Read and Write this time).

You can confirm that that's what's happening by using GDB x/10i 0x8049080 -- you will see exactly the same instructions as with x/10i 0x8048080.

You can also observe the actual mmap system calls the loader performed with strace.
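The rounding described above can be sketched as follows. It assumes a 4096-byte page and a file small enough to fit in one page; `page_mapping` and `address_of` are illustrative names, not anything the loader exposes.

```python
PAGE = 0x1000  # assumed page size (4096 bytes)


def page_mapping(offset, vaddr):
    """Round a segment's file offset and address down to a page boundary."""
    return offset & ~(PAGE - 1), vaddr & ~(PAGE - 1)


def address_of(file_off, offset, vaddr):
    """Address at which file byte file_off shows up under this mapping."""
    map_off, map_addr = page_mapping(offset, vaddr)
    return map_addr + (file_off - map_off)


# (Offset, VirtAddr) of the two LOAD segments shown above.
segments = [(0x00, 0x08048000), (0x9D, 0x0804909D)]

# File offset 0x80 is visible through both mappings.
for off, va in segments:
    print(hex(address_of(0x80, off, va)))
# 0x8048080
# 0x8049080
```

The same file bytes therefore appear at two addresses one page apart, which is why the two x/10i commands show identical instructions.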

Why can't I single-step into the aeskeygenassist instruction in self-modifying code?

I assume you forgot to link with --omagic to make the .text section writable.

So mov BYTE PTR ds:0x804807f,ah segfaults, and it's right before aeskeygenassist. You can't keep single-stepping after your program crashes. (You have no handler for SIGSEGV, and the default action is to terminate your program).

I tried this on my desktop out of curiosity. I can imagine interpreting the behaviour as single-stepping getting "stuck" before aeskeygenassist, if I ignore the segfault message and the fact that trying again says "the program is no longer running".

From a GDB session:

(gdb) layout reg
(gdb) starti          # like run with an implicit breakpoint on the first instruction
(gdb) si
0x0000000000401004 in _start ()
0x0000000000401008 in _start ()     ## I kept pressing return to repeat the command
0x000000000040100c in _start ()
0x000000000040100e in roundloop ()
0x0000000000401012 in roundloop ()
0x0000000000401014 in roundloop ()    # the MOV store

Program received signal SIGSEGV, Segmentation fault.
0x0000000000401014 in roundloop ()    # still pointing at the MOV store

Notice that RIP is still pointing at the mov: that's 0x8048074 in your 32-bit build, and 0x401014 in my 64-bit build of the same source.

From the ld manual:

-N
--omagic
Set the text and data sections to be readable and writable. Also, do not page-align the data segment, and disable linking against shared libraries. If the output format supports Unix style magic numbers, mark the output as "OMAGIC". Note: Although a writable text section is allowed for PE-COFF targets, it does not conform to the format specification published by Microsoft.

Your code works fine for me if I link with:

  nasm -felf64 aes.asm &&
  ld --omagic aes.o -o aes

Alternatively, you could make an mprotect system call to give the page containing this code PROT_READ|PROT_WRITE|PROT_EXEC.
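If you go the mprotect route, keep in mind that the address argument has to be page-aligned. A sketch of computing the arguments for the byte your store targets (0x804807f in the 32-bit build), assuming 4096-byte pages; `mprotect_range` is a made-up helper, not a library call.

```python
PAGE = 0x1000  # assumed page size (4096 bytes)


def mprotect_range(addr, length):
    """Page-aligned (start, size) covering [addr, addr + length)."""
    start = addr & ~(PAGE - 1)
    end = (addr + length + PAGE - 1) & ~(PAGE - 1)
    return start, end - start


start, size = mprotect_range(0x804807F, 1)
print(hex(start), hex(size))   # 0x8048000 0x1000
```

Those two values, together with PROT_READ|PROT_WRITE|PROT_EXEC, are what you would load into the system-call argument registers.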

GDB's layout reg disassembly window even updates the disassembly of aeskeygenassist after its immediate is modified by the store.

Also note that Self-Modifying Code (SMC) is extremely slow on modern x86: every store near instructions being executed triggers a full pipeline nuke. You'd be much better off unrolling with an assembler macro.

Also, you can't ret from _start under Linux; it's not a function. The stack pointer points to argc, not a return address. Make an _exit system call instead (with int 0x80 in 32-bit code). When I said "works" above, I meant that it reaches that ret and then segfaults on code-fetch from address 1 after popping argc into RIP.

Also, use default rel for RIP-relative addressing of the store; it's more compact. Or I guess you're building a 32-bit executable out of this for some reason, based on your code addresses. I didn't notice that at first, that's why I tested as a 64-bit executable. Fortunately you used labels correctly, and aeskeygenassist is the same length in both modes, so it still works.

x86 labels and LEA in GDB

It's somewhat confusing because gdb doesn't really understand the concept of labels -- it's designed to debug a program written in a higher-level language (C or C++, generally) and compiled by a compiler. So it tries to map what it sees in the binary to high-level language concepts -- variables and types -- based on its best guess as to what is going on (in the absence of debug info from the compiler that tells it what is going on).

what nasm does

To the assembler, a label is a value that hasn't been set yet -- it actually gets its final value when the linker runs. Generally, labels are used to refer to addresses in sections of memory -- the actual address will get defined when the linker lays out the final executable image. The assembler generates relocation records so that uses of the label can be set properly by the linker.

So when the assembler sees

mov eax, msg

it knows that msg is a label corresponding to an address in the data segment, so it generates an instruction to load that address into eax. When it sees

mov eax, [msg]

it generates an instruction to load 32-bits (the size of register eax) from memory at address of msg. In both cases, there will be a relocation generated so that the linker can plug in the final address msg ends up with.

(aside -- I have no idea what & means to nasm -- it doesn't appear anywhere in the documentation I can see, so I'm surprised it doesn't give an error. But it looks like it treats it as an alias for [])

Now LEA is a funny instruction -- it has basically the same format as a move from memory, but instead of reading memory, it stores the address it would have read from into the destination register. So

lea eax, msg

makes no sense -- the source is the label (address) msg, which is a (link time) constant and is not in memory anywhere.

lea eax, [msg]

works, as the source is in memory, so it sticks the address of the source into eax. This is the same effect as mov eax, msg. Most commonly, you only see lea used with more complex addressing modes, so that you can leverage the x86 AGU to do useful work other than just computing addresses. Eg:

lea eax, [ebx+4*ecx+32]

which does a shift and two adds in the AGU and puts the result into eax rather than loading from that address.
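To make the address computation concrete, here is a tiny model of what that lea produces. The register values are made up for illustration, and `lea` is just a Python function standing in for the instruction.

```python
def lea(base, index, scale, disp):
    """Model a 32-bit lea: compute base + index*scale + disp, no memory access.

    The result wraps to 32 bits, as it would in a 32-bit register.
    """
    return (base + index * scale + disp) & 0xFFFFFFFF


# lea eax, [ebx+4*ecx+32] with hypothetical ebx = 0x1000, ecx = 3
eax = lea(base=0x1000, index=3, scale=4, disp=32)
print(hex(eax))   # 0x102c
```

Nothing at address 0x102c is read; the number itself is the result, which is why lea is often used as a cheap multiply-and-add.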

what gdb does

In gdb, when you type p <expression> it tries to evaluate <expression> to the best of its understanding of what the C/C++ compiler means for that expression. So when you say

(gdb) p msg

it looks at msg and says "that looks like a variable, so let's get the current value of that variable and print that". Now it knows that compilers like to put global variables into the .data segment, and that they create symbols for those variables with the same name as the variable. Since it sees msg in the symbol table as a symbol in the .data segment, it assumes that is what is going on, and fetches the memory at that symbol and prints it. Now it has no idea what TYPE that variable is (no debug info), so it guesses that it is a 32-bit int and prints it as that.

So the output

$1 = 1700946284

is the first 4 bytes of msg, treated as an integer.
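You can reproduce that number outside gdb. The sketch below takes the first four bytes of the string ("labe", going by the string contents gdb prints for msg) and reads them as a little-endian signed 32-bit integer, which is how x86 stores an int.

```python
first_four = b"labe"   # first 4 bytes of msg

# x86 is little-endian: the first byte in memory is the least significant.
value = int.from_bytes(first_four, byteorder="little", signed=True)

print(value)        # 1700946284
print(hex(value))   # 0x6562616c  ('e' 'b' 'a' 'l' from high byte to low)
```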

For p &msg it understands you want to take the address of the variable msg, so it gives the address from the symbol directly. When printing addresses, gdb prints the type information it has about those addresses, thus the "data variable, no debug info" that comes out with it.

If you want, you can use a cast to specify the type of something to gdb, and it will use that type instead of what it has guessed:

(gdb) p (char)msg
$6 = 108 'l'
(gdb) p (char [10])msg
$7 = "labeled st"
(gdb) p (char *)&msg
$8 = 0x80490e4 "labeled string\\nunlabeled-string\\n\n\n\n\n\n\n\n\n" <Address 0x804910e out of bounds>

Note in the latter case here, there's no NUL terminator on the string, so it prints out the entire data segment...

To print the unlabelled string with sys_write, you need to figure out the address and length of the string, which you almost have. For completeness you should also check the return value:

    mov ebx, 1           ; fd 1 (stdout)
    lea ecx, [msg+15]    ; address
    mov edx, 17          ; length
write_more:
    mov eax, 4           ; sys_write
    int 80H              ; write(1, &msg[15], 17)
    test eax, eax        ; check for error
    js error             ; error, eax = -ERRNO
    add ecx, eax
    sub edx, eax
    jg write_more        ; only part of the string was written
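If the control flow of that loop is hard to follow in assembly, here is the same idea sketched in Python. `write_all` is a name I made up; os.write returns the number of bytes actually written and raises OSError on failure, which plays the role of the js error branch.

```python
import os


def write_all(fd, data):
    """Call write repeatedly until every byte of data has been written."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)   # may write fewer bytes than requested
        view = view[written:]          # like: add ecx, eax / sub edx, eax


# Usage with a pipe and a hypothetical payload, so the result can be read back.
read_fd, write_fd = os.pipe()
write_all(write_fd, b"some bytes\n")
os.close(write_fd)
print(os.read(read_fd, 100))   # b'some bytes\n'
os.close(read_fd)
```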

Why Segment fault when writing to writeable .data section? Using Ubuntu, x86, nasm, gdb, readelf

Debugging with gdb confirms the data is contiguous with the code at run time and readelf analysis of the program confirms the data segment is writeable.

You are expecting db '...' to immediately follow CALL one.

That does not actually happen: your .data section is in a different segment (because it needs different permissions):

readelf -Wl myshdb
Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x08048000 0x08048000 0x00094 0x00094 R   0x1000
  LOAD           0x001000 0x08049000 0x08049000 0x0001d 0x0001d R E 0x1000
  LOAD           0x002000 0x0804a000 0x0804a000 0x00010 0x00010 RW  0x1000

 Section to Segment mapping:
  Segment Sections...
   00
   01     .text
   02     .data

Note that .data is in segment 02 (the third LOAD segment in the listing, flags RW), and that segment begins on a different page from .text.

What may be confusing you is that your linker may leave a copy of .data following code for two (my version doesn't -- it's all 0s for me).

In any case, your code as written tries to write into the segment that holds .text (segment 01, flags R E), at the location immediately after the end of two, and that segment is (clearly) not writable.
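A quick way to reason about this is to look up which LOAD segment covers the address being written. The sketch below hard-codes the three segments from the readelf output above and assumes 4096-byte pages; the target address (first byte past the end of the .text segment) is my guess at where your store lands, and `flags_at` is an illustrative helper.

```python
PAGE = 0x1000  # assumed page size (4096 bytes)

# (VirtAddr, MemSiz, flags) taken from the readelf listing above.
SEGMENTS = [
    (0x08048000, 0x94, "R"),
    (0x08049000, 0x1D, "RE"),
    (0x0804A000, 0x10, "RW"),
]


def flags_at(addr):
    """Flags of the page-rounded mapping that contains addr, or None."""
    for vaddr, memsz, flags in SEGMENTS:
        start = vaddr & ~(PAGE - 1)
        end = (vaddr + memsz + PAGE - 1) & ~(PAGE - 1)
        if start <= addr < end:
            return flags
    return None


target = 0x08049000 + 0x1D          # assumed: first byte after the code
print(flags_at(target))             # RE
print("W" in flags_at(target))      # False -> the store faults
print(flags_at(0x0804A000))         # RW   -> where .data really lives
```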
