Assembly X86 Brk() Call Use

What does the brk() system call do?

In the diagram you posted, the "break"—the address manipulated by brk and sbrk—is the dotted line at the top of the heap.

[diagram: simplified virtual memory layout]

The documentation you've read describes this as the end of the "data segment" because in traditional (pre-shared-libraries, pre-mmap) Unix the data segment was contiguous with the heap. Before program start, the kernel would load the "text" and "data" blocks into RAM starting at address zero (actually a little above address zero, so that the NULL pointer genuinely didn't point to anything) and set the break address to the end of the data segment. The first call to malloc would then use sbrk to move the break up, creating the heap between the top of the data segment and the new, higher break address, as shown in the diagram; subsequent calls to malloc would move the break further up to grow the heap as necessary.
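That traditional scheme can be sketched in C as a toy bump allocator built directly on sbrk. This is only a sketch of the idea, not how a real malloc works (a real malloc keeps free lists, metadata, and falls back to mmap):

```c
#define _DEFAULT_SOURCE  /* for sbrk() in glibc */
#include <unistd.h>      /* sbrk */
#include <stddef.h>
#include <stdint.h>

/* Toy bump allocator: each call moves the break up by the rounded-up
 * size and returns the old break. There is no way to free anything. */
static void *toy_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    void *old_break = sbrk((intptr_t)size); /* move the break up */
    if (old_break == (void *)-1)
        return NULL;                        /* kernel refused to grow */
    return old_break;                       /* block starts at the old break */
}
```

Because consecutive calls just keep raising the break, two back-to-back allocations come out adjacent in memory, which is exactly the "heap grows upward" picture in the diagram.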

Meantime, the stack starts at the top of memory and grows down. The stack doesn't need explicit system calls to grow; either it starts off with as much RAM allocated to it as it can ever have (this was the traditional approach), or there is a region of reserved addresses below the stack to which the kernel automatically allocates RAM when it notices an attempt to write there (this is the modern approach). Either way, there may or may not be a "guard" region between the heap and the lowest address the stack can occupy. If this region exists (all modern systems have one), it is permanently unmapped; if either the stack or the heap tries to grow into it, you get a segmentation fault. Traditionally, though, the kernel made no attempt to enforce a boundary; the stack could grow into the heap, or the heap could grow into the stack, and either way they would scribble over each other's data and the program would crash. If you were very lucky, it would crash immediately.

I'm not sure where the number 512GB in this diagram comes from. It implies a 64-bit virtual address space, which is inconsistent with the very simple memory map you have there. A real 64-bit address space looks more like this:

[diagram: less simplified 64-bit address space]

              Legend:  t: text, d: data, b: BSS

This is not remotely to scale, and it shouldn't be interpreted as exactly how any given OS does stuff (after I drew it I discovered that Linux actually puts the executable much closer to address zero than I thought it did, and the shared libraries at surprisingly high addresses). The black regions of this diagram are unmapped -- any access causes an immediate segfault -- and they are gigantic relative to the gray areas. The light-gray regions are the program and its shared libraries (there can be dozens of shared libraries); each has an independent text and data segment (and "bss" segment, which also contains global data but is initialized to all-bits-zero rather than taking up space in the executable or library on disk). The heap is no longer necessarily contiguous with the executable's data segment -- I drew it that way, but it looks like Linux, at least, doesn't do that. The stack is no longer pegged to the top of the virtual address space, and the distance between the heap and the stack is so enormous that you don't have to worry about crossing it.

The break is still the upper limit of the heap. However, what I didn't show is that there could be dozens of independent allocations of memory off there in the black somewhere, made with mmap instead of brk. (The OS will try to keep these far away from the brk area so they don't collide.)

Linux sbrk() as a syscall in assembly

The sbrk function can be implemented by getting the current value and adding the desired increment manually. Some systems allow you to get the current value with brk(0); others keep track of it in a variable [which is initialized with the address of _end, which is set up by the linker to point to the initial break value].

This is a very platform-specific thing, so YMMV.

EDIT: On Linux:

However, the actual Linux system call returns the new program break on success. On failure, the system call returns the current break. The glibc wrapper function does some work (i.e., checks whether the new break is less than addr) to provide the 0 and -1 return values described above.

So from assembly, you can call it with an absurd value like 0 or -1 to get the current value.

Be aware that you cannot "free" individual allocations made via brk - you can only lower the break, which releases everything above it - so you may want to just link in a malloc implementation written in C. Calling C functions from assembly isn't hard.

Why do x86-64 Linux system calls modify RCX, and what does the value mean?

The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.


The other interesting part of your question is:

I do not quite understand the value in the rcx register in this case

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)


Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

  • sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).

  • masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.

    Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).

  • jumps to the configured syscall entry point (RIP is loaded from the IA32_LSTAR MSR; the new CS/SS selectors come from IA32_STAR). The old CS value isn't saved anywhere, I think.

  • It doesn't do anything else; the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.

How is BareMetalOS allocating memory in Assembly without malloc, brk, or mmap?

This post explains the assembly code for the os_mem_allocate function. The basic idea is that memory is allocated in 2MB chunks. There's an array of 65536 bytes (os_MemoryMap) that keeps track of which chunks are free and which are used: a value of 1 marks a free chunk, and a value of 2 marks a used chunk. The total amount of memory that could be managed is 64K * 2MB = 128GB. Since most machines don't have that much memory, there's another variable (os_MemAmount) that indicates the memory size of the machine (in MB).

The input to the os_mem_allocate function is a count, i.e. how many 2MB chunks to allocate. The function is designed to only allocate contiguous chunks. For example, if the input request is 3, then the function attempts to allocate 6MB of memory, and does this by searching the array for three 1's in a row. The return value from the function is a pointer to the allocated memory, or 0 if the request could not be satisfied.

The input count is passed in rcx. The code verifies that the request is for a non-zero number of chunks. An input of 0 results in a return value of 0.
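Before walking through the assembly, the same algorithm can be sketched in C. The names (`mem_allocate`, `CHUNK_SHIFT`, `usable`) are mine, not BareMetal's, and this sketch scans forward where the real code scans backward, but the bookkeeping is identical. Assume, as in the real system, that chunk 0 is marked used, so a return value of 0 unambiguously means failure:

```c
#include <stdint.h>
#include <string.h>

#define MAP_SIZE    65536   /* one status byte per 2MB chunk */
#define CHUNK_SHIFT 21      /* 2MB = 1 << 21 */

static unsigned char memory_map[MAP_SIZE];  /* 1 = free, 2 = used */

/* Allocate `count` contiguous 2MB chunks; return the byte address of
 * the first chunk, or 0 on failure. `usable` is the number of chunks
 * that actually exist (os_MemAmount / 2 in the real code). */
uint64_t mem_allocate(uint64_t count, uint64_t usable) {
    if (count == 0 || count > usable)
        return 0;
    for (uint64_t start = 0; start + count <= usable; start++) {
        uint64_t run = 0;
        while (run < count && memory_map[start + run] == 1)
            run++;                                /* extend the free run */
        if (run == count) {                       /* found enough chunks */
            memset(&memory_map[start], 2, count); /* mark them used */
            return start << CHUNK_SHIFT;          /* index -> pointer */
        }
    }
    return 0;   /* no run of `count` free chunks */
}
```

The `start << CHUNK_SHIFT` at the end is the same index-to-pointer conversion the assembly performs with `shl rbx, 21`.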

os_mem_allocate:
    push rsi                     # save some registers
    push rdx
    push rbx

    cmp rcx, 0                   # Is the count 0?
    je os_mem_allocate_fail      # If YES, then return 0

The code does a roundabout calculation to point rsi to the last usable byte in the 65536 byte array. The last two lines of the following snippet are the most interesting. Setting the direction flag means that subsequent lodsb instructions will decrement rsi. And of course pointing rsi to the last usable byte in the array is the whole point of the calculation.

    xor rax, rax
    mov rsi, os_MemoryMap        # Get the address of the 65536 byte array into RSI
    mov eax, [os_MemAmount]      # Get the memory size in MB into EAX
    mov rdx, rsi                 # Keep os_MemoryMap in RDX for later use
    shr eax, 1                   # Divide by 2: os_MemAmount is in MB, but chunks are 2MB

    sub rsi, 1                   # in C syntax we're calculating &array[amount/2 - 1],
    std                          # Set direction flag to backward
    add rsi, rax                 # RSI now points to the last usable byte in the array

Next the code has a loop that searches for N contiguous free chunks, where N is the count that was passed to the function in rcx. The loop scans backwards through the array looking for N 1's in a row. The loop succeeds if rbx reaches 0. Any time the loop finds a 2 in the array, it resets rbx back to N.

os_mem_allocate_start:
    mov rbx, rcx                 # RBX is the number of contiguous free chunks we need to find

os_mem_allocate_nextpage:
    lodsb                        # read a byte into AL, and decrement RSI
    cmp rsi, rdx                 # if RSI has reached the beginning of the array
    je os_mem_allocate_fail      # then the loop has failed

    cmp al, 1                    # Is the chunk free?
    jne os_mem_allocate_start    # If NO, we need to restart the count

    dec rbx                      # If YES, decrement the count
    jnz os_mem_allocate_nextpage # If the count reaches zero we've succeeded, otherwise continue looping

At this point the code has found enough contiguous chunks to satisfy the request, so now it marks all of the chunks as "used" by setting the bytes in the array to 2. The direction flag is set to forward so that subsequent stosb instructions will increment rdi.

os_mem_allocate_mark:            # We have a suitable free series of chunks, mark them as used
    cld                          # Set direction flag to forward

    xor rdi, rsi                 # We swap RDI and RSI to keep RDI's contents, but
    xor rsi, rdi                 # more importantly we want RDI to point to the
    xor rdi, rsi                 # location in the array where we want to write 2's

    push rcx                     # Save RCX since 'rep stosb' will modify it
    add rdi, 1                   # the previous loop decremented RSI one time too many
    mov al, 2                    # the value 2 indicates a "used" chunk
    mov rbx, rdi                 # RBX is going to be used to calculate the return value
    rep stosb                    # store some 2's in the array, using the count in RCX
    mov rdi, rsi                 # Restore RDI

Finally, the function needs to come up with a pointer to return to the caller.

    sub rbx, rdx                 # RBX is now an index into the 65536 byte array
    pop rcx                      # Restore RCX
    shl rbx, 21                  # Multiply by 2MB to convert the index to a pointer
    mov rax, rbx                 # Return the pointer in RAX
    jmp os_mem_allocate_end

The next snippet handles errors by setting the return value to 0. Clearing the direction flag is important since by convention the direction is forward.

os_mem_allocate_fail:
    cld                          # Set direction flag to forward
    xor rax, rax                 # Failure, so set RAX to 0 (no chunks allocated)

Finally, restore the registers and return the pointer.

os_mem_allocate_end:
    pop rbx
    pop rdx
    pop rsi
    ret

