What Is Global _Start in Assembly Language

What is global _start in assembly language?

The global directive is NASM syntax (GNU as spells it .globl or .global). It exports a symbol from your code, together with the address it labels, into the generated object code. Here you mark the _start symbol as global so that its name is added to the object code (a.o). The linker (ld) can then read that symbol and its value from the object code, so it knows which address to mark as the entry point in the output executable. When you run the executable, execution starts at the address labeled _start in the code.

If the global directive is missing for a symbol, that symbol is not exported from the object code, so the linker has no way of finding it by name.

If you want to use an entry point name other than _start (which is the default), you can pass the -e option to ld:

ld -e my_entry_point -o out a.o

What is the difference in using global _main and global _start in the text section of asm

main (or _main, or main_ with OpenWatcom) is the name known to the C language. It is called by the "startup code" that is usually linked in when you are using C.

_start is known to the linker ld (in Linux) as the default entry point (another symbol can be used). It is entered directly, not called, so there is no return address on the stack. The stack starts with the number of arguments. Your OS may differ.
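The stack layout described above can be sketched in C. This is a minimal sketch that assumes the usual Linux process-entry layout, where the argument count is followed by the argv pointers, a NULL, and then the environment pointers. The helper name env_from_argv is hypothetical and only illustrates the arithmetic.

```c
#include <stddef.h>

/*
 * Assumed initial stack layout on Linux at _start (lowest address first):
 *   argc, argv[0] ... argv[argc-1], NULL, envp[0] ..., NULL
 * There is no return address below argc, which is why _start cannot ret.
 *
 * env_from_argv is a hypothetical helper: given argc and argv, it steps
 * over the argv pointers and their NULL terminator to reach envp.
 */
char **env_from_argv(int argc, char **argv)
{
    return argv + argc + 1;
}
```

Hand-written _start code typically finds its arguments the same way: it reads the count from the top of the stack and indexes upward from there, rather than expecting them in a calling-convention register.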

Can the _start symbol in the assembly be replaced with another word?

This is a GNU ld question, not a nasm one. When ld links, it looks for that symbol to mark as the entry point. Your question is vague about the target, but mentioning nasm implies x86, and Linux at least is specific.

Since the program you are building is loaded by an operating system like Linux, the entry point is critical, unless of course you manipulate the binary in some way or otherwise tell the linker what your entry point is. If the program is not executed in the proper order, it will not operate properly and will quite likely simply crash. You can't just jump into the middle of a program and hope for success, much less start executing in .data or something else that is not code.

As mentioned in the comments, you can change the entry point label if you don't want to use the _start label. If you do not define _start, ld will give a warning and continue, but if you don't give it another label then you are at risk of it entering in the wrong place.

If this were bare-metal, for a microcontroller for example, then there is no operating system loading the program into memory and entering the binary wherever you specify. You are instead governed by the hardware/logic: you have to craft the code, linker script, command line, etc. so that the generated binary matches the entry point the logic dictates. In that case you can go without _start altogether and take whatever default ld puts in its output binary, which at some point is used to program the flash/ROM in the MCU (stripping all of that knowledge, including the entry point, from the binary file).

I am not so sure about nasm, but assume you are always in some section, so the label will land somewhere. If it is not in a .text section and you are using it as the entry point (by default, by not specifying something else), execution will not begin at your code. Even if the label is the last line before a .text section declaration, the linker places it with the other labels in the section where it lands. Because it sits just before the .text declaration rather than just after, it may get an address that is nowhere near the code that follows it in the source file.

Here are some examples using gnu tools. The question is ld specific, so the target and assembler don't necessarily matter here.

MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
two : ORIGIN = 0x2000, LENGTH = 0x1000
three : ORIGIN = 0x3000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > one
.data : { *(.data*) } > two
.bss : { *(.bss*) } > three
}

.globl _start
_start:
nop

Building this and running readelf on the result gives:

  Entry point address:               0x1000

Now if I add a label before _start and another after it:

.globl here
here:
nop

.globl _start
_start:
nop

.globl there
there:
nop


00001000 <here>:
1000: e1a00000 nop ; (mov r0, r0)

00001004 <_start>:
1004: e1a00000 nop ; (mov r0, r0)

00001008 <there>:
1008: e1a00000 nop ; (mov r0, r0)

Entry point address: 0x1000

And that may be confusing... but let's move on.

arm-linux-gnueabi-ld -nostdlib -nostartfiles -e _start -T so.ld so.o -o so.elf

Entry point address: 0x1004

Or instead

ENTRY(_start)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
...


Entry point address: 0x1004

But I can also do this:

.globl here
here:
nop

nop

.globl there
there:
nop

ENTRY(there)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000

Entry point address: 0x1008

Note that the linker didn't warn about the missing _start.

If I now remove ENTRY() from the linker script:

  Entry point address:               0x1000

But if I do this:

arm-none-eabi-ld so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000

That means there is no linker script, so ld uses its defaults, and those defaults look for _start. We can reproduce that ourselves with

ENTRY(_start)
MEMORY
{

but with no global _start label defined:

arm-linux-gnueabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000

So if you are simply doing

nasm stuff myprog.asm stuff myprog.o
ld myprog.o -o myprog

You are using the default linker settings and script for your tools and environment, which likely contain ENTRY(_start) or an equivalent. If you are in complete control of the linker and you want to load a program into Linux, then you need a safe, sane entry point for the program to work. Otherwise ld defaults to the beginning of the binary or the beginning of .text, which we can test:

SECTIONS
{
.text : { *(.text*) } > two
.data : { *(.data*) } > one
.bss : { *(.bss*) } > three
}

.globl here
here:
nop

.data
.word 0x12345678

arm-linux-gnueabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000002000


Disassembly of section .text:

00002000 <here>:
2000: e1a00000 nop ; (mov r0, r0)

Disassembly of section .data:

00001000 <.data>:
1000: 12345678

So the default is the beginning of .text, not the lowest address in the binary. The linker will also accept a data label as the entry point:

ENTRY(somedata)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
two : ORIGIN = 0x2000, LENGTH = 0x1000
three : ORIGIN = 0x3000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > two
.data : { *(.data*) } > one
.bss : { *(.bss*) } > three
}


.globl here
here:
nop

.data
.globl somedata
somedata: .word 0x12345678

Entry point address: 0x1000

This is as trivial to do with nasm and ld as it was above with gas and ld. It shows that, as far as ld (or even gcc) is concerned, _start is no more magic than main() is. _start feels magic because default linker scripts call it out. main() seems magic because the language defines it as such, but in reality it is the bootstrap that makes it so. If you simply run

gcc helloworld.c -o helloworld

you get the default bootstrap and linker script. But you could write your own bootstrap, or modify the one in your C library, and have no main() in your program at all; the tools don't care, and it will work fine. (Not all tools, of course: some do detect main() and add critical pieces that might not otherwise get added, especially for C++.) The gnu tools are particularly flexible and generic, which makes them usable for so many targets, from bare-metal to kernel drivers to operating system applications.

Use the tools you have; they are very powerful. Do experiments like the ones above first.

nasm: Is global _start: allowed?

I tried to find out why NASM might not reject that:

GLOBAL, like EXTERN, allows object formats to define private
extensions by means of a colon. The elf object format, for example,
lets you specify whether global data items are functions or data:

global hashlookup:function, hashtable:data

Like EXTERN, the primitive
form of GLOBAL differs from the user-level form only in that it can
take only one argument at a time.

-- http://www.nasm.us/doc/nasmdoc6.html

So it's just a quirk of the parser. It assembles, but I didn't check what it assembled to.

Do Not Do This. It's obviously wrong, and you shouldn't expect it to work. It only happens to work by luck. As always in these cases, stick to the normal syntax when there's nothing to be gained from doing otherwise. Even if you don't confuse the compiler/assembler, you will confuse other human readers.

global main in Assembly

global main basically means that the symbol should be visible to the linker because other object files will use it. Without it, the symbol main is considered local to the object file it is assembled into, and other object files cannot refer to it at link time.

Return values in main vs _start

TL;DR: function return values and system-call arguments use separate registers because they're completely unrelated.


When you compile with gcc, it links CRT startup code that defines a _start. That _start (indirectly) calls main, and passes main's return value (which main leaves in EAX) to the exit() library function. (Which eventually makes an exit system call, after doing any necessary libc cleanup like flushing stdio buffers.)

See also Return vs Exit from main function in C. That is exactly analogous to what you're doing, except that you're using _exit(), which bypasses libc cleanup, instead of exit(). See also Syscall implementation of exit().

An int $0x80 system call takes its argument in EBX, as per the 32-bit system-call ABI (which you shouldn't be using in 64-bit code). It's not a return value from a function, it's the process exit status. See Hello, world in assembly language with Linux system calls? for more about system calls.

Note that _start is not a function; it can't return in that sense because there's no return address on the stack. You're taking a casual description like "return to the OS" and conflating that with a function's "return value". You can call exit from main if you want, but you can't ret from _start.

EAX is the return-value register for int-sized values in the function-calling convention. (The high 32 bits of RAX are ignored because main returns int. But also, $? exit status can only get the low 8 bits of the value passed to exit().)
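The point about only the low 8 bits reaching the exit status can be checked from C. This is a minimal sketch assuming a POSIX system; the helper name observed_exit_status is hypothetical. It runs a child that calls _exit() and reports what the parent sees, which is the value a shell would show as $?.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run a child process that exits with `value`
 * and return the status the parent observes.
 * Returns -1 if fork or waitpid fails or the child did not exit normally.
 */
int observed_exit_status(int value)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(value);           /* child: exit without libc cleanup */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status); /* only the low 8 bits survive */
}
```

Under these assumptions, observed_exit_status(300) reports 44, because 300 is 0x12C and only the low byte 0x2C is kept.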

Related:

  • Why am I allowed to exit main using ret?
  • What happens with the return value of main()?
  • where goes the ret instruction of the main
  • What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains why you should use syscall, and shows some of what happens on the kernel side after a system call.

Understanding assembly language _start label in a C program

Here is the well commented assembly source of the code you posted.

Summarized, it does the following things:

  1. Establish a sentinel stack frame with ebp = 0 so that code walking the stack can find its end easily
  2. Pop the number of command line arguments into esi so we can pass them to __libc_start_main
  3. Align the stack pointer to a multiple of 16 bytes in order to comply with the ABI. This is not guaranteed to be the case in some versions of Linux, so it has to be done manually just in case.
  4. The addresses of __libc_csu_fini, __libc_csu_init, the argument vector, the number of arguments and the address of main are pushed as arguments to __libc_start_main
  5. __libc_start_main is called. This function sets up some glibc-internal variables and eventually calls main. It never returns.
  6. If for any reason __libc_start_main should return, a hlt instruction is placed afterwards. This instruction is not allowed in user code and should cause the program to crash (hopefully).
  7. The final series of nop instructions is padding inserted by the assembler so the next function starts at a multiple of 16 bytes for better performance. It is never reached in normal execution.
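The stack-alignment step in the list above is a mask operation. A minimal sketch in C follows; the function name align_stack_16 is hypothetical, and startup code typically does this with a single and instruction on the stack pointer rather than a function call.

```c
#include <stdint.h>

/*
 * Round a stack-pointer value down to a 16-byte boundary by clearing
 * its low four bits. The stack grows downward on x86, so rounding down
 * only reserves a few extra bytes and never overwrites live data.
 */
uintptr_t align_stack_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)0xF;
}
```

For example, a value ending in 0x...9 becomes 0x...0, while a value that is already a multiple of 16 is left unchanged.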

Default value of _start


However, when I remove the line .globl _start ...

The .globl line means that the name _start is "visible" outside the file file.s. If you remove that line, the name _start is only for use inside the file file.s and in a larger program (containing multiple files) you could even use the name _start in multiple files.

(This is similar to static variables in C/C++: If you generate assembler code from C or C++, the difference between real global variables and static variables is that there is a .globl line for the global variables and no .globl line for static variables. And if you are familiar with C, you know that static variables cannot be used in other files.)
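The static comparison can be made concrete with a small C file. This is a sketch with hypothetical names; compiling it to assembly (for example with gcc -S) would be expected to show a .globl line for exported_counter and next_value, and none for the static names. An optimizing build may drop the static names entirely.

```c
int exported_counter = 0;          /* global symbol: other files can link to it */
static int private_step = 2;       /* local symbol: usable in this file only */

static int add_step(int value)     /* local symbol: no .globl line expected */
{
    return value + private_step;
}

int next_value(void)               /* global symbol: .globl line expected */
{
    exported_counter = add_step(exported_counter);
    return exported_counter;
}
```

Another file could declare extern int exported_counter; and call next_value(), but any attempt to reference add_step or private_step from that file would fail at link time, just as ld fails to find a _start that has no .globl line.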

The linker (ld) is also not able to use the name _start if it can be used inside the file only.

What does 0000000000400078 mean?

0x400078 is the address of the first byte of your program's code. ld assumes that the program starts at that first byte if no symbol named _start is found.
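One plausible reading of the 0x78 offset, sketched in C: the address is the load base plus the size of the ELF header and the program headers that precede the code. This assumes a 64-bit ELF executable with a single program header whose first byte is mapped at 0x400000; the source does not state these details, and the helper name first_code_address is hypothetical.

```c
#include <elf.h>     /* Linux header providing Elf64_Ehdr and Elf64_Phdr */
#include <stdint.h>

/*
 * Hypothetical helper: address of the first byte after the ELF header
 * and `phdr_count` program headers, assuming the start of the file is
 * mapped at `base`.
 */
uint64_t first_code_address(uint64_t base, unsigned phdr_count)
{
    return base + sizeof(Elf64_Ehdr)
                + (uint64_t)phdr_count * sizeof(Elf64_Phdr);
}
```

With a 64-byte ELF header and one 56-byte program header, first_code_address(0x400000, 1) gives 0x400078. A file with more program headers, or a linker that pads the headers to a page boundary, would place the code elsewhere.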

... why is it even necessary to declare .globl _start?

It is not guaranteed that _start is located at the first byte of your program.

Counterexample:

.globl _start

write_stdout:
mov $4, %eax
mov $1, %ebx
int $0x80
ret

exit:
mov $1, %eax
mov $0, %ebx
int $0x80
jmp exit

_start:
mov $text, %ecx
mov $(textend-text), %edx
call write_stdout
mov $text2, %ecx
mov $(textend2-text2), %edx
call write_stdout
call exit

text:
.ascii "Hello\n"
textend:
text2:
.ascii "World\n"
textend2:

If you remove the .globl line, ld will not be able to find the _start: line and assume that your program starts at the first byte - which is the write_stdout: line!

... and if you have multiple .s files in a larger program (or even a combination of .s, .c and .cc), you have no control over which code is located at the first byte of your program!

Assembly, ld cannot find _start;

Hard to tell without seeing the whole code.

Here's a very small program that you can use as a skeleton:

section .text
global _start ;export _start so the linker (ld) can find it
_start: ;default entry point
mov edx, len ;message length
mov ecx, msg ;message to write
mov ebx, 1 ;fd to use is stdout
mov eax, 4 ;sys_write
int 0x80 ;call kernel
mov ebx, 0 ;exit status 0 (ebx still held 1 from the write)
mov eax, 1 ;sys_exit
int 0x80 ;call kernel

section .data

msg db 'Hello, stack overflow!',0xa ; say hello to the guys
len equ $ - msg ;length of the string

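To assemble (into an object file), use nasm -g -f elf sample.s

To link, use ld -o sample sample.o (on a 64-bit host, ld -m elf_i386 -o sample sample.o, because -f elf produces a 32-bit object).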


