What is global _start in assembly language?
The global directive is NASM-specific. It exports a symbol from your code into the generated object code. Here you mark the _start symbol global so its name is added to the object code (a.o). The linker (ld) can read that symbol and its value from the object code, so it knows what to mark as the entry point in the output executable. When you run the executable, it starts at the location marked _start in the code.
If the global directive is missing for a symbol, that symbol will not be placed in the object code's export table, so the linker has no way of knowing about it.
If you want to use an entry point name other than _start (which is the default), you can pass the -e parameter to ld, like this:
ld -e my_entry_point -o out a.o
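As a minimal sketch of the above, assuming 32-bit x86 Linux and the int 0x80 system-call ABI: the file name entry.asm and the exit status 42 are made up for illustration, and my_entry_point is the label from the ld command above.

```nasm
; entry.asm (hypothetical file name)
; Assemble: nasm -f elf32 entry.asm -o a.o
; Link:     ld -m elf_i386 -e my_entry_point -o out a.o
section .text
global my_entry_point       ; export the label so ld can resolve -e my_entry_point

my_entry_point:
    mov eax, 1              ; sys_exit in the 32-bit int 0x80 ABI
    mov ebx, 42             ; exit status (arbitrary value for the demo)
    int 0x80                ; enter the kernel; this never returns
```

If the global line is removed, ld should no longer be able to resolve my_entry_point, because the name is not exported from a.o.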
What is the difference in using global _main and global _start in the text section of asm
main or _main or main_ (OpenWatcom) is known to the C language, and is called by "startup code" which is "usually" linked in, if you're using C.
_start is known to the linker ld (in Linux) as the default entry point (another symbol can be used) and is not called. Thus, there is no return address on the stack. The stack starts with the number of arguments. Your OS may differ.
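A small sketch of that stack layout, assuming 32-bit x86 Linux (the file name argc.asm is hypothetical): at _start the word at the top of the stack is the argument count, not a return address, so the program can hand it straight to sys_exit.

```nasm
; argc.asm (hypothetical file name)
; Assemble: nasm -f elf32 argc.asm -o argc.o
; Link:     ld -m elf_i386 argc.o -o argc
section .text
global _start

_start:
    mov ebx, [esp]          ; top of stack at entry is argc (no return address here)
    mov eax, 1              ; sys_exit
    int 0x80                ; exit status = argc (program name counts as one)
```

Running it with two extra arguments and then checking the shell's $? should report 3 under these assumptions.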
Can the _start symbol in the assembly be replaced with another word?
This is a GNU ld question, not a NASM one. When ld links, it looks for that symbol to mark as the entry point. Your question is vague as to the target, but mentioning NASM implies x86, and Linux is not vague.
Since the program is loaded by an operating system like Linux, the entry point is critical, unless you manipulate the binary in some way or otherwise tell the linker what your entry point is. If the program is not executed in the proper order, it will not operate properly and will quite likely crash. You can't just jump into the middle of a program and hope for success, much less begin executing in .data or something else that is not code.
As mentioned in the comments, you can change the entry point label if you don't want to use _start. If you do not define _start, ld will give a warning and continue, but if you don't give it another label you risk entering the program in the wrong place.
If this were bare metal, on a microcontroller for example, there would be no operating system loading the program into memory and entering the binary wherever you specify. You are instead governed by the hardware/logic and have to conform to its rules, crafting the code, linker script, command line, etc. so that the binary matches the entry point the logic specifies. In that case you can go without _start altogether and take whatever default ld puts in its output binary, which is at some point used to program the flash/ROM in the MCU (stripping all of that knowledge, including the entry point, from the binary file).
I am not so sure about NASM, but I assume you are always in some section, so the label will land somewhere. If it is not in a .text section and you use it as the entry point (by default, by not specifying something else), execution can begin somewhere that is not your code. Even if the label is on the last line before a .text section declaration, the linker puts it with the other labels in the section where it lands. So because it sits just before the .text declaration rather than just after, it may end up with an address that is nowhere near the code that follows it in the source file.
Some examples, using GNU tools. The question is ld-specific, so the target and assembler don't necessarily matter here.
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
two : ORIGIN = 0x2000, LENGTH = 0x1000
three : ORIGIN = 0x3000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > one
.data : { *(.data*) } > two
.bss : { *(.bss*) } > three
}
.globl _start
_start:
nop
Build it and use readelf:
Entry point address: 0x1000
Now if I add more labels:
.globl here
here:
nop
.globl _start
_start:
nop
.globl there
there:
nop
00001000 <here>:
1000: e1a00000 nop ; (mov r0, r0)
00001004 <_start>:
1004: e1a00000 nop ; (mov r0, r0)
00001008 <there>:
1008: e1a00000 nop ; (mov r0, r0)
Entry point address: 0x1000
And that may be confusing... but let's move on.
arm-linux-gnueabi-ld -nostdlib -nostartfiles -e _start -T so.ld so.o -o so.elf
Entry point address: 0x1004
Or instead
ENTRY(_start)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
...
Entry point address: 0x1004
But I can also do this:
.globl here
here:
nop
nop
.globl there
there:
nop
ENTRY(there)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
Entry point address: 0x1008
Note that the linker didn't warn about the missing _start.
If I now remove ENTRY() from the linker script:
Entry point address: 0x1000
But if I do this:
arm-none-eabi-ld so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
With no linker script given, ld uses its defaults, and those look for _start. We can do the same ourselves with
ENTRY(_start)
MEMORY
{
but with no global _start label defined:
arm-linux-gnueabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
So if you are simply doing
nasm stuff myprog.asm stuff myprog.o
ld myprog.o -o myprog
You are using whatever default linker settings/script the tool/environment provides, and it likely has an ENTRY(_start) or equivalent as the default. If you are in complete control of the linker and want to load a program into Linux, you need a safe/sane entry point for the program to work. Otherwise ld defaults to the beginning of the binary or the beginning of .text, which we can test:
SECTIONS
{
.text : { *(.text*) } > two
.data : { *(.data*) } > one
.bss : { *(.bss*) } > three
}
.globl here
here:
nop
.data
.word 0x12345678
arm-linux-gnueabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000002000
Disassembly of section .text:
00002000 <here>:
2000: e1a00000 nop ; (mov r0, r0)
Disassembly of section .data:
00001000 <.data>:
1000: 12345678
So the default is the beginning of .text, not the beginning (lowest address) of the binary. The entry point can even be set to a label in .data:
ENTRY(somedata)
MEMORY
{
one : ORIGIN = 0x1000, LENGTH = 0x1000
two : ORIGIN = 0x2000, LENGTH = 0x1000
three : ORIGIN = 0x3000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > two
.data : { *(.data*) } > one
.bss : { *(.bss*) } > three
}
.globl here
here:
nop
.data
.globl somedata
somedata: .word 0x12345678
Entry point address: 0x1000
This is as trivial to do with NASM and ld as it is with gas and ld, as demonstrated above. It shows that _start isn't actually magic, any more than main() is, with respect to ld (or even gcc). _start seems magic because default linker scripts call it out. main() is magic because the language defines it as such, but in reality it is the bootstrap that makes it so. If you simply run
gcc helloworld.c -o helloworld
you are getting the default bootstrap and linker script. But you could write your own bootstrap, or modify the one in your C library, and not have a main() in your program; the tools don't care and it will work fine. (Not all tools, of course: some do detect main() and add critical stuff that might not normally get added, especially for C++.) The GNU tools are particularly flexible and generic, which makes them usable for so many targets, from bare metal to kernel drivers to operating system applications.
Use the tools you have, they are very powerful, do experiments like the above first.
nasm: Is global _start: allowed?
I tried to find out why NASM might not reject that:
GLOBAL, like EXTERN, allows object formats to define private extensions by means of a colon. The elf object format, for example, lets you specify whether global data items are functions or data:
global hashlookup:function, hashtable:data
Like EXTERN, the primitive form of GLOBAL differs from the user-level form only in that it can take only one argument at a time.
-- http://www.nasm.us/doc/nasmdoc6.html
So it's just a quirk of the parser. It assembles, but I didn't check what it assembled to.
Do Not Do This. It's obviously wrong, and you shouldn't expect it to work. It only happens to work by luck. As always in these cases, stick to the normal syntax when there's nothing to be gained from doing otherwise. Even if you don't confuse the compiler/assembler, you will confuse other human readers.
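For contrast, here is a sketch of what the colon is actually meant for, based on the elf extension quoted from the NASM manual above. It assumes 32-bit x86 Linux and ELF output; the symbol name counter is invented for the example.

```nasm
; Assemble: nasm -f elf32 typed.asm -o typed.o   (typed.asm is a hypothetical name)
section .text
global _start:function      ; elf extension: export _start, typed as a function
_start:
    mov eax, 1              ; sys_exit
    xor ebx, ebx            ; exit status 0
    int 0x80

section .data
global counter:data         ; elf extension: export counter, typed as a data object
counter: dd 0
```

A bare colon with nothing after it, as in global _start:, supplies no extension at all, which is why it is better written as plain global _start.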
global main in Assembly
global main basically means that the symbol should be visible to the linker because other object files will use it. Without it, the symbol main is considered local to the object file it's assembled into, and will not be exported from that object file.
Return values in main vs _start
TL;DR: function return values and system-call arguments use separate registers because they're completely unrelated.
When you compile with gcc, it links CRT startup code that defines a _start. That _start (indirectly) calls main, and passes main's return value (which main leaves in EAX) to the exit() library function. (That function eventually makes an exit system call, after doing any necessary libc cleanup like flushing stdio buffers.)
See also Return vs Exit from main function in C. This is exactly analogous to what you're doing, except you're using _exit(), which bypasses libc cleanup, instead of exit(). See also Syscall implementation of exit().
An int $0x80 system call takes its first argument in EBX, as per the 32-bit system-call ABI (which you shouldn't be using in 64-bit code). It's not a return value from a function, it's the process exit status. See Hello, world in assembly language with Linux system calls? for more about system calls.
Note that _start is not a function; it can't return in that sense because there's no return address on the stack. You're taking a casual description like "return to the OS" and conflating that with a function's "return value". You can call exit from main if you want, but you can't ret from _start.
EAX is the return-value register for int-sized values in the function-calling convention. (The high 32 bits of RAX are ignored because main returns int. But also, the $? exit status can only get the low 8 bits of the value passed to exit().)
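A minimal sketch of the main side of this, assuming x86-64 Linux with gcc supplying the CRT startup code (the file name ret_main.asm and the value 7 are arbitrary choices for the demo):

```nasm
; ret_main.asm (hypothetical file name)
; Assemble: nasm -f elf64 ret_main.asm -o ret_main.o
; Link:     gcc ret_main.o -o ret_main      (gcc adds the CRT code that defines _start)
section .text
global main                 ; the CRT's _start eventually calls this symbol

main:
    mov eax, 7              ; int return value goes in EAX
    ret                     ; valid here: the CRT pushed a return address by calling main
```

The startup code passes that 7 to exit(), so it becomes the process exit status. The same ret placed at _start would have no return address to pop, which is why _start has to make an exit system call instead.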
Related:
- Why am I allowed to exit main using ret?
- What happens with the return value of main()?
- where goes the ret instruction of the main
- What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains why you should use syscall, and shows some of what happens inside the kernel after a system call.
Understanding assembly language _start label in a C program
Here is the well commented assembly source of the code you posted.
Summarized, it does the following things:
- Establish a sentinel stack frame with ebp = 0 so code that walks the stack can find its end easily.
- Pop the number of command line arguments into esi so we can pass it to __libc_start_main.
- Align the stack pointer to a multiple of 16 bytes in order to comply with the ABI. This is not guaranteed to be the case in some versions of Linux, so it has to be done manually just in case.
- The addresses of __libc_csu_fini, __libc_csu_init, the argument vector, the number of arguments and the address of main are pushed as arguments to __libc_start_main.
- __libc_start_main is called. This glibc function sets up some glibc-internal variables and eventually calls main. It never returns.
- If for any reason __libc_start_main should return, a hlt instruction is placed afterwards. This instruction is not allowed in user code and should cause the program to crash (hopefully).
- The final series of nop instructions is padding inserted by the assembler so the next function starts at a multiple of 16 bytes for better performance. It is never reached in normal execution.
Default value of _start
However, when I remove the line .globl _start ...
The .globl line means that the name _start is "visible" outside the file file.s. If you remove that line, the name _start is only for use inside the file file.s, and in a larger program (containing multiple files) you could even use the name _start in multiple files.
(This is similar to static variables in C/C++: if you generate assembler code from C or C++, the difference between real global variables and static variables is that there is a .globl line for the global variables and no .globl line for static variables. And if you are familiar with C, you know that static variables cannot be used in other files.)
The linker (ld) is also not able to use the name _start if it can be used inside the file only.
What does 0000000000400078 mean?
Obviously 0x400078 is the address of the first byte of your program. ld assumes that the program starts at the first byte if no symbol named _start is found.
... why is it even necessary to declare .globl _start?
It is not guaranteed that _start is located at the first byte of your program.
Counterexample:
.globl _start
write_stdout:
mov $4, %eax
mov $1, %ebx
int $0x80
ret
exit:
mov $1, %eax
mov $0, %ebx
int $0x80
jmp exit
_start:
mov $text, %ecx
mov $(textend-text), %edx
call write_stdout
mov $text2, %ecx
mov $(textend2-text2), %edx
call write_stdout
call exit
text:
.ascii "Hello\n"
textend:
text2:
.ascii "World\n"
textend2:
If you remove the .globl line, ld will not be able to find the _start: line and will assume that your program starts at the first byte, which is the write_stdout: line!
... and if you have multiple .s files in a larger program (or even a combination of .s, .c and .cc), you don't have control over which code is located at the first byte of your program!
Assembly, ld cannot find _start;
Hard to tell without seeing the whole code.
Here's a very small program that you can use as a skeleton:
section .text
    global _start       ;must be declared so the linker (ld) can see it
_start:                 ;default entry point the linker looks for
    mov edx, len        ;message length
    mov ecx, msg        ;message to write
    mov ebx, 1          ;fd to use is stdout
    mov eax, 4          ;sys_write
    int 0x80            ;call kernel
    xor ebx, ebx        ;exit status 0 (ebx still held the fd otherwise)
    mov eax, 1          ;sys_exit
    int 0x80            ;call kernel
section .data
    msg db 'Hello, stack overflow!',0xa ; say hello to the guys
    len equ $ - msg     ;length of the string
To assemble (into an object file), use nasm -g -f elf sample.s
To link, use ld -o sample sample.o (on a 64-bit system, add -m elf_i386, since -f elf produces a 32-bit object).