Can _Start Be the Thumb Function

ARM-C Inter-working

I assume you mean interworking, not internetworking? The LPC1769 is a cortex-m3, which is thumb/thumb2 only; it doesn't support arm instructions, so there is no interworking available on that platform. Nevertheless, let's play with the compiler to see what goes on:

Get the compiler to do it for you first, then try it yourself in asm...

start.s

.thumb
.globl _start
_start:
    ldr r0,=hello
    mov lr,pc
    bx r0
hang : b hang

hello.c

extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
    return(two(h)+7);
}

two.c

unsigned int two ( unsigned int t )
{
    return(t+5);
}

Makefile

hello.list : start.s hello.c two.c
    arm-none-eabi-as -mthumb start.s -o start.o
    arm-none-eabi-gcc -c -O2 hello.c -o hello.o
    arm-none-eabi-gcc -c -O2 -mthumb two.c -o two.o
    arm-none-eabi-ld -Ttext=0x1000 start.o hello.o two.o -o hello.elf
    arm-none-eabi-objdump -D hello.elf > hello.list

clean :
    rm -f *.o
    rm -f *.elf
    rm -f *.list

produces hello.list

Disassembly of section .text:

00001000 <_start>:
    1000:   4801        ldr r0, [pc, #4]    ; (1008 <hang+0x2>)
    1002:   46fe        mov lr, pc
    1004:   4700        bx  r0

00001006 <hang>:
    1006:   e7fe        b.n 1006 <hang>
    1008:   0000100c    andeq   r1, r0, ip

0000100c <hello>:
    100c:   e92d4008    push    {r3, lr}
    1010:   eb000004    bl  1028 <__two_from_arm>
    1014:   e8bd4008    pop {r3, lr}
    1018:   e2800007    add r0, r0, #7
    101c:   e12fff1e    bx  lr

00001020 <two>:
    1020:   3005        adds    r0, #5
    1022:   4770        bx  lr
    1024:   0000        movs    r0, r0
    ...

00001028 <__two_from_arm>:
    1028:   e59fc000    ldr ip, [pc]    ; 1030 <__two_from_arm+0x8>
    102c:   e12fff1c    bx  ip
    1030:   00001021    andeq   r1, r0, r1, lsr #32
    1034:   00000000    andeq   r0, r0, r0

hello.o disassembled by itself:

00000000 <hello>:
   0:   e92d4008    push    {r3, lr}
   4:   ebfffffe    bl  0 <two>
   8:   e8bd4008    pop {r3, lr}
   c:   e2800007    add r0, r0, #7
  10:   e12fff1e    bx  lr

The compiler uses a plain bl, assuming/hoping it will be calling arm from arm. But it isn't, so the linker put a trampoline in there.

0000100c <hello>:
    100c:   e92d4008    push    {r3, lr}
    1010:   eb000004    bl  1028 <__two_from_arm>
    1014:   e8bd4008    pop {r3, lr}
    1018:   e2800007    add r0, r0, #7
    101c:   e12fff1e    bx  lr


00001028 <__two_from_arm>:
    1028:   e59fc000    ldr ip, [pc]    ; 1030 <__two_from_arm+0x8>
    102c:   e12fff1c    bx  ip
    1030:   00001021    andeq   r1, r0, r1, lsr #32
    1034:   00000000    andeq   r0, r0, r0

The bl to __two_from_arm is an arm mode to arm mode branch link. The trampoline loads the address of the destination function (two) with the lsbit set, which tells bx to switch to thumb mode, into the disposable register ip (r12), and then the bx ip switches modes. The branch link had already set up the return address in lr, which is an arm mode address (lsbit zero).
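As a rough sketch of the rule described above, here is how bx treats the value in its register. The names bx_result and bx_decode are made up for illustration, and this models only the simple ARMv4T-style behavior, not every corner case in the architecture manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Where execution continues after a bx, and in which state. */
typedef struct {
    uint32_t pc;     /* address actually loaded into the pc */
    bool     thumb;  /* true: Thumb state, false: ARM state */
} bx_result;

/* Hypothetical model of bx/blx with a register operand. */
bx_result bx_decode(uint32_t reg_value)
{
    bx_result r;
    r.thumb = (reg_value & 1u) != 0;      /* bit 0 selects the state */
    if (r.thumb) {
        r.pc = reg_value & ~1u;           /* bit 0 is stripped, pc stays even */
    } else {
        /* An ARM target must be word aligned; this sketch rejects anything else. */
        assert((reg_value & 2u) == 0);
        r.pc = reg_value;
    }
    return r;
}
```

Feeding it the literal from the trampoline, bx_decode(0x00001021), gives thumb state at 0x1020, which is the two() function. An lr value such as 0x1014 comes back as arm state at the same address.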

00001020 <two>:
    1020:   3005        adds    r0, #5
    1022:   4770        bx  lr
    1024:   0000        movs    r0, r0

The two() function does its thing and returns. Note that you have to use bx lr, not mov pc,lr, when interworking. Basically, mov pc,lr is only an okay habit if you are running an ARMv4 without the T or an ARMv5 without the T. On anything ARMv4T or newer (ARMv5T or newer), use bx lr to return from a function unless you have a special reason not to. (Avoid using pop {pc} as well, for the same reason, unless you really need to save that instruction and are not interworking.) Now, being on a cortex-m3, which is thumb+thumb2 only, you can't interwork, so you can use mov pc,lr and pop {pc}, but the code is not portable, and it is not a good habit, as it will bite you when you switch back to arm programming.

So, since hello was in arm mode when it used bl, which is what set the link register, and the bx in __two_from_arm does not touch the link register, when two() returns with a bx lr it returns in arm mode to the instruction after the bl __two_from_arm in the hello() function.

Also note the extra 0x0000 after the thumb function. This is padding to get back onto a word boundary so that the arm code that follows is aligned.

To see how the tools handle thumb to arm, change two as follows:

unsigned int three ( unsigned int );
unsigned int two ( unsigned int t )
{
    return(three(t)+5);
}

and put the three() function in hello.c:

extern unsigned int two ( unsigned int );
unsigned int hello ( unsigned int h )
{
    return(two(h)+7);
}

unsigned int three ( unsigned int t )
{
    return(t+3);
}

and now we get another trampoline

00001028 <two>:
    1028:   b508        push    {r3, lr}
    102a:   f000 f80b   bl  1044 <__three_from_thumb>
    102e:   3005        adds    r0, #5
    1030:   bc08        pop {r3}
    1032:   bc02        pop {r1}
    1034:   4708        bx  r1
    1036:   46c0        nop         ; (mov r8, r8)
...
00001044 <__three_from_thumb>:
    1044:   4778        bx  pc
    1046:   46c0        nop         ; (mov r8, r8)
    1048:   eafffff4    b   1020 <three>
    104c:   00000000    andeq   r0, r0, r0

Now this is a very cool trampoline. The bl to __three_from_thumb is in thumb mode, and the link register is set to return to the two() function with the lsbit set, to indicate a return to thumb mode.

The trampoline starts with a bx pc. Reading pc gives the address two instructions ahead, and the pc internally always has the lsbit clear, so a bx pc will always take you to arm mode if you are not already there, and in either mode it lands two instructions ahead. Two instructions ahead of the bx pc is an arm instruction that branches (not branch link!) to the three function, completing the trampoline.
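The arithmetic behind that trampoline can be checked by hand. The sketch below is only an illustration (thumb_pc_read and arm_branch_target are names I made up): it computes what pc reads as in thumb mode and decodes the target of an arm b/bl from its 24-bit word offset.

```c
#include <assert.h>
#include <stdint.h>

/* In Thumb state, reading pc yields the instruction address plus 4. */
uint32_t thumb_pc_read(uint32_t insn_addr)
{
    return insn_addr + 4u;
}

/* Target of an ARM B or BL: a signed 24-bit word offset added to the
 * instruction address plus 8. */
uint32_t arm_branch_target(uint32_t insn_addr, uint32_t insn)
{
    assert((insn_addr & 3u) == 0);                 /* ARM code is word aligned */
    assert((insn & 0x0E000000u) == 0x0A000000u);   /* must be a B/BL encoding */

    int32_t imm24 = (int32_t)(insn & 0x00FFFFFFu);
    if (imm24 & 0x00800000) {
        imm24 -= 0x01000000;                       /* sign-extend 24 bits */
    }
    return insn_addr + 8u + (uint32_t)(imm24 * 4);
}
```

With the listing above, the bx pc at 0x1044 reads pc as 0x1048, which is where the arm instruction sits, and eafffff4 at 0x1048 decodes to a branch to 0x1020, the three() function.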

Notice how I wrote the call to hello() in the first place

_start:
        ldr r0,=hello
        mov lr,pc
        bx r0
    hang : b hang

That actually won't work, will it? It will get you from arm to thumb but not from thumb to arm. I will leave that as an exercise for the reader.

If you change start.s to this

.thumb

.globl _start
_start:
    bl hello
hang : b hang

the linker takes care of us:

00001000 <_start>:
    1000:   f000 f820   bl  1044 <__hello_from_thumb>

00001004 <hang>:
    1004:   e7fe        b.n 1004 <hang>
    ...

00001044 <__hello_from_thumb>:
    1044:   4778        bx  pc
    1046:   46c0        nop         ; (mov r8, r8)
    1048:   eaffffee    b   1008 <hello>

I would, and do, always disassemble programs like these to make sure the compiler and linker resolved these issues. Also note that __hello_from_thumb, for example, can be used from any thumb function. If I call hello from several places, some arm, some thumb, and hello was compiled for arm, then the arm calls would call hello directly (if they can reach it) and all the thumb calls would share the same __hello_from_thumb (if they can reach it).

The compiler in these examples was assuming code that stays in the same mode (simple branch link) and the linker added the interworking code...

If you really meant inter-networking and not interworking, then please describe what that is and I will delete this answer.

EDIT:

You were using a register to preserve lr during the call to Double. That will not work in general: any register you pick either may be clobbered by the callee or has to be preserved for your own caller, so you need to use memory, and the easiest is the stack. See how the compiler does it:

00001008 <hello>:
    1008:   e92d4008    push    {r3, lr}
    100c:   eb000009    bl  1038 <__two_from_arm>
    1010:   e8bd4008    pop {r3, lr}
    1014:   e2800007    add r0, r0, #7
    1018:   e12fff1e    bx  lr

r3 is pushed as a dummy, likely to keep the stack aligned on a 64 bit boundary, which the ARM EABI expects at function calls and which can also make accesses faster. The thing to notice is that the link register is preserved on the stack, but the pop does not pop to pc, because this is not a plain ARMv4 build, so a bx is needed to return from the function. Because this is arm mode we can pop to lr and simply bx lr.

For thumb you can only push r0-r7 and lr directly, and pop r0-r7 and pc directly. You don't want to pop to pc, because that only works if you are staying in the same mode (thumb or arm). This is fine for a cortex-m, or fine if you know what all of your callers are, but in general it is a bad idea. So:

00001024 <two>:
    1024:   b508        push    {r3, lr}
    1026:   f000 f811   bl  104c <__three_from_thumb>
    102a:   3005        adds    r0, #5
    102c:   bc08        pop {r3}
    102e:   bc02        pop {r1}
    1030:   4708        bx  r1

Same deal: r3 is used as a dummy register to keep the stack aligned for performance (I used the default build for gcc 4.8.0, which is likely a platform with a 64 bit axi bus; specifying the architecture might remove that extra register). Because we cannot pop to pc, there are two pops: one to get rid of the dummy value on the stack, and the other to put the return address in a register so that they can bx to it to return. I assume two pops were needed because r1 and r3 would be out of order for a single pop once r3 was chosen (they could have chosen r2 and saved an instruction).

Your Start function does not conform to the ABI, and as a result, when you mix it in with such large libraries as a printf call, no doubt you will crash. If you didn't, it was dumb luck. Your assembly listing of main shows that neither r4 nor r10 were used, and assuming main() is not called from anywhere other than the bootstrap, that is why you got away with using either r4 or r10.

If this really is an LPC1769 then this whole discussion is irrelevant, as it does not support ARM and does not support interworking (interworking = mixing of ARM mode code and thumb mode code). Your problem was unrelated to interworking; you are not interworking (note the pop {pc} at the end of the functions). Your problem was likely related to your assembly code.

EDIT2:

Changing the makefile to specify the cortex-m gives:

00001008 <hello>:
    1008:   b508        push    {r3, lr}
    100a:   f000 f805   bl  1018 <two>
    100e:   3007        adds    r0, #7
    1010:   bd08        pop {r3, pc}
    1012:   46c0        nop         ; (mov r8, r8)

00001014 <three>:
    1014:   3003        adds    r0, #3
    1016:   4770        bx  lr

00001018 <two>:
    1018:   b508        push    {r3, lr}
    101a:   f7ff fffb   bl  1014 <three>
    101e:   3005        adds    r0, #5
    1020:   bd08        pop {r3, pc}
    1022:   46c0        nop         ; (mov r8, r8)

First and foremost, it is all thumb since there is no arm mode on a cortex-m. Second, the bx is not needed for function returns (because there are no arm/thumb mode changes), so pop {pc} will work.

It is curious that the dummy register is still used on the push. I tried an arm7tdmi/armv4t build and it still did that, so it does not depend on the core; presumably there is some other flag to get rid of that behavior, or it is simply the ABI's stack alignment rule at work.

If your desire was to learn how to make an assembly function that you can call from C, you should have just done that. Make a C function that somewhat resembles the framework of the function you want to create in asm:

extern unsigned int Double ( unsigned int );
unsigned int Start ( void )
{
    return(Double(42));
}

compile, then disassemble:

00000000 <Start>:
   0:   b508        push    {r3, lr}
   2:   202a        movs    r0, #42 ; 0x2a
   4:   f7ff fffe   bl  0 <Double>
   8:   bd08        pop {r3, pc}
   a:   46c0        nop         ; (mov r8, r8)

and start with that as your assembly function.

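.thumb                @ assemble as thumb (needed unless -mthumb is given)
.globl Start
.thumb_func           @ mark Start as a thumb function (lsbit set in its address)
Start:
   push {lr}          @ save the return address (the compiler also pushed r3 as padding)
   mov  r0, #42       @ argument for Double
   bl   Double
   pop  {pc}          @ return; fine on a thumb-only core, not interworking-safe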

That, or read the arm abi for gcc and understand what registers you can and can't use without saving them on the stack, and what registers are used for passing parameters and returning results.

Does the .so file still contain information about label

TL;DR - your question is hard to answer, because it is mixing a few concepts. For typical assembler labels, we use PC relative addressing and the labels are resolved at assemble time. For other 'external' labels, there are many cases and the resolution depends on the case.

There are four conceptual ways to address on almost all CPUs, and definitely on the ARM.

1. PC relative address. Current instruction +/- offset.
2. Absolute address. This is the one you are conceptually thinking of.
3. Register computed address. Calculated at run time. ldr pc, [rn, #xx]
4. Table based addressing. Global offset table, etc. Much like register computed addresses. ldr pc, [Rbase, Rindex, lsl #2]

The first two fit in a single instruction and are very efficient. The first is most desirable as the code can execute at ANY address as long as it maintains its original layout (ie, you don't load it by splitting the code up).

In the list above, there is also the concept of 'build time' and 'run time'. The distinction is the difference between a linker and a loader. You have tagged this 'linux' and refer to an 'so' or shared library. Also, you are referring to assembler 'labels'. They are very similar concepts, but can be different as they will be one of the four classes of addressing above.

Most often in assembler, the labels are PC relative. There is no additional structure to be implemented with PC relative addressing, except to keep the chunk of code contiguous. In the case of assembler that is a 'module' (compilation unit, for a compiler), or that is processed by the assembler to produce an 'object', it will use PC relative addressing.

The object format can be annotated with external addresses, and there are many choices in how an assembler may output these addresses. They are generally controlled by 'pseudo-ops'. That is a note (separate section with defined format) in the object file; the instruction is semi-complete in this form. It may prepare to use an offset table, use a register based compute (like r9+constant), etc.

For the typical case of linking (done at build time), we will either use PC relative or absolute. If we fix our binary to only run at one address, the assembler can set up for absolute addressing and resolve these through linking. In this case, the binary must be loaded at a fixed address. The assembler 'modules' or object files can be completely glued together to have everything resolved. Then there are no 'load' time fix ups. Other determining factors are whether code/data are separate, and whether the system is using an MMU. It is often desirable to keep code constant, so that many processes can use the same RAM/ROM pages while each has separate data. As well as being memory efficient, this can provide some form of security: although it is not extremely robust, it will prevent accidental code overwrites and will provide debugging help in the form of SIGSEGV.

It is possible to write a PC-relative initialization routine which will do the fix-ups to create a table in your own binary. So a 'loader' is just to determine where you are running and then make calculations. For statically shared libraries, you typically know the libraries you will run, but not where they are. For dynamically shared libraries, you might not even know at compile time what the library is that you will run.
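To make the fix-up idea concrete, here is a minimal sketch in C. It assumes a hypothetical table whose entries were left by the linker as offsets from the start of the image; apply_fixups and its parameters are illustrative names, not any real loader's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fix-up pass: each entry holds an offset from the start of
 * the image; adding the run-time load base turns it into an absolute
 * address that table based addressing can then use. */
void apply_fixups(uintptr_t *table, size_t count, uintptr_t load_base)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        table[i] += load_base;
    }
}
```

A real loader works from relocation records in the ELF file rather than a bare array, but the calculation per entry is the same kind of "where am I running plus offset".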

A Linux distribution can use either. If you have some sort of standard Linux desktop distribution (Ubuntu/Debian, Redhat, etc.), you will have something based on ARM ELF LSB and dynamic shared libraries. You need to use the assembler pseudo ops to support this type of addressing or use a compiler to do it for you. The majority of all 'labels' in a shared library will be PC relative and not show up. Some labels can show up for debugging reasons (man strip) and some are absolutely needed to resolve addresses at run time.

Some time ago I also asked a question that I find related, Using GCC pre-processor as an assembler... The key concept is that the assembler is generally 'two pass' and needs to do these local address fix ups. This question then asks about a second level of that concept, where we are adding shared libraries. The online book Linkers and Loaders is a great resource if you want to know more.

When are GAS ELF the directives .type, .thumb, .size and .section needed?

I have been programming arm/thumb for many years, lots of it in assembler, and have needed very few of the many directives out there.

.thumb_func is quite important as pointed out by another responder.

for example


.globl _start
_start:
    b   reset

reset:

.arm

.globl one
one:
    add r0,r0,#1
    bx lr

.thumb

.globl two
two:
    add r0,r0,#2
    bx lr

.thumb_func
.globl three
three:
    add r0,r0,#3
    bx lr


.word two
.word three

.arm (it used to be something like .code32 or .code 32) tells the assembler this is arm code, not thumb code; for your cortex-m3 you won't need to use it.

.thumb likewise (it used to be .code 16, and maybe that still works) makes the following code thumb, not arm.

If the labels you are using are not global labels that you need to branch to from other files or indirectly, then you won't need the .thumb_func. But in order for the address of one of these global labels to be computed properly for a branch (lsbit is a 1 for thumb and 0 for arm), you want to mark it as a thumb or arm label, and .thumb_func does that. Otherwise you have to set that bit yourself before branching, which adds more code, and the label is not callable from C.



00000000 <_start>:
   0:   eaffffff    b   4 <one>

00000004 <one>:
   4:   e2800001    add r0, r0, #1
   8:   e12fff1e    bx  lr

0000000c <two>:
   c:   3002        adds    r0, #2
   e:   4770        bx  lr

00000010 <three>:
  10:   3003        adds    r0, #3
  12:   4770        bx  lr
  14:   0000000c    andeq   r0, r0, ip
  18:   00000011    andeq   r0, r0, r1, lsl r0

Up to the .thumb, the assembly is arm code, as desired.

Both the two and three labels/functions are thumb code as desired but the two label has an even numbered address and three has the proper odd numbered address.
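The two .word values in the dump follow a very simple rule, sketched here in C. symbol_value is an illustrative name, and this is only a model of what the tools do with a label's value, not their actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value the tools hand out when the address of a label is taken:
 * labels marked as thumb functions get bit 0 set, others do not. */
uint32_t symbol_value(uint32_t label_addr, bool marked_thumb_func)
{
    assert((label_addr & 1u) == 0);   /* instructions never sit at odd addresses */
    return marked_thumb_func ? (label_addr | 1u) : label_addr;
}
```

For the listing above, two (at 0xc, no .thumb_func) comes out as 0x0000000c and three (at 0x10, with .thumb_func) comes out as 0x00000011, matching the words at 0x14 and 0x18.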

The latest codesourcery tools were used to assemble, link, and dump the above sample.

Now for the cortex-m3 where everything is thumb(/thumb2) thumb_func may not be as important, it may just work with command line switches (very easy to do an experiment to find out). It is a good habit to have though in case you move away from a thumb only processor to a normal arm/thumb core.

Assemblers generally like to add all of these directives and other ways of making things look/feel more like a high level language. I am just saying you don't have to use them. I have switched assemblers for arm, and use many different assemblers for many different processors, and prefer the less is more approach, meaning focus on the assembly itself and use as few tool specific items as possible. I am usually the exception, not the rule, though, so you can probably figure out the more often used directives by looking at what directives the compiler output generates (and verify with documentation).


unsigned int one ( unsigned int x )
{
    return(x+1);
}


    .arch armv5te
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 2
    .eabi_attribute 30, 2
    .eabi_attribute 18, 4
    .file   "bob.c"
    .text
    .align  2
    .global one
    .type   one, %function
one:
    .fnstart
.LFB0:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    add r0, r0, #1
    bx  lr
    .fnend
    .size   one, .-one
    .ident  "GCC: (Sourcery G++ Lite 2010.09-50) 4.5.1"
    .section    .note.GNU-stack,"",%progbits

I do use the .align when mixing arm and thumb assembler, or data in with assembler. You would expect the assembler for such a platform to know something as obvious as thumb instructions being on halfword boundaries and arm instructions being aligned on word boundaries, but the tools are not always that smart. Sprinkling .aligns about won't hurt.

.text is the default so that is a bit redundant, but won't hurt. .text and .data are standard attributes (not specific to arm) if you are compiling for a combination of rom and ram on your target you may care (depends on what you do with your linker script), otherwise .text will work for everything.

.size is apparently the size of the function, from its start to that directive. The assembler cannot figure this out on its own, so if the size of this function is important for your code, linker script, debugger, loader, whatever, then this needs to be right; otherwise you don't have to bother. A function is a high level concept anyway; assembler doesn't really have functions, much less a need to declare their size. And the C compiler certainly doesn't care: it is only looking for a label to branch to and, in the case of the arm family, whether it is thumb code or arm code that is being branched to.

You may find the .pool directive (there is a newer equivalent) useful if you are lazy with your immediates (ldr rx,=0x12345678) on long stretches of code. Here again the tools are not always smart enough to place this data after an unconditional branch; you sometimes have to tell them. I say lazy half seriously: it is painful to do the label: .word thing all the time, and I believe both the arm and gcc tools allowed for that shortcut, so I use it as much as anyone else.

Also note llvm outputs an additional .eabi_attribute or two that is supported by code sourcery's version/mods to binutils but not supported (perhaps yet) by the gnu released binutils. Two solutions that work: modify llvm's asm print function to not write the eabi_attributes, or at least write them with a comment (@), or get the binutils source/mods from code sourcery and build binutils that way. code sourcery tends to lead gnu (thumb2 support for example) or perhaps backports new features, so I assume these llvm attributes will be present in the mainline binutils before long. I have suffered no ill effects by trimming the eabi_attributes off of the llvm compiled code.

Here is the llvm output for the same function above, apparently this is the llc that I modified to comment out the eabi_attributes.


    .syntax unified
@   .eabi_attribute 20, 1
@   .eabi_attribute 21, 1
@   .eabi_attribute 23, 3
@   .eabi_attribute 24, 1
@   .eabi_attribute 25, 1
@   .eabi_attribute 44, 1
    .file   "bob.bc"
    .text
    .globl  one
    .align  2
    .type   one,%function
one:                                    @ @one
@ BB#0:                                 @ %entry
    add r0, r0, #1
    bx  lr
.Ltmp0:
    .size   one, .Ltmp0-one

The elf file format is well documented and very easy to parse if you want to really see what the elf specific directives (if any) are doing. Many of these directives are to help the linker more than anything. .thumb_func, .text, .data for example.

Using B instructions in Cortex-M3 (thumb)

The target address needs to have the lsbit set to 1 for the bx (and blx) instruction; the 1 is stripped when it goes into the pc. The b instruction is pc relative, and the math shown in the arm docs makes it clear the target is always even.

Generally you don't have to worry about this at all if you let the tools do their job.

thumb.s

.thumb

.globl _start
_start:
    b reset
    nop
    nop
.thumb_func
reset:
    nop
    nop
    nop
    nop
    ldr r0,=reset
    bx r0

then

arm-none-eabi-as thumb.s -o thumb.o
arm-none-eabi-ld -Ttext=0x1000 thumb.o -o thumb.elf
arm-none-eabi-objdump -D thumb.elf

which gives

thumb.elf:     file format elf32-littlearm


Disassembly of section .text:

00001000 <_start>:
    1000:   e001        b.n 1006 <reset>
    1002:   46c0        nop         ; (mov r8, r8)
    1004:   46c0        nop         ; (mov r8, r8)

00001006 <reset>:
    1006:   46c0        nop         ; (mov r8, r8)
    1008:   46c0        nop         ; (mov r8, r8)
    100a:   46c0        nop         ; (mov r8, r8)
    100c:   46c0        nop         ; (mov r8, r8)
    100e:   4801        ldr r0, [pc, #4]    ; (1014 <reset+0xe>)
    1010:   4700        bx  r0
    1012:   10070000    andne   r0, r7, r0
    ...

the branch takes care of itself

    1000:   e001        b.n 1006 <reset>
...    
00001006 <reset>:

The offset encoded in a branch is in units of 16 bit quantities, not units of bytes; the processor multiplies it by 2 (shifts it) to get the byte offset, so the resulting address is always even. The pc is never odd; it is the value you feed bx or blx that is odd.
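As a quick check of that math, here is a small C sketch that decodes the 16 bit unconditional thumb branch. thumb_b_target is a name I made up; the only facts it relies on are the 11-bit signed halfword offset and the pc reading as the instruction address plus 4.

```c
#include <assert.h>
#include <stdint.h>

/* Target of the 16-bit Thumb unconditional branch (encoding 11100 imm11). */
uint32_t thumb_b_target(uint32_t insn_addr, uint16_t insn)
{
    assert((insn & 0xF800u) == 0xE000u);   /* must be the b encoding */

    int32_t imm11 = insn & 0x07FF;
    if (imm11 & 0x0400) {
        imm11 -= 0x0800;                   /* sign-extend 11 bits */
    }
    /* The offset counts halfwords and is relative to insn address + 4. */
    return insn_addr + 4u + (uint32_t)(imm11 * 2);
}
```

e001 at 0x1000 has an offset of 1 halfword, so the target is 0x1000 + 4 + 2 = 0x1006, which is reset. The familiar e7fe (hang: b hang) has an offset of -2 halfwords and lands back on itself.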

Now, because I used .thumb_func before reset, that told the assembler this is a thumb label, not an arm label. So when I said please load the address of reset into r0, the assembler allocated some data for the value 0x00001007, which shows up weird in the disassembly but it is there, and the lsbit has been set for us:

00001006 <reset>:
 ...
    100e:   4801        ldr r0, [pc, #4]    ; (1014 <reset+0xe>)
    1010:   4700        bx  r0
    1012:   10070000    andne   r0, r7, r0
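The reason the literal looks weird in the dump is that it actually lives at 0x1014, not 0x1012. A small sketch of the address calculation for the 16 bit pc relative load (thumb_ldr_literal_addr is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Address read by the 16-bit Thumb "ldr rd, [pc, #imm]"
 * (encoding 01001 rd imm8, offset in words). */
uint32_t thumb_ldr_literal_addr(uint32_t insn_addr, uint16_t insn)
{
    assert((insn & 0xF800u) == 0x4800u);       /* must be ldr (literal) */

    uint32_t imm8 = insn & 0x00FFu;
    uint32_t base = (insn_addr + 4u) & ~3u;    /* pc rounded down to a word */
    return base + imm8 * 4u;
}
```

For 4801 at 0x100e this gives (0x1012 rounded down to 0x1010) + 4 = 0x1014, as objdump's comment says. The halfword at 0x1012 is padding (0x0000), and the disassembler prints it together with the low half of the literal (0x1007) as the single word 0x10070000.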

now if you were to remove the .thumb_func

100c:   46c0        nop         ; (mov r8, r8)
100e:   4801        ldr r0, [pc, #4]    ; (1014 <reset+0xe>)
1010:   4700        bx  r0
1012:   10060000    andne   r0, r6, r0

The assembler thinks it is an arm address and does not set the lsbit, and this code would crash. Now, if you are concerned about it, you can always add the extra orr r0,#1, but that is really just a hack. Learn, for whichever assembler you are using, how to declare the label as a thumb label, not an arm one. Yes, it seems stupid that gnu assembler knows this code segment is thumb, because we told it so, yet it can't figure out that labels within thumb code are ... thumb labels. Very stupid tool.

And I would assume there are other, more verbose gnu assembler directives that will also allow you to declare this a function or a thumb label or whatever. And of course every assembler is different, so don't assume that gnu assembler directives work on other assemblers.

If you mix C and asm, the C compiler is not stupid: it knows that -mthumb makes all the functions and globals (labels) thumb, and depending on how and where you use them in the code, the linker places the correct value. It can even go so far as to correctly switch modes for you: bl main in thumb code where main is arm code, and it places a trampoline in the code for you that switches modes, or vice versa. At least I have seen the tools do this (and demonstrated it a number of times in stack overflow answers). I don't remember if it was tricky to get it to work or not. You should always disassemble periodically and ensure the linker is doing this for you; otherwise make it do it, or you can always fall back on doing it yourself.

Remember that only bx and blx need the lsbit set for thumb and the two lsbits clear for branching to arm. The blx and bx instructions remove that lsbit and leave an even address in the pc (very simple to check: do a mov r0,pc in thumb code and then look at it).

Ideally the unconditional and conditional branches (not bx) should never switch modes: they go arm to arm and thumb to thumb. Same for bl, but I have seen the gnu tools help out with that. If you want your code pure, then load the address into a register, which the tools have to get right (otherwise the whole toolchain is a fail), and use blx with that register instead of bl to the label, rather than relying on the toolchain to add a trampoline for you.

What is function in assembly?

ELF symbol metadata can be set by some assemblers, e.g. in NASM, global main:function to mark the symbol type as FUNC. (https://nasm.us/doc/nasmdoc8.html#section-8.9.5).

The GAS syntax equivalent (which C compilers emit) is .type main, @function (or %function on ARM, where @ starts a comment). E.g. put some code on https://godbolt.org and disable filtering to see asm directives in compiler output.

But note this is just metadata for linkers and debuggers to use; the CPU doesn't see that when executing. That's why nobody bothers with it for NASM examples.

Assembly language doesn't truly have functions, just the tools to implement that concept, e.g. jump and store a return address somewhere = call, indirect jump to a return address = ret. On x86, return addresses are pushed and popped on the stack.
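To illustrate, here is a toy model in C of what call and ret boil down to on a machine that keeps return addresses on a stack. The toy_cpu type and the function names are invented for this sketch; no real ISA is being emulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_SLOTS 16

/* Minimal architectural state: a program counter and a return-address stack. */
typedef struct {
    uint32_t pc;
    uint32_t stack[TOY_STACK_SLOTS];
    size_t   sp;                      /* number of slots in use */
} toy_cpu;

/* call = store the address of the next instruction somewhere, then jump. */
void toy_call(toy_cpu *cpu, uint32_t target, uint32_t insn_len)
{
    assert(cpu->sp < TOY_STACK_SLOTS);
    cpu->stack[cpu->sp++] = cpu->pc + insn_len;
    cpu->pc = target;
}

/* ret = indirect jump to whatever return address is on top of the stack. */
void toy_ret(toy_cpu *cpu)
{
    assert(cpu->sp > 0);
    cpu->pc = cpu->stack[--cpu->sp];
}
```

Nothing in this state records "which function we are in"; toy_ret simply jumps to whatever value is on top of the stack, whether or not a matching toy_call put it there.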

The model of execution is purely sequential and local, one instruction at a time (on most ISAs, but some ISAs are VLIW and execute 3 at a time for example, but still local in scope), with each instruction just making a well-defined change to the architectural state. The CPU itself doesn't know or care that it's "in a function" or anything about nesting, other than the return-address predictor stack which optimistically assumes that ret will actually use a return address pushed by a corresponding call. But that's a performance optimization; you do sometimes get mismatched call/ret if code is doing something weird (e.g. a context switch).

A C compiler won't put any instructions outside of functions.

(Technically the _start entry point that indirectly calls main isn't a function; it can't return and has to make an exit system call, but that's written in asm and is part of libc. It's not generated by the C compiler proper, only linked with the C compiler's output to make a working program.) See "Linux x86 Program Start Up or - How the heck do we get to main()?" for example.

How to get qemu to run an arm thumb binary?

There are a couple of concepts to disentangle here:

(1) Arm vs Thumb : these are two different instruction sets. Most CPUs support both, some support only one. Both are available simultaneously if the CPU supports both. To simplify a little bit, if you jump to an address with the least significant bit set that means "go to Thumb mode", and jumping to an address with that bit clear means "go to Arm mode". (Interworking is a touch more complicated than that, but that's a good initial mental model.) Note that all Arm instructions are 4 bytes long, but Thumb instructions can be either 2 or 4 by