Strlen in Assembly

strlen in assembly

You are not accessing bytes (characters), but doublewords. So your code is not looking for a single terminating zero; it is looking for 4 consecutive zeroes. Note that it won't always return the correct value + 4: the result depends on what the memory after your string contains.

To fix, you should use byte accesses, for example by changing edx to dl.
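To see why the access width matters, here is a rough C sketch. It is not the asker's code: the function names and the bounded-buffer interface are invented for illustration. It mimics a loop that advances one byte at a time but tests either a 4-byte load or a 1-byte load against zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: like comparing a 32-bit load (edx) against 0.
   buflen bounds the scan so this sketch never reads outside the array.
   Returns buflen if no all-zero doubleword is found. */
size_t scan_for_zero_dword(const unsigned char *buf, size_t buflen)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= buflen; ++i) {
        uint32_t v;
        memcpy(&v, buf + i, sizeof v);   /* 4-byte load */
        if (v == 0)
            return i;
    }
    return buflen;
}

/* Hypothetical helper: like comparing an 8-bit load (dl) against 0. */
size_t scan_for_zero_byte(const unsigned char *buf, size_t buflen)
{
    assert(buf != NULL);
    for (size_t i = 0; i < buflen; ++i)
        if (buf[i] == 0)
            return i;
    return buflen;
}
```

For the 10 bytes 'a' 'b' 'c' 00 'X' 'Y' 00 00 00 00, the byte scan stops at index 3, while the doubleword scan only stops at index 6, where four zero bytes happen to line up.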

How to traverse a string in assembly until I reach null? (strlen loop)

You just inc %rbx to increment the pointer value. (%rbx) dereferences that register, using its value as a memory address. On x86, every byte has its own address (this property is called "byte addressable"), and addresses are just integers that fit in a register.

The characters in an ASCII string are all 1 byte wide so incrementing a pointer by 1 moves to the next character in an ASCII string. (This isn't true in the general case of UTF-8 with characters outside the 1..127 range of codepoints, but ASCII is a subset of UTF-8.)

Terminology: ASCII code 0 is called NUL (one L), not NULL. In C, NULL is a pointer concept. C-style implicit-length strings can be described as 0-terminated or NUL-terminated, but "null-terminated" is misusing the terminology.

You should pick a different register (one that's call-clobbered) so you don't need to push/pop it around your function. Your code doesn't make any function calls, so there's no need to keep your induction variable in a call-preserved register.

I didn't find a good simple example in other SO Q&As. They either have 2 branches inside the loop (including one unconditional jmp) like the one I linked in comments, or they waste instructions incrementing a pointer and a counter. Using an indexed addressing mode inside the loop is not terrible, but is less efficient on some CPUs so I'd still recommend doing a pointer increment -> subtract end-start after the loop.

This is how I'd write a minimal strlen that only checks 1 byte at a time (slow and simple). I kept the loop itself small, and this is IMO a reasonable example of a good way of writing loops in general. Often keeping your code compact makes it easier to understand a function in asm. (Give it a different name than strlen so you can test it without needing gcc -fno-builtin-strlen or whatever.)

.globl simple_strlen
simple_strlen:
    lea     -1(%rdi), %rax     # p = start-1 to counteract the first inc
.Lloop:                        # do {
    inc     %rax               #   ++p
    cmpb    $0, (%rax)
    jne     .Lloop             # } while (*p != 0);
                               # RAX points at the terminating 0 byte = one-past-end of the real data
    sub     %rdi, %rax         # return length = end - start
    ret

The return value of strlen is the array index of the 0 byte = length of the data not including the terminator.
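For comparison, the same increment-a-pointer-then-subtract idea can be sketched in C. The name simple_strlen_c is made up for this example, and the loop is written as a plain while loop because forming start-1 is not valid C pointer arithmetic, even though it is harmless in asm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of the asm above: walk a pointer to the
   terminating 0 byte, then return end - start. */
size_t simple_strlen_c(const char *start)
{
    assert(start != NULL);
    const char *p = start;
    while (*p != 0)        /* stop at the terminating 0 byte */
        ++p;
    return (size_t)(p - start);
}
```

The loop only touches the pointer; the length falls out of a single subtraction at the end, so no separate counter is needed.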

If you were inlining this manually (because it's just a 3-instruction loop), you'd often just want a pointer to the 0 terminator so you wouldn't bother with the sub crap, just use RAX at the end of the loop.

Avoiding the offsetting LEA/INC instructions before the first load (which cost 2 cycles of latency before the first cmp) could be done by peeling the first iteration, or with a jmp to enter the loop at the cmp/jne, after the inc. See also: Why are loops always compiled into "do...while" style (tail jump)?

Incrementing the pointer with LEA between cmp/jcc (like cmp ; lea 1(%rax), %rax ; jne) could be worse because it defeats macro-fusion of cmp/jcc into a single uop. (Actually, macro-fusion of cmp $imm, (%reg) / jcc doesn't happen on Intel CPUs like Skylake anyway. cmp micro-fuses the memory operand, though. Maybe AMD fuses the cmp/jcc.) Also, you'd leave the loop with RAX 1 higher than you want.

So it would be just as efficient (on Intel Sandybridge-family) to movzx (aka movzbl) load and zero-extend a byte into %ecx, and test %ecx, %ecx / jnz as the loop condition. But larger code-size.

Most CPUs will run my loop at 1 iteration per clock cycle. We could maybe get close to 2 bytes per cycle (while still only checking each byte separately) with some loop unrolling.

Checking 1 byte at a time is about 16x slower for large strings than we could go with SSE2. If you aren't aiming for minimal code size and simplicity, see "Why is this code 6.5x slower with optimizations enabled?" for a simple SSE2 strlen that uses an XMM register. SSE2 is baseline for x86-64 so you should always use it when it gives a speedup, for stuff that's worth writing by hand in asm.

Re: your updated question with a buggy port of the implementation from "Why does rax and rdi work the same in this situation?"

RDI and RBX both hold pointers. Adding them together doesn't make a valid address! In the code you were trying to port, RCX (the index) is initialized to zero before the loop. But instead of xor %ebx, %ebx, you did mov %rdi, %rbx. Use a debugger to examine register values while you single-step your code.

strlen in NASM Linux

I believe you are trying to make a 32-bit program for Linux, but your examples are 16-bit.

In 32-bit Linux, all pointers are 32 bits wide. So, use the extended registers: esi, edi, ebx, etc. You can still use 8- and 16-bit registers for arithmetic and data processing, but not as memory pointers.
In strlen you have to compare a byte, not a word: cmp byte [esi+ebx], 0.
Don't set the segment registers in Linux. The OS sets them up and you shouldn't touch them. In Linux all memory is one flat area, so you don't need segment registers anymore.

How does this assembly function return a value?

how does call 1070 strlen@plt return a value?

strlen puts its result in the rax register, which conveniently is also where your length() function should put its return value.

Under optimization your length() could be compiled into a single instruction: jmp strlen -- the parameter is already in rdi, and the return value will be in rax.
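As an illustration, a wrapper of roughly this shape is the kind of function being discussed. The asker's actual source isn't shown here, so the signature below is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Assumed shape of the asker's function: it adds nothing on top of strlen.
   The argument arrives in rdi and is passed through untouched; strlen's
   result in rax becomes this function's return value. */
size_t length(const char *s)
{
    return strlen(s);
}
```

Because the call is the last thing the function does and no cleanup is needed afterwards, an optimizing compiler is free to replace call + ret with a single jump to strlen.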

P.S.

Lastly here is the code at address 1070

That isn't the actual code of strlen. This is a "PLT jump stub". To understand what that is, you could read this blog post.

Also, from that small address, you can see this is a PIE executable: those are just offsets from the image base address; the runtime address will be something like 0x55...

Getting length of string in NASM with strlen call

Judging from the error you get, it would seem you compiled the C file before the ASM file, not after as you described.

To complicate things, the resulting object files will have the same filename. Since you compiled the ASM file last, getLength.o is the compiled ASM file.

The result is that you're trying to link multiple functions named getLength (from the ASM file) and you don't have a main function to link at all.

You can fix it by using different names for the object files (e.g. length.o for the C file and getLength.o for the ASM file):

gcc -c length.c -o length.o -m32
nasm -f elf32 getLength.asm -o getLength.o
gcc length.o getLength.o -o length -m32

By the way, your getLength function appears to be incorrect:

You forgot to move the argument to strlen onto the stack. push eax before calling strlen.
You moved the return value from eax into edx after calling strlen. This shouldn't be necessary since eax will already have the correct value.
Because you need to push eax, you also need to restore the stack pointer after strlen returns. You can either use add esp, 4 or mov esp, ebp to accomplish this, but it must be done before you pop ebp.

writing a function in ARM assembly language which inserts a string into another string at a specific location

Caveat: I've been doing asm for 40+ years. I've looked at arm a bit, but not used it. However, I pulled the arm ABI document.

As the problem stated, a1-a4 are not preserved across a call, which matches the ABI. You saved your a1, but you did not save your a2 or a3.

strlen [or any other function] is permitted to use a1-a4 as scratch regs. So, for efficiency, my guess is that strlen [or malloc] is using a2-a4 as scratch and [from your perspective] corrupting some of the register values.

By the time you get to loop:, a2 is probably a bogus journey :-)

UPDATE

I started to clean up your asm. Style is 10x more important in asm than C. Every asm line should have a sidebar comment. And add a blank line here or there. Because you didn't post your updated code, I had to guess at the changes and after a bit, I realized you only had about 25% or so. Plus, I started to mess things up.

I split the problem into three parts:

- Code in C

- Take C code and generate arm pseudo code in C

- Code in asm

If you take a look at the C code and pseudo code, you'll notice that, any misuse of instructions aside, your logic was wrong (e.g. you needed two strlen calls before the malloc).

So, here is your assembler cleaned for style [not much new code]. Notice that I may have broken some of your existing logic, but my version may be easier on the eyes. I used tabs to separate things and got everything to line up. That can help. Also, the comments show intent or note limitations of instructions or architecture.

.global csinsert
csinsert:

    stmfd   sp!,{v1-v6,lr}              // preserve caller registers

    // preserve our arguments across calls
    mov     v1,a1
    mov     v2,a2
    mov     v3,a3

    // get length of destination string
    mov     a1,v1                       // set dest addr as strlen arg
    bl      strlen                      // call strlen

    add     a1,a1,#1                    // increment length
    mov     v4,a1                       // save it

    add     v3,v3                       // src = src + src (what???)
    mov     v5,v2                       // save it

    add     v3,v3                       // double the offset (what???)

    bl      malloc                      // get heap memory

    mov     v4,#0                       // set index for loop

loop:
    ldrb    v7,[v1],#1
    subs    v2,v2,#1
    add     v7,v7,a2
    strb    v7,[a1],#1
    bne     loop

    ldmfd   sp!,{v1-v6,pc} @std         // restore caller registers

    .end

First, you should prototype in real C:

// csinsert_real -- real C code
#include <stdlib.h>     // malloc
#include <string.h>     // strlen
char *
csinsert_real(char *s1,char *s2,int loc)
{
    int s1len;
    int s2len;
    char *bp;
    int chr;
    char *bf;

    s1len = strlen(s1);
    s2len = strlen(s2);

    bf = malloc(s1len + s2len + 1);
    bp = bf;

    // copy over s1 up to but not including the "insertion" point
    for (;  loc > 0;  --loc, ++s1, ++bp) {
        chr = *s1;
        if (chr == 0)
            break;
        *bp = chr;
    }

    // "insert" the s2 string
    for (chr = *s2++;  chr != 0;  chr = *s2++, ++bp)
        *bp = chr;

    // copy the remainder of s1 [if any]
    for (chr = *s1++;  chr != 0;  chr = *s1++, ++bp)
        *bp = chr;

    *bp = 0;

    return bf;
}

Then, you can [until you're comfortable with arm], prototype in C "pseudocode":

// csinsert_pseudo -- pseudo arm code
char *
csinsert_pseudo()
{

    // save caller registers

    v1 = a1;
    v2 = a2;
    v3 = a3;

    a1 = v1;
    strlen();
    v4 = a1;

    a1 = v2;
    strlen();

    a1 = a1 + v4 + 1;

    malloc();
    v5 = a1;

    // NOTE: load/store may only use r0-r7
    // and a1 is r0
#if 0
    r0 = a1;
#endif
    r1 = v1;
    r2 = v2;

    // copy over s1 up to but not including the "insertion" point
loop1:
    if (v3 == 0) goto eloop1;
    r3 = *r1;
    if (r3 == 0) goto eloop1;
    *r0 = r3;
    ++r0;
    ++r1;
    --v3;
    goto loop1;
eloop1:

    // "insert" the s2 string
loop2:
    r3 = *r2;
    if (r3 == 0) goto eloop2;
    *r0 = r3;
    ++r0;
    ++r2;
    goto loop2;
eloop2:

    // copy the remainder of s1 [if any]
loop3:
    r3 = *r1;
    if (r3 == 0) goto eloop3;
    *r0 = r3;
    ++r0;
    ++r1;
    goto loop3;
eloop3:

    *r0 = 0;

    a1 = v5;

    // restore caller registers
}

strlen in assembly, off by 1?

You are counting the terminating zero char. Either start with -1 or increment after the comparison.
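The asker's loop isn't reproduced here, so the following C sketch is only a guess at the pattern: one version bumps the counter before testing the byte (and so counts the terminator), the other tests first. Both function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy pattern: the counter is incremented before the byte is tested,
   so the terminating 0 is counted as well. */
size_t strlen_off_by_one(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    char c;
    do {
        c = s[n];
        ++n;               /* counts the byte even when it is the 0 */
    } while (c != 0);
    return n;
}

/* Fixed pattern: compare first, increment only for non-zero bytes. */
size_t strlen_fixed(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

For "abc" the first version returns 4 and the second returns 3. Starting the buggy counter at -1 instead of 0 would compensate in the same way.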

scanf a string and print strlen in assembly gas 64-bit

At a basic level, you are using .lcomm d2, 255 to allocate 255 bytes for the string data. One byte is 8 bits, and each bit is either 0 or 1, so the maximum value of one byte is 2⁸-1 = 255 when treated as an unsigned binary value. That is the way I most commonly think about bytes (as a number 0..255), but those 8 bits can also represent other values: sometimes they are treated as a signed 8-bit value (-128..+127), or individual bits are given specific meanings by the code accessing them. (This part of your code is good.)

Then you use scanf with the "%s\0\n" definition (it will assemble as the bytes '%', 's', 0, 10; I'm not sure what the 10 is good for after the zero terminator). I would use .asciz "%254s" instead, to prevent a malicious user from writing more than 255 bytes (254 characters plus the terminator) into the reserved d2 space. (Note that the GNU assembler directive is .asciz, with a z at the end, so it adds the zero byte on its own.)
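The effect of the field width can be checked from C. This is a sketch under a few assumptions: it uses sscanf instead of scanf so it can run without stdin (the format string works the same way), and the helper names and the D2_SIZE macro are invented here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define D2_SIZE 255   /* same size as the .lcomm d2, 255 reservation */

/* Hypothetical helper: parse one token from `input` into a D2_SIZE-byte
   buffer and return how many bytes were stored, not counting the 0. */
size_t read_token_len(const char *input)
{
    assert(input != NULL);
    char d2[D2_SIZE];
    d2[0] = '\0';
    if (sscanf(input, "%254s", d2) != 1)   /* at most 254 chars + terminator */
        return 0;
    return strlen(d2);
}

/* Hypothetical helper: feed 300 'A' characters through read_token_len. */
size_t overlong_input_len(void)
{
    char input[301];
    memset(input, 'A', 300);
    input[300] = '\0';
    return read_token_len(input);
}
```

With the width in place, a 300-character token is cut off after 254 bytes, so the terminator still fits in the 255-byte buffer. Note also that %s stops at the first whitespace character.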

Then you use printf. It is better to provide a separate format string for output, this time something like formatOut: .asciz "%s\n".

Finally you want strlen.

Which brings me back to the input. If you are running a normal 64-bit OS (Linux), your input string is very likely UTF-8 encoded (unless your OS is set to some other specific locale, in which case I'm not sure which locale scanf will pick up).

UTF-8 is a variable-length encoding, so you should decide whether your strlen will return the number of characters or the number of bytes occupied.

For simplicity I will assume the number of bytes (not chars) is enough for you. If your input strings consist only of basic 7-bit ASCII characters ([0-9A-Za-z !@#$%^&*,.;'\<>?:"|{}] etc.; check any ASCII table; no accented chars like á, which would produce a multi-byte UTF-8 code), then the number of bytes will also equal the number of characters (UTF-8 is backward compatible with 7-bit ASCII).
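To make the bytes-versus-characters distinction concrete, here is a small C sketch that counts code points instead of bytes. It assumes the input is valid UTF-8, and the function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a valid UTF-8 string by
   skipping continuation bytes, which always look like 10xxxxxx. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;           /* lead byte or plain ASCII byte */
    return n;
}
```

For pure ASCII input this returns the same number as a byte count. For "á" (bytes C3 A1) a byte count gives 2, while this function gives 1.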

That means, for example, that for the string "Hell 1234" the memory at address d2 will contain these values (hexadecimal): 48 65 6C 6C 20 31 32 33 34 00. (Strictly speaking, scanf's %s stops at the first whitespace, so with that format only "Hell" and the terminator would be stored; the byte layout works the same way.) Once again, if you check an ASCII table, you will see that, for example, byte 0x20 is the space character. The string is "nul terminated": the last value, zero, is part of the string but is not displayed; instead it is used by various C functions as an "end of string" marker.
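If you want to check such a byte listing yourself, a short C helper can print the bytes of a string, terminator included. This is only a sketch: the function name and the fixed-size static buffer are choices made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the bytes of `s`, including the terminating
   0, as space-separated uppercase hex. Returns a pointer to a static
   buffer, so the result is overwritten by the next call. */
const char *hex_bytes(const char *s)
{
    assert(s != NULL);
    static char out[3 * 64];
    size_t n = strlen(s) + 1;          /* +1 to include the 0 byte */
    size_t pos = 0;

    out[0] = '\0';
    for (size_t i = 0; i < n && pos + 3 < sizeof out; ++i) {
        pos += (size_t)snprintf(out + pos, sizeof out - pos, "%s%02X",
                                i ? " " : "", (unsigned char)s[i]);
    }
    return out;
}
```

Calling hex_bytes("Hell 1234") yields "48 65 6C 6C 20 31 32 33 34 00", matching the listing above. Strings longer than the buffer allows are silently truncated.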

So what you want to do in strlen is load some register, let's say rdi, with the address of d2. Then scan byte by byte (byte, because ASCII encoding works in a "1 char = 1 byte" way, and we will ignore UTF-8 variable-length codes) until you reach a zero value in memory, counting how many bytes it took to reach it. If you think about how to make this "short" for the CPU and use SCASB for scanning (you can also write it "manually" with ordinary mov/cmp/inc/jne/jnz if you wish), you may end up with this:

rdi = d2 address
rdx = rdi  ; (copy of d2 address)
ecx = 255  ; maximum length of string
al  = 0    ; value to test against
repne scasb  ; repeat SCASB instruction until zero is found
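; here rdi points one byte past the zero byte (SCASB advances rdi after each compare)
; (or it's d2+255 if the zero terminator is missing)
rdi -= rdx ; rdi = number of bytes scanned, including the terminator
rdi -= 1   ; rdi = length of string (skip this if no terminator was found)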
; return result as you wish
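The scan can be modelled in C to check the arithmetic. This is a sketch, not a translation the answer provides: the function name is invented, and the variables only play the roles of the registers. The detail it models is that SCASB compares the byte and then advances rdi whether or not it matched, so after a successful scan the pointer is one byte past the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of: al = 0, rcx = max, repne scasb.
   Returns the string length, or max if no 0 byte is found in max bytes. */
size_t scasb_style_strlen(const char *d2, size_t max)
{
    assert(d2 != NULL);
    const char *p = d2;      /* plays the role of rdi */
    size_t count = max;      /* plays the role of rcx */
    int found = 0;           /* plays the role of ZF  */

    while (count != 0 && !found) {
        --count;
        found = (*p == 0);   /* compare the byte against al = 0 ... */
        ++p;                 /* ... then advance, match or not */
    }

    size_t scanned = (size_t)(p - d2);
    /* On a match the terminator itself was scanned too, so drop it. */
    return found ? scanned - 1 : scanned;
}
```

For "Hell 1234" with max = 255, ten bytes are scanned and the result is 9. If the limit is reached first, the result is simply the limit.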

So you first need to correctly understand what values you are manipulating, where they are, what their bit/byte size is, and what structure they have.

Then you can write instructions which produce any reasonable calculation based on those data.

In your case the calculation is "length_of_string = number of non-zero bytes in 7b ASCII encoded string stored in memory at address d2" (I mean after successful scanf part of code).

Judging by your source, it looks to me like you don't understand what x86 CPU instructions do and are just copying them from examples. That will get you into trouble soon.

For example, cmp 0, %rcx is checking whether rcx (an 8-byte "wide" value) is equal to zero. And you loaded rcx with the value from rdx, which was something from the stack (maybe the d2 address), so rcx will never be zero.

And even if you would actually load the character values from memory into rcx, you would load 8 of them at the same time, so you would miss the 0 value as it would be only single byte inside some garbage, like 0xCCCCCCCC00343332 (I'm using 0xCC for the undefined memory after d2 buffer just for example, there may be any value).
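The value in that example can be reproduced with a few lines of C. The helper below is hypothetical; it assembles 8 bytes in little-endian order, which is what an 8-byte load produces on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: combine 8 bytes the way a little-endian 64-bit
   load would, with p[0] ending up in the lowest 8 bits. */
uint64_t load_le64(const unsigned char *p)
{
    assert(p != NULL);
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}
```

For the bytes 32 33 34 00 CC CC CC CC the result is 0xCCCCCCCC00343332. It is not zero, even though the fourth byte is the terminator, so comparing the whole 8-byte value against 0 never sees the end of the string.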

So that code doesn't make any sense. If you at least partially understand what CPU registers are and what instructions like mov/inc/cmp/... do, then you have some chance of producing working code simply by using a debugger a lot: verify after almost every 1-2 new instructions that the code manipulates the correct values, and fix them until you get it right.

That requires you to have a clear idea of what the "correct behaviour" is first (in this case: fetching byte-by-byte values from the d2 address, one after another, incrementing a "length" counter, and looking for a zero byte), so you can tell whether the code does what you need or not.

What I wanted to point out with this answer is that the instructions themselves, while important, matter less than your vision of the data, structures, and algorithm used. Your question sounds like you have no idea what a "C string" is in x86 assembly, or which algorithm to use. That makes it impossible to just "guess" some instructions into the source and then verify whether you guessed right, because you can't tell what you want the code to do. That's why I said you should also check non-gas x86 assembly resources for the very basics (what a bit, a byte, computer memory, etc. are) until you somewhat understand which numeric values are manipulated, for example, to create "strings".

Once you have a good idea of what the code should do, it will be easy to catch things like swapped arguments in the debugger (for example: movq %rcx, d2. Why do you put 8 bytes from rcx into memory at address d2? That will overwrite the input string). So you don't actually need to understand the instructions and gas syntax 100%, just well enough to produce something and then "fix" it over several iterations: for instance, checking the register and memory view, realizing that rcx didn't change but the string data were damaged, and then trying it the other way.

Oh, and I completely forgot... you need to find documentation for your 64b platform ABI, so you know what is the correct way to pass arguments to C functions.

For example in linux these tutorials may help:
http://cs.lmu.edu/~ray/notes/gasexamples/

And search here for word "ABI" for further resources:
https://stackoverflow.com/tags/x86/info