How to Convert Linux 32-Bit Gcc Inline Assembly to 64-Bit Code

Inline 64bit Assembly in 32bit GCC C Program

No, this isn't possible. You can't run 64-bit assembly from a 32-bit binary, as the processor will not be executing in 64-bit mode while running your program.

Copying 64-bit code to an executable page will result in that code being interpreted incorrectly as 32-bit code, which will have unpredictable and undesirable results.

Some inline assembly doesn't work in 64 bit mode?

push and pop in 64-bit mode cannot take 32-bit operands, only 16-bit or 64-bit ones; it is not the case that pop only works on 32-bit registers. In general, a lot of code will work the same in 32-bit and 64-bit mode, but some little-used instructions (decimal math, and instructions that deal with segmentation) have been removed completely from 64-bit mode.
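
As a quick illustration (GAS syntax), the operand sizes that assemble in 64-bit mode:

pushq %rax        # 64-bit operand: valid, and the default
pushw %ax         # 16-bit operand: valid (0x66 prefix), but rarely useful
# pushl %eax      # 32-bit operand: not encodable in 64-bit mode -- assembler error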

Combining C and Assembly(32 bit code) on Linux 64 bit

Compiling / linking a 32-bit program on 64-bit Ubuntu needs gcc-multilib; try:

sudo apt-get install gcc-multilib libc6-i386 libc6-dev-i386

However, you may run into other problems when you try to link against other libraries.
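
As a minimal sketch (the file names are illustrative), with multilib installed the whole program has to be built 32-bit end to end:

$ gcc -m32 -c asm32.s -o asm32.o      # assemble the 32-bit assembly
$ gcc -m32 main.c asm32.o -o prog     # compile and link the C as 32-bit too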

You would have better luck using a 32-bit chroot environment (i.e. running a 32-bit root on your 64-bit Ubuntu).

Could anyone help me to read 64 bit from console in 32 bit RISC-V

Yeah, if you can't use the toy system calls, read a string and do total = total*10 + digit on it, where digit = c-'0'. You'll need to do extended-precision multiply, so it's probably easier to do extended-precision shifts like (total << 3) + (total << 1).

Check compiler output on Godbolt. For example, GCC uses shifts, while clang uses mul/mulhu (multiply high unsigned) for the lo * lo 32x32=>64-bit partial product, and a mul for the high-half cross product (hi * lo). It's fewer instructions, but it depends on a RISC-V CPU with a fast multiplier to be faster than shift/or.

(RISC-V extended-precision addition is inconvenient since it doesn't have a carry flag, you need to emulate carry-out as unsigned sum = a+b; carry = sum<a;)
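
In C, that carry emulation for a 64-bit add split across 32-bit halves looks like this (a sketch; the names are mine, not from the compiler output below):

#include <stdint.h>

// 64-bit add from 32-bit halves, emulating the missing carry flag
void add64(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo,
           uint32_t *out_hi, uint32_t *out_lo)
{
    uint32_t sum = a_lo + b_lo;     // may wrap around
    uint32_t carry = sum < a_lo;    // unsigned wrap => the add carried out
    *out_lo = sum;
    *out_hi = a_hi + b_hi + carry;  // propagate the carry into the high half
}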

#include <stdint.h>

uint64_t strtou64(unsigned char *p){
    uint64_t total = 0;
    unsigned digit = *p - '0';   // peeling the first iteration is usually good in asm
    while (digit < 10) {         // loop until any non-digit character
        total = total*10 + digit;
        p++;                     // *p was checked before the loop or last iteration
        digit = *p - '0';        // get a digit ready for the loop branch
    }
    return total;
}

Clang's output is shorter, so I'll show it. It of course follows the standard calling convention, taking the pointer in a0, and returning a 64-bit integer in a pair of registers, a1:a0:

# rv32gc clang 14.0  -O3
strtou64:
        mv      a2, a0
        lbu     a0, 0(a0)        # load the first char
        addi    a3, a0, -48      # *p - '0'
        li      a0, 9
        bltu    a0, a3, .LBB0_4  # return 0 if the first char is a non-digit
        li      a0, 0            # total in a1:a0 = 0 ; should have done these before the branch
        li      a1, 0            #  so a separate ret wouldn't be needed
        addi    a2, a2, 1        # p++
        li      a6, 10           # multiplier constant
.LBB0_2:                         # do{
        mulhu   a5, a0, a6       # high half of (lo(total) * 10)
        mul     a1, a1, a6       # hi(total) * 10
        add     a1, a1, a5       # add the high-half partial products
        mul     a5, a0, a6       # low half of (lo(total) * 10)
        lbu     a4, 0(a2)        # load *p
        add     a0, a5, a3       # lo(total) = lo(total*10) + digit
        sltu    a3, a0, a5       # carry-out from that
        add     a1, a1, a3       # propagate carry into hi(total)
        addi    a3, a4, -48      # digit = *p - '0'
        addi    a2, a2, 1        # p++ done after the load; clang peeled one pointer increment before the loop
        bltu    a3, a6, .LBB0_2  # }while(digit < 10)
        ret
.LBB0_4:
        li      a0, 0            # return 0 special case
        li      a1, 0            # because clang was dumb and didn't load these regs before branching
        ret

If you want to go with GCC's shift/or strategy, it should be straightforward to see how that slots into the same logic clang is using. You can look at compiler output for a function like return u64 << 3 to see which instructions are part of that.
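
As a sketch of what that lowering does over the register pair (my own names; note 29 = 32 - 3):

#include <stdint.h>

// rv32 lowering of a 64-bit "u64 << 3", with u64 held as hi:lo
void shl3(uint32_t hi, uint32_t lo, uint32_t *out_hi, uint32_t *out_lo)
{
    *out_hi = (hi << 3) | (lo >> 29);  // OR in the bits crossing the 32-bit boundary
    *out_lo = lo << 3;
}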

And BTW, I wrote the C with compiling to decent asm in mind, making it easy for the compiler to transform it into a do{}while loop with the condition at the bottom. I based it on the x86 asm in my answer on NASM Assembly convert input to integer?

Running 32 bit assembly code on a 64 bit Linux & 64 bit Processor : Explain the anomaly

Remember that everything by default on a 64-bit OS tends to assume 64-bit. You need to make sure that you are (a) using the 32-bit versions of your #includes where appropriate, (b) linking with 32-bit libraries, and (c) building a 32-bit executable. It would probably help if you showed the contents of your makefile if you have one, or else the commands that you are using to build this example.

FWIW I changed your code slightly (_start -> main):

#include <asm/unistd.h>
#include <syscall.h>
#define STDOUT 1

        .data
hellostr:
        .ascii "hello wolrd\n"
helloend:

        .text
        .globl main

main:
        movl    $(SYS_write), %eax           // ssize_t write(int fd, const void *buf, size_t count);
        movl    $(STDOUT), %ebx
        movl    $hellostr, %ecx
        movl    $(helloend-hellostr), %edx
        int     $0x80

        movl    $(SYS_exit), %eax            // void _exit(int status);
        xorl    %ebx, %ebx
        int     $0x80

        ret

and built it like this:

$ gcc -Wall test.S -m32 -o test

verified that we have a 32-bit executable:

$ file test
test: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), not stripped

and it appears to run OK:

$ ./test
hello wolrd

How to Compile a C program which contains 32bit asm into .o file?

Use gcc -m32 -c evil_puts.c -o evil_puts.o

You're getting that error because you don't have the 32-bit libraries installed.

If using Ubuntu:

sudo apt-get install gcc-multilib

x86_64 Inline Assembly ; Copying 64-bit register directly to 64-bit memory location

It's the input "0" (*_rax) which is foxing it... it seems that "0" does not work with a "=m" memory constraint, nor with "+m". (I do not know why.)

Changing your second function to compile and work:

uint32_t cpuid_0(uint32_t* _eax, uint32_t* _ebx, uint32_t* _ecx, uint32_t* _edx)
{
    __asm__
    (
        "mov  $0, %%eax\n"
        "cpuid\n"
        "mov  %%eax, %0\n"
        "mov  %%ebx, %1\n"
        "mov  %%ecx, %2\n"
        "mov  %%edx, %3\n"
        : "=m" (*_eax), "=m" (*_ebx), "=m" (*_ecx), "=m" (*_edx)
        : // "0" (*_eax) -- not required and throws errors !!
        : "%rax", "%rbx", "%rcx", "%rdx"     // ESSENTIAL "clobbers"
    ) ;
    return *_eax ;
}
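
For example, a usage sketch of my own (assuming the cpuid_0 above is in scope); leaf 0 returns the maximum basic leaf in eax and spells the vendor id across ebx, edx, ecx:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t eax, ebx, ecx, edx ;
    char vendor[13] ;

    cpuid_0(&eax, &ebx, &ecx, &edx) ;

    memcpy(vendor + 0, &ebx, 4) ;     // the vendor string is ebx, then edx,
    memcpy(vendor + 4, &edx, 4) ;     //  then ecx -- e.g. "GenuineIntel"
    memcpy(vendor + 8, &ecx, 4) ;
    vendor[12] = '\0' ;

    printf("max leaf %u, vendor \"%s\"\n", eax, vendor) ;
    return 0 ;
}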

where that:

  • does everything as uint32_t, for consistency.

  • discards the redundant int a, b, c, d;

  • omits the "0" input, which in any case was not being used.

  • declares simple "=m" output for (*_eax)

  • "clobbers" all "%rax", "%rbx", "%rcx", "%rdx"

  • discards the redundant volatile.

The clobbers are essential: without them, the compiler has no idea that those registers are affected.

The above compiles to:

   push   %rbx            # compiler (now) knows %rbx is "clobbered"
   mov    %rdx,%r8        # likewise %rdx
   mov    %rcx,%r9        # ditto %rcx

   mov    $0x0,%eax       # the __asm__(....
   cpuid
   mov    %eax,(%rdi)
   mov    %ebx,(%rsi)
   mov    %ecx,(%r8)
   mov    %edx,(%r9)      # ....) ;

   mov    (%rdi),%eax
   pop    %rbx
   retq

NB: without the "clobbers", the same source compiles to:

   mov    $0x0,%eax
   cpuid
   mov    %eax,(%rdi)
   mov    %ebx,(%rsi)
   mov    %ecx,(%rdx)
   mov    %edx,(%rcx)
   mov    (%rdi),%eax
   retq

which is shorter, but sadly doesn't work !!


You could also (version 2):

struct cpuid
{
    uint32_t eax ;
    uint32_t ebx ;
    uint32_t ecx ;
    uint32_t edx ;
};

uint32_t cpuid_0(struct cpuid* cid)
{
    uint32_t eax ;

    __asm__
    (
        "mov  $0, %%eax\n"
        "cpuid\n"
        "mov  %%ebx, %1\n"
        "mov  %%ecx, %2\n"
        "mov  %%edx, %3\n"
        : "=a" (eax), "=m" (cid->ebx), "=m" (cid->ecx), "=m" (cid->edx)
        :: "%ebx", "%ecx", "%edx"
    ) ;

    return cid->eax = eax ;
}

which compiles to something very slightly shorter:

   push   %rbx
   mov    $0x0,%eax
   cpuid
   mov    %ebx,0x4(%rdi)
   mov    %ecx,0x8(%rdi)
   mov    %edx,0xc(%rdi)
   pop    %rbx
   mov    %eax,(%rdi)
   retq

Or you could do something more like your first version (version 3):

uint32_t cpuid_0(struct cpuid* cid)
{
    uint32_t eax, ebx, ecx, edx ;

    eax = 0 ;
    __asm__(" cpuid\n" : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));

    cid->edx = edx ;
    cid->ecx = ecx ;
    cid->ebx = ebx ;
    return cid->eax = eax ;
}

which compiles to:

   push   %rbx
   xor    %eax,%eax
   cpuid
   mov    %ebx,0x4(%rdi)
   mov    %edx,0xc(%rdi)
   pop    %rbx
   mov    %ecx,0x8(%rdi)
   mov    %eax,(%rdi)
   retq

This version uses the "+a", "=b" etc. magic to tell the compiler to allocate specific registers to the various variables. This reduces the amount of assembler to the bare minimum, which is generally a Good Thing. [Note that the compiler knows that xor %eax,%eax is better (and shorter) than mov $0,%eax and thinks there is some advantage to doing the pop %rbx earlier.]


Better yet -- following comment by @Peter Cordes (version 4):

uint32_t cpuid_1(struct cpuid* cid)
{
    __asm__
    (
        "xor   %%eax, %%eax\n"
        "cpuid\n"
        : "=a" (cid->eax), "=b" (cid->ebx), "=c" (cid->ecx), "=d" (cid->edx)
    ) ;

    return cid->eax ;
}

where the compiler figures out that cid->eax is already in %eax, and so compiles to:

   push   %rbx
   xor    %eax,%eax
   cpuid
   mov    %ebx,0x4(%rdi)
   mov    %eax,(%rdi)
   pop    %rbx
   mov    %ecx,0x8(%rdi)
   mov    %edx,0xc(%rdi)
   retq

which is the same as version 3, apart from a small difference in the order of the instructions.


FWIW: an __asm__() is defined to be:

  asm asm-qualifiers (AssemblerTemplate : OutputOperands [ : InputOperands [ : Clobbers ] ] )
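
A concrete instance of that grammar (my own example), with each piece labelled:

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi ;
    __asm__ volatile ("rdtsc"                 // AssemblerTemplate
                      : "=a" (lo), "=d" (hi)  // OutputOperands
                      :                       // no InputOperands
                      :) ;                    // no Clobbers
    return ((uint64_t)hi << 32) | lo ;       // rdtsc returns the TSC in edx:eax
}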

The key to inline assembler is to understand that the compiler:

  • has no idea what the AssemblerTemplate part means.

    It does expand the %xx place holders, but understands nothing else.

  • does understand the OutputOperands, InputOperands (if any) and Clobbers (if any)...

    ...these tell the compiler what the assembler needs as parameters, and how to expand the various %xx.

    ...but these also tell the compiler what the AssemblerTemplate does, in terms that the compiler understands.

So, what the compiler understands is a sort of "data flow". It understands that the assembler takes a number of inputs, returns a number of outputs, and may, as a side effect, "clobber" some registers and/or amounts of memory. Armed with this information, the compiler can integrate the "black box" assembler sequence with the code generated around it. Among other things, the compiler will:

  • allocate registers for output and input operands

    and arrange for the inputs to be in the required registers (as required).

NB: the compiler looks on the assembler as a single operation, where all inputs are consumed before any outputs are generated. If an input is not used after the __asm__(), the compiler may allocate a given register as both an input and an output. Hence the need for the so-called "early clobber" -- see the sketch after this list.

  • move the "black box" around wrt the surrounding code, maintaining the dependencies the assembler has on the sources of its inputs and the dependencies the code that follows has on the assembler's outputs.

  • discard the "black box" altogether if nothing seems to depend on its outputs !
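
A minimal sketch of that early clobber (the add3 function is illustrative, not from the question):

// "=&r" marks `out` as early-clobbered: the template writes it (the first
// mov) before it has read inputs %2 and %3, so the compiler must not give
// `out` the same register as any input.
static inline long add3(long a, long b, long c)
{
    long out ;
    __asm__ ("mov  %1, %0\n\t"
             "add  %2, %0\n\t"
             "add  %3, %0"
             : "=&r" (out)
             : "r" (a), "r" (b), "r" (c)) ;
    return out ;
}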

x86 inline yasm convert to x64

Without your code, my best guess is that you should read the AMD64 ABI document and follow the standard calling convention for the x64 platform; I think this should work for you. As that document says, you must pass parameters as follows (note that you must first classify each argument using the method described in the ABI standard):

  1. If the class is MEMORY, pass the argument on the stack.
  2. If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
  3. If the class is SSE, the next available vector register is used, the registers are taken in the order from %xmm0 to %xmm7.

...
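
As a quick illustration of rules 2 and 3 (my own example, not from the quoted document):

// System V AMD64 classification and register assignment:
//   long   a -> INTEGER -> %rdi
//   long   b -> INTEGER -> %rsi
//   double x -> SSE     -> %xmm0
//   double y -> SSE     -> %xmm1
// (an integer return value comes back in %rax)
long f(long a, long b, double x, double y) ;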


