Inline 64bit Assembly in 32bit GCC C Program
No, this isn't possible. You can't run 64-bit assembly from a 32-bit binary, as the processor will not be in long mode while running your program.
Copying 64-bit code to an executable page will result in that code being interpreted incorrectly as 32-bit code, which will have unpredictable and undesirable results.
Some inline assembly doesn't work in 64 bit mode?
push and pop in 64-bit mode cannot have 32-bit operands, only 16-bit or 64-bit; it is not the case that pop only works on 32-bit registers. In general, a lot of code will work the same in 32-bit and 64-bit modes, but some little-used instructions (decimal math, instructions that have to do with segmentation) have been removed completely from 64-bit mode.
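A small sketch of the point (x86-64 GNU C, my own toy example): the 64-bit operand size is the default and works fine, the 16-bit form (pushw/popw) is still legal, but a 32-bit form like pushl %eax is simply rejected by the assembler in 64-bit mode.

```c
// Round-trip a value through the stack with the default 64-bit push/pop.
// "pushw %ax" / "popw %ax" would also assemble (16-bit operands remain
// legal, though they misalign the stack); "pushl %eax" would not.
static unsigned long round_trip(unsigned long v)
{
    unsigned long out;
    __asm__("pushq %1\n\t"   // 64-bit operand: fine (and the default)
            "popq  %0"       // pop the value straight back off the stack
            : "=r" (out)
            : "r" (v));
    return out;
}
```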
Combining C and Assembly(32 bit code) on Linux 64 bit
Compiling / Linking 32-bit program on 64 bit ubuntu need gcc-multilib, try:
sudo apt-get install gcc-multilib libc6-i386 libc6-dev-i386
However, you may run into other problems when you try to link against other libraries.
You would have better luck using a 32-bit chroot environment (i.e. running a 32-bit root on your 64-bit Ubuntu).
Could anyone help me to read 64 bit from console in 32 bit RISC-V
Yeah, if you can't use the toy system calls, read a string and do total = total*10 + digit on it, where digit = c - '0'. You'll need to do extended-precision multiplication, so it's probably easier to do extended-precision shifts like (total << 3) + (total << 1).
Check compiler output on Godbolt. For example, GCC uses shifts, while clang uses mul/mulhu (multiply high unsigned) for the lo * lo 32x32=>64-bit partial product, and a mul for the high-half cross product (hi * lo). It's fewer instructions, but it depends on a RISC-V CPU with a fast multiplier to be faster than shift/add.
(RISC-V extended-precision addition is inconvenient since it doesn't have a carry flag; you need to emulate carry-out as sum = a + b; carry = sum < a; with unsigned variables.)
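That carry emulation can be sketched in portable C (a hand-rolled 32+32-bit add pair, not the compiler's literal output):

```c
#include <stdint.h>

// Emulate a 64-bit add out of 32-bit halves, the way rv32 code has to:
// there is no carry flag, so carry-out is recovered from an unsigned compare.
static void add64(uint32_t a_hi, uint32_t a_lo,
                  uint32_t b_hi, uint32_t b_lo,
                  uint32_t *r_hi, uint32_t *r_lo)
{
    uint32_t sum   = a_lo + b_lo;     // wraps mod 2^32 on overflow
    uint32_t carry = sum < a_lo;      // 1 iff the low-half add overflowed
    *r_lo = sum;
    *r_hi = a_hi + b_hi + carry;      // propagate carry into the high half
}
```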
#include <stdint.h>

uint64_t strtou64(unsigned char *p) {
    uint64_t total = 0;
    unsigned digit = *p - '0';   // peeling the first iteration is usually good in asm
    while (digit < 10) {         // loop until any non-digit character
        total = total*10 + digit;
        p++;                     // *p was checked before the loop or last iteration
        digit = *p - '0';        // get a digit ready for the loop branch
    }
    return total;
}
Clang's output is shorter, so I'll show it. It of course follows the standard calling convention, taking the pointer in a0 and returning the 64-bit integer in a pair of registers, a1:a0:
# rv32gc clang 14.0 -O3
strtou64:
mv a2, a0
lbu a0, 0(a0) # load the first char
addi a3, a0, -48 # *p - '0'
li a0, 9
bltu a0, a3, .LBB0_4 # return 0 if the first char is a non-digit
li a0, 0 # total in a1:a0 = 0 ; should have done these before the branch
li a1, 0 # so a separate ret wouldn't be needed
addi a2, a2, 1 # p++
li a6, 10 # multiplier constant
.LBB0_2: # do{
mulhu a5, a0, a6 # high half of (lo(total) * 10)
mul a1, a1, a6 # hi(total) * 10
add a1, a1, a5 # add the high-half partial products
mul a5, a0, a6 # low half of (lo(total) * 10)
lbu a4, 0(a2) # load *p
add a0, a5, a3 # lo(total) = lo(total*10) + digit
sltu a3, a0, a5 # carry-out from that
add a1, a1, a3 # propagate carry into hi(total)
addi a3, a4, -48 # digit = *p - '0'
addi a2, a2, 1 # p++ done after the load; clang peeled one pointer increment before the loop
bltu a3, a6, .LBB0_2 # }while(digit < 10)
ret
.LBB0_4:
li a0, 0 # return 0 special case
li a1, 0 # because clang was dumb and didn't load these regs before branching
ret
If you want to go with GCC's shift/or strategy, it should be straightforward to see how that slots into the same logic clang is using. You can look at compiler output for a function like return u64 << 3; to see which instructions are part of that.
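For reference, the shift decomposition GCC picks boils down to this (a sketch of the idea, not GCC's literal output):

```c
#include <stdint.h>

// total * 10 computed without a multiply instruction:
// 10*t == 8*t + 2*t == (t << 3) + (t << 1)
static uint64_t mul10(uint64_t total)
{
    return (total << 3) + (total << 1);
}
```

On rv32 each of those 64-bit shifts itself expands to a pair of 32-bit shifts plus an or to move bits between halves, which is where the extra instructions come from.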
And BTW, I wrote the C with compiling to decent asm in mind, making it easy for the compiler to transform it into a do{}while loop with the condition at the bottom. I based it on the x86 asm in my answer on NASM Assembly convert input to integer?
Running 32 bit assembly code on a 64 bit Linux & 64 bit Processor : Explain the anomaly
Remember that everything by default on a 64-bit OS tends to assume 64-bit. You need to make sure that you are (a) using the 32-bit versions of your #includes where appropriate (b) linking with 32-bit libraries and (c) building a 32-bit executable. It would probably help if you showed the contents of your makefile if you have one, or else the commands that you are using to build this example.
FWIW I changed your code slightly (_start -> main):
#include <asm/unistd.h>
#include <syscall.h>
#define STDOUT 1
.data
hellostr:
.ascii "hello wolrd\n" ;
helloend:
.text
.globl main
main:
movl $(SYS_write) , %eax //ssize_t write(int fd, const void *buf, size_t count);
movl $(STDOUT) , %ebx
movl $hellostr , %ecx
movl $(helloend-hellostr) , %edx
int $0x80
movl $(SYS_exit), %eax //void _exit(int status);
xorl %ebx, %ebx
int $0x80
ret
and built it like this:
$ gcc -Wall test.S -m32 -o test
verified that we have a 32-bit executable:
$ file test
test: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.4, dynamically linked (uses shared libs), not stripped
and it appears to run OK:
$ ./test
hello wolrd
How to Compile a C program which contains 32bit asm into .o file?
Use gcc -m32 -c evil_puts.c -o evil_puts.o
You're getting that error because you don't have the 32-bit libraries installed.
If using Ubuntu:
sudo apt-get install gcc-multilib
x86_64 Inline Assembly ; Copying 64-bit register directly to 64-bit memory location
It's the input "0" (*_rax) which is foxing it... it seems that "0" does not work with a "=m" memory constraint, nor with "+m". (I do not know why.)
Changing your second function to compile and work:
uint32_t cpuid_0(uint32_t* _eax, uint32_t* _ebx, uint32_t* _ecx, uint32_t* _edx)
{
__asm__
(
"mov $0, %%eax\n"
"cpuid\n"
"mov %%eax, %0\n"
"mov %%ebx, %1\n"
"mov %%ecx, %2\n"
"mov %%edx, %3\n"
: "=m" (*_eax), "=m" (*_ebx), "=m" (*_ecx), "=m" (*_edx)
: //"0" (*_eax) -- not required and throws errors !!
: "%rax", "%rbx", "%rcx", "%rdx" // ESSENTIAL "clobbers"
) ;
return *_eax ;
}
where that:
- does everything as uint32_t, for consistency.
- discards the redundant int a, b, c, d;
- omits the "0" input, which in any case was not being used.
- declares a simple "=m" output for (*_eax)
- "clobbers" all of "%rax", "%rbx", "%rcx", "%rdx"
- discards the redundant volatile.
The last is essential, because without it the compiler has no idea that those registers are affected.
The above compiles to:
push %rbx # compiler (now) knows %rbx is "clobbered"
mov %rdx,%r8 # likewise %rdx
mov %rcx,%r9 # ditto %rcx
mov $0x0,%eax # the __asm__(....
cpuid
mov %eax,(%rdi)
mov %ebx,(%rsi)
mov %ecx,(%r8)
mov %edx,(%r9) # ....) ;
mov (%rdi),%eax
pop %rbx
retq
NB: without the "clobbers" compiles to:
mov $0x0,%eax
cpuid
mov %eax,(%rdi)
mov %ebx,(%rsi)
mov %ecx,(%rdx)
mov %edx,(%rcx)
mov (%rdi),%eax
retq
which is shorter, but sadly doesn't work !!
You could also (version 2):
struct cpuid
{
uint32_t eax ;
uint32_t ebx ;
uint32_t ecx ;
uint32_t edx ;
};
uint32_t cpuid_0(struct cpuid* cid)
{
uint32_t eax ;
__asm__
(
"mov $0, %%eax\n"
"cpuid\n"
"mov %%ebx, %1\n"
"mov %%ecx, %2\n"
"mov %%edx, %3\n"
: "=a" (eax), "=m" (cid->ebx), "=m" (cid->ecx), "=m" (cid->edx)
:: "%ebx", "%ecx", "%edx"
) ;
return cid->eax = eax ;
}
which compiles to something very slightly shorter:
push %rbx
mov $0x0,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %ecx,0x8(%rdi)
mov %edx,0xc(%rdi)
pop %rbx
mov %eax,(%rdi)
retq
Or you could do something more like your first version (version 3):
uint32_t cpuid_0(struct cpuid* cid)
{
uint32_t eax, ebx, ecx, edx ;
eax = 0 ;
__asm__(" cpuid\n" : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
cid->edx = edx ;
cid->ecx = ecx ;
cid->ebx = ebx ;
return cid->eax = eax ;
}
which compiles to:
push %rbx
xor %eax,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %edx,0xc(%rdi)
pop %rbx
mov %ecx,0x8(%rdi)
mov %eax,(%rdi)
retq
This version uses the "+a", "=b" etc. magic to tell the compiler to allocate specific registers to the various variables. This reduces the amount of assembler to the bare minimum, which is generally a Good Thing. [Note that the compiler knows that xor %eax,%eax is better (and shorter) than mov $0,%eax, and thinks there is some advantage to doing the pop %rbx earlier.]
Better yet -- following comment by @Peter Cordes (version 4):
uint32_t cpuid_1(struct cpuid* cid)
{
__asm__
(
"xor %%eax, %%eax\n"
"cpuid\n"
: "=a" (cid->eax), "=b" (cid->ebx), "=c" (cid->ecx), "=d" (cid->edx)
) ;
return cid->eax ;
}
where the compiler figures out that cid->eax
is already in %eax
, and so compiles to:
push %rbx
xor %eax,%eax
cpuid
mov %ebx,0x4(%rdi)
mov %eax,(%rdi)
pop %rbx
mov %ecx,0x8(%rdi)
mov %edx,0xc(%rdi)
retq
which is the same as version 3, apart from a small difference in the order of the instructions.
FWIW: an __asm__() is defined to be:

asm asm-qualifiers ( AssemblerTemplate
                   : OutputOperands
                   [ : InputOperands
                   [ : Clobbers ] ] )
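As a minimal illustration of that shape (x86-64 GNU C, my own toy example, not from the question):

```c
#include <stdint.h>

// AssemblerTemplate : OutputOperands : InputOperands : Clobbers, all in one place.
static inline uint32_t add_one(uint32_t x)
{
    __asm__("addl $1, %0"    // AssemblerTemplate -- opaque to the compiler
            : "+r" (x)       // OutputOperands: "+r" = read-write, any GP register
            :                // InputOperands: none
            : "cc");         // Clobbers: the condition codes are modified
    return x;
}
```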
The key to inline assembler is to understand that the compiler:
- has no idea what the AssemblerTemplate part means. It does expand the %xx placeholders, but understands nothing else.
- does understand the OutputOperands, InputOperands (if any) and Clobbers (if any). These tell the compiler what the assembler needs as parameters and how to expand the various %xx, but they also tell the compiler what the AssemblerTemplate does, in terms that the compiler understands.

So, what the compiler understands is a sort of "data flow". It understands that the assembler takes a number of inputs, returns a number of outputs and (may) as a side effect "clobber" some registers and/or amounts of memory. Armed with this information, the compiler can integrate the "black box" assembler sequence with the code generated around it. Among other things the compiler will:
- allocate registers for output and input operands, and arrange for the inputs to be in the required registers (as required). NB: the compiler looks on the assembler as a single operation, where all inputs are consumed before any outputs are generated. If an input is not used after the __asm__(), the compiler can allocate a given register as both an input and an output. Hence the need for the so-called "early clobber".
- move the "black box" around wrt the surrounding code, maintaining the dependencies the assembler has on the sources of its inputs and the dependencies the code that follows has on the assembler's outputs.
- discard the "black box" altogether if nothing seems to depend on its outputs!
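The "early clobber" point can be made concrete like this (x86-64, my own example): the template writes the output before it has finished reading all inputs, so the "&" modifier is needed to stop the compiler from putting an input in the same register as the output.

```c
#include <stdint.h>

// Without the "&" in "=&r", the compiler could legally allocate b in the
// same register as out, and the first mov would destroy b before the add.
static uint64_t add_via_copy(uint64_t a, uint64_t b)
{
    uint64_t out;
    __asm__("mov %1, %0\n\t"     // out = a  (writes %0 early)
            "add %2, %0"         // out += b (still reads %2)
            : "=&r" (out)        // "&": out must not alias any input
            : "r" (a), "r" (b));
    return out;
}
```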
x86 inline yasm convert to x64
Without your code, my best guess is that you should read the AMD64 ABI document for the calling-convention standard on the x64 platform. I think this should work for you. As that document says, you must pass parameters as follows (note that you must first classify your arguments using the method described in the ABI standard):
- If the class is MEMORY, pass the argument on the stack.
- If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
- If the class is SSE, the next available vector register is used; the registers are taken in the order from %xmm0 to %xmm7.
...