Linux Assembler Error "Impossible Constraint in 'Asm'"

Linux assembler error impossible constraint in ‘asm’

__asm__ __volatile__ ("addl %%ebx,%%eax" : "=a"(foo) : "a"(foo), "b"(bar));

seems to work. I believe that the syntax for register constraints changed at some point, but it's not terribly well documented. I find it easier to write raw assembly and avoid the hassle.
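For completeness, here's a hedged sketch (function name made up for illustration) of the same add wrapped in a function, using a read/write "+a" constraint so foo doesn't have to be listed twice as separate input and output operands:

/* Minimal sketch: eax += ebx, with foo as a single read/write operand. */
static int add_asm(int foo, int bar)
{
    __asm__ __volatile__("addl %%ebx, %%eax" : "+a"(foo) : "b"(bar));
    return foo;
}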

error: Impossible constraint in 'asm'

const means you can't modify the variable, not that it's a compile-time constant. The value is only known at compile time if the caller passes a constant and you compile with optimization enabled, so constant-propagation can get that value to the asm statement. Even C++ constexpr doesn't require a constant expression in most contexts; it only allows it, and guarantees that compile-time constant-propagation is possible.

A stand-alone version of this function can't exist, but you didn't make it static, so the compiler has to create a non-inline definition that can be called from other compilation units, even if it inlines into every call-site in this file. That's impossible, because const int b doesn't have a known value there.

For example,

int foo(const int x) {
    return x * 37;
}

int bar() {
    return foo(2);
}

On Godbolt compiled for AArch64: notice that foo can't just return a constant; it needs to work with a run-time variable argument, whatever value that happens to be. Only bar, with optimization enabled, can inline the call and avoid needing the value of x in a register, just returning a constant (which it uses as an immediate for mov).

foo(int):
        mov     w1, 37
        mul     w0, w0, w1
        ret
bar():
        mov     w0, 74
        ret

In a shared library, your function also has to be __attribute__((visibility("hidden"))) so it can actually inline; otherwise the possibility of symbol interposition means the compiler can't assume that foo(123) will actually call the int foo(int) defined in the same .c file.

(Or static inline.)
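As a minimal sketch (hypothetical function names), either of these lets the compiler assume that calls within the same .c resolve to this definition, so it can inline and constant-propagate the argument instead of having to keep an interposable out-of-line copy:

/* static: no exported, interposable definition is required at all */
static inline int foo_inline(const int x) { return x * 37; }

/* hidden visibility: exported, but not interposable, so inlining stays valid */
__attribute__((visibility("hidden")))
int foo_hidden(const int x) { return x * 37; }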



Is there an efficient way to avoid using 256 if-else statements?

Not sure what you're doing with your vector exactly, but if you don't have a shuffle that can work with runtime-variable counts, storing to a 16-byte array can be the least bad option, as sketched below. But storing one byte and then reloading the whole vector will cause a store-forwarding stall, probably similar in cost to x86 if not worse.
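A hedged sketch of that store/modify/reload approach with NEON intrinsics (the function name is made up, and the index is simply masked into range for illustration):

#include <arm_neon.h>
#include <stdint.h>

/* Spill the vector to memory, overwrite one byte at a runtime index,
 * then reload the whole thing.  Expect a store-forwarding stall on the reload. */
uint8x16_t insert_byte_var(uint8x16_t v, uint8_t b, unsigned idx)
{
    uint8_t tmp[16];
    vst1q_u8(tmp, v);
    tmp[idx & 15] = b;
    return vld1q_u8(tmp);
}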

Doing your algorithm efficiently with AArch64 SIMD instructions is a separate question, and you haven't given enough info to figure out anything about that. Ask a different question if you want help implementing some algorithm to avoid this in the first place, or an efficient runtime-variable byte insert using other shuffles.

ARM inline assembly code with error impossible constraint in asm

Your inline assembly code makes a number of mistakes:

  • It tries to use a 64-bit structure as an operand with a 32-bit output register ("=r") constraint. This is what gives you the error.
  • It doesn't use that output operand anywhere
  • It doesn't tell the compiler where the output actually is (S0/S1)
  • It doesn't tell the compiler that len is supposed to be an input
  • It clobbers a number of registers, R3, S11, S12, S13, S14, without telling the compiler.
  • It uses a label .loop that unnecessarily prevents the compiler from inlining your code in multiple places.
  • It doesn't actually appear to be the equivalent of the C++ code you've shown, calculating something else instead.

I'm not going to bother explaining how to fix all these mistakes, because you shouldn't be using inline assembly here. You can write your code in C++ and let the compiler do the vectorization.

For example, compiling the following code, equivalent to your example C++ code, with GCC 4.9 and the -O3 -funsafe-math-optimizations options:

dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int i;
    dcmplx xout;
    xout.re = xout.im = 0.0;
    for (i = 0; i < len; i++) {
        xout.re += hat[i].re * buf[i].re + hat[i].im * buf[i].im;
        xout.im += hat[i].re * buf[i].im - hat[i].im * buf[i].re;
    }
    return xout;
}

generates the following assembly as its inner loop:

.L97:
        add     lr, lr, #1
        cmp     ip, lr
        vld2.32 {d20-d23}, [r5]!
        vld2.32 {d24-d27}, [r4]!
        vmul.f32 q15, q12, q10
        vmul.f32 q14, q13, q10
        vmla.f32 q15, q13, q11
        vmls.f32 q14, q12, q11
        vadd.f32 q9, q9, q15
        vadd.f32 q8, q8, q14
        bhi     .L97

Based on your inline assembly code, it's likely that the compiler generated better code than what you would've come up with if you had tried to vectorize it yourself.

The -funsafe-math-optimizations is necessary because the NEON instructions aren't fully IEEE 754 conformant. As the GCC documentation states:

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

I should also note that the compiler generates almost as good code as the above if you don't roll your own complex type, as in the following example:

#include <complex>

typedef std::complex<float> complex;

complex ComplexConv_std(int len, complex *hat, complex *buf)
{
    int i;
    complex xout(0.0f, 0.0f);
    for (i = 0; i < len; i++) {
        xout += std::conj(hat[i]) * buf[i];
    }
    return xout;
}

One advantage to using your own type, however, is that you can improve the code the compiler generates by making one small change to how you declare struct dcmplx:

typedef struct {
    float re;
    float im;
} __attribute__((aligned(8))) dcmplx;

By saying it needs to be 8-byte (64-bit) aligned, this allows the compiler to skip the check for whether the data is suitably aligned, which would otherwise make it fall back to a slower scalar implementation.

Now, hypothetically, let's say you were unsatisfied with how GCC vectorized your code and thought you could do better. Would this justify using inline assembly? No, the next thing to try is the ARM NEON intrinsics. Using intrinsics is just like normal C++ programming; you don't have to worry about a bunch of special rules you need to follow. For example, here's how I converted the vectorized assembly above into this untested code that uses intrinsics:

#include <assert.h>
#include <arm_neon.h>

dcmplx ComplexConv(int len, dcmplx *hat, dcmplx *buf)
{
    int i;
    dcmplx xout;

    /* everything needs to be suitably aligned */
    assert(len % 4 == 0);
    assert(((unsigned) hat % 8) == 0);
    assert(((unsigned) buf % 8) == 0);

    float32x4_t re = vdupq_n_f32(0.0f);
    float32x4_t im = vdupq_n_f32(0.0f);
    for (i = 0; i < len; i += 4) {
        /* vld2 deinterleaves: val[0] holds the real parts, val[1] the imaginary parts */
        float32x4x2_t h = vld2q_f32(&hat[i].re);
        float32x4x2_t b = vld2q_f32(&buf[i].re);
        /* re += h.re*b.re + h.im*b.im */
        re = vaddq_f32(re, vmlaq_f32(vmulq_f32(h.val[0], b.val[0]),
                                     h.val[1], b.val[1]));
        /* im += h.re*b.im - h.im*b.re */
        im = vaddq_f32(im, vmlsq_f32(vmulq_f32(h.val[0], b.val[1]),
                                     h.val[1], b.val[0]));
    }
    /* horizontal sums of the four accumulator lanes */
    float32x2_t re_tmp = vadd_f32(vget_low_f32(re), vget_high_f32(re));
    float32x2_t im_tmp = vadd_f32(vget_low_f32(im), vget_high_f32(im));
    xout.re = vget_lane_f32(vpadd_f32(re_tmp, re_tmp), 0);
    xout.im = vget_lane_f32(vpadd_f32(im_tmp, im_tmp), 0);
    return xout;
}

Finally, if this wasn't good enough and you needed to squeeze out every last bit of performance, it's still not a good idea to use inline assembly. Instead, your last resort should be regular assembly. Since you're rewriting most of the function in assembly, you might as well write it completely in assembly. That means you don't have to worry about telling the compiler about everything you're doing in the inline assembly. You only need to conform to the ARM ABI, which can be tricky enough, but is a lot easier than getting everything correct with inline assembly.

C embedded assembly error: ‘asm’ operand has impossible constraints

You declare clobbers on most of the integer registers, but then you ask for 13 different input variables. 32-bit ARM only has 16 registers, and two of those are PC and SP, leaving at best 14 general-purpose registers.

We can test that too many clobbers + operands is the problem by removing all the clobbers on r0..r12; this lets it compile (into incorrect code!): https://godbolt.org/z/Z6x78N. This is not the solution, because it introduces huge bugs; it's just how I confirmed that this is the problem.

Any time your inline asm template starts with a mov to copy from an input register operand into a hard-coded register, you're usually doing it wrong. Even if you had enough registers, the compiler would have to emit code to get the variable into a register, and then your hand-written asm uses another mov to copy it for no reason.

See https://stackoverflow.com/tags/inline-assembly/info for more guides.

Instead, ask the compiler for the input in that register in the first place with register int foo asm("r0"), or better, let the compiler do register allocation by using %0 or the equivalent named operand like %[src1] instead of a hard-coded r0 everywhere inside your asm template. The syntax for naming an operand is [name] "r" (C_var_name). The names don't have to match the C variable names, but they don't have to be unique either; it's often convenient to use the same asm operand name as the C var name.

Then you can remove the clobbers on most of the GP registers. You do need to tell the compiler about any input registers you modify, e.g. by using a "+r" constraint instead of "r" (and then not using that C variable after the asm modifies it). Or use an "=r" output constraint and a matching input constraint like "0" (var) to put that input in the same register as output operand 0; both forms are sketched below. "+r" is much easier in a wrapper function where the C variable is not used afterwards anyway.
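As a quick illustration (a trivial increment standing in for real code, not from the question), these are two equivalent ways to tell the compiler that an input register gets modified:

/* 32-bit ARM sketch: increment x in place. */
int bump_rw(int x)
{
    asm("add %0, %0, #1" : "+r"(x));             /* "+r": one read/write operand */
    return x;
}

int bump_matching(int x)
{
    int out;
    asm("add %0, %0, #1" : "=r"(out) : "0"(x));  /* "0": input in the same register as output 0 */
    return out;
}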

You can remove the clobbers on vector registers if you use dummy output operands to get the compiler to do register allocation, but it's basically fine if you just leave those hard-coded.

asm(  // "mov        r0, %[src1]; "   // remove this and just use %[src1] instead of r0

"... \n\t"
"VST1.16 {d30[0]}, [%[dstData]]! \n\t" //restore img_temp[m][n] to pointer data
"... \n\t"

: [src1]"+&r"(src1), [src2]"+&r"(src2), [dstData]"+&r"(dstData),
[dstSum]"+&r"(dstSum), [height]"+&r"(height)

: [temp_comp1] "r"(temp_comp1), [niter] "r"(numofiterations),
[temp_comp2] "r"(temp_comp2), [temp_comp3] "r"(temp_comp3),
...
: "memory", "cc", all the q and d regs you use. // but not r0..r13
);

You can look at the compiler's asm output to see how it filled in the %0 and %[name] operands in the asm template you gave it. Use "instruction \n\t" to make this readable; separating with ; puts all the instructions onto the same line in the asm output. (C string-literal concatenation doesn't introduce newlines.)

The early-clobber declarations on the read/write operands make sure that none of the input-only operands share a register with them, even if the compiler knows that temp_comp1 == height, for example. The original value of temp_comp1 still needs to be readable from the register %[temp_comp1] even after something has modified %[height], so they can't both be r4, for example. Without the & in "+&r", the compiler could choose to share a register to gain efficiency; that's only safe when the outputs are written after all inputs have been read (e.g. when wrapping a single instruction, which is what GNU C inline asm is designed to do efficiently).


Side note: char array1[16] and array2[16] don't need to be volatile; the "memory" clobber on the asm statement is sufficient, even though you only pass pointers to them rather than using them as "m" input operands, as the sketch below illustrates.
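A hedged sketch of that point (the function name and register choices are made up): plain arrays plus a "memory" clobber are enough when the asm touches them only through pointer operands.

char array1[16], array2[16];   /* no volatile needed */

void copy16(void)
{
    asm volatile("vld1.8 {d0, d1}, [%0] \n\t"   /* load 16 bytes from array1 */
                 "vst1.8 {d0, d1}, [%1] \n\t"   /* store them to array2 */
                 :
                 : "r"(array1), "r"(array2)
                 : "d0", "d1", "memory");       /* "memory" covers the array accesses */
}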

error: impossible register constraint in 'asm'

That inline assembly is buggy:

  1. It uses multi-line string literals, which effectively concatenate; without \n separators, everything ends up on one line. Whether your assembler accepts statements separated by semicolons makes all the difference there ... some may not.
  2. It specifies the same variable as both an input and an output constraint instead of using "+r"(value), as ordinarily suggested for this situation.

Without seeing the rest of the code it's not quite clear why the inline assembly statement looks the way it does; personally, I'd suggest writing it like:

asm("ror %%cl, %0" : "+r"(value) : "c"((((uintptr_t)address) & 3) << 3)));

because there's little need to do the calculation itself in assembly. The uintptr_t (from <stdint.h>) cast makes this 32/64-bit agnostic as well.

Edit:

If you want it for a different CPU than x86 / x64, then it obviously needs to be different ... For ARM (not Thumb2), it'd be:

asm("ROR %0, %0, %1" : "+r"(value) : "r"((((uintptr_t)address) & 3) << 3)));

since that's how the rotate instruction there behaves.

Edit (add reference):

Regarding the operation performed here as such, this blog post gives an interesting perspective - namely, that the compiler is quite likely to create the same output for:

(a >> shift | a << (8 * sizeof(a) - shift))

as for the x86 inline

asm("ror %%cl, %0" : "+r"(a) : "c"(shift))

Testing this:

#include <stdint.h>

int main(int argc, char **argv)
{
    unsigned int shift = (int)((((uintptr_t)argv) & 3) << 3);
    unsigned int a = argc;
#ifdef USE_ASM
    /*
     * Mark the assembly version with a "nop" instruction in the output
     */
    asm("nop\n\t"
        "ror %%cl, %0" : "+r"(a) : "c"(shift));
    return a;
#else
    return (a >> shift | a << (8 * sizeof(a) - shift));
#endif
}

Compile / disassemble it:

$ gcc -DUSE_ASM -O8 -c tf.c; objdump -d tf.o

tf.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
0: 83 e6 03 and $0x3,%esi
3: 8d 0c f5 00 00 00 00 lea 0x0(,%rsi,8),%ecx
a: 90 nop
b: d3 cf ror %cl,%edi
d: 89 f8 mov %edi,%eax
f: c3 retq
$ gcc -O8 -c tf.c; objdump -d tf.o

tf.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
0: 83 e6 03 and $0x3,%esi
3: 8d 0c f5 00 00 00 00 lea 0x0(,%rsi,8),%ecx
a: d3 cf ror %cl,%edi
c: 89 f8 mov %edi,%eax
e: c3 retq

Ergo, this inline assembly is unnecessary.

compiling error impossible constraint in 'asm'

The problem is that the code you're trying to compile assumes that any target CPU that's not a PowerPC must be an x86 processor. The code simply doesn't support your SPARC CPU.

Fortunately, the code doesn't seem to be critical; it's apparently only used to seed a random number generator, which is then used to create random C programs. The goal is to prevent multiple instances of the program started at the same time from generating the same random programs. I'd replace the code with something more portable that's not dependent on the CPU. Something like this:

#ifdef WIN32

#include <windows.h>

unsigned long platform_gen_seed()
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return now.LowPart;
}

#else /* assume something Unix-like */

#include <sys/types.h>
#include <unistd.h>
#include <time.h>

static unsigned long generic_gen_seed()
{
    pid_t pid = getpid();
    time_t now;
    time(&now);
    return (unsigned long)(now ^ (pid << 16 | ((pid >> 16) & 0xFFFF)));
}

#ifdef CLOCK_REALTIME

unsigned long platform_gen_seed()
{
    struct timespec now, resolution;
    if (clock_gettime(CLOCK_REALTIME, &now) == -1
        || clock_getres(CLOCK_REALTIME, &resolution) == -1
        || resolution.tv_sec > 0 || resolution.tv_nsec > 1000000) {
        return generic_gen_seed();
    }
    return (now.tv_nsec / resolution.tv_nsec
            + now.tv_sec * resolution.tv_nsec);
}

#else

unsigned long platform_gen_seed()
{
    return generic_gen_seed();
}

#endif /* CLOCK_REALTIME */

#endif /* WIN32 */

The code has been tested in isolation on Linux and Windows. It should also work in isolation on Solaris SPARC, but I don't know how well it will work in the context of the actual program.


