Is Inline Assembly Language Slower Than Native C++ Code

Is inline assembly language slower than native C++ code?

Yes, most of the time.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). That's not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it's a contrived example or a single routine, not a real program of 500,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time, and only rarely will you need to write assembly code for a few short, heavily used, performance-critical routines, or when you have to access features your favorite high-level language does not expose. Do you want a taste of this complexity? Read this awesome answer here on SO.

Why is this?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they will do them in seconds (where we might need days).

When you code in assembly you have to write well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization, such as register allocation, constant propagation, common subexpression elimination, instruction scheduling, and other complex, non-obvious optimizations (the polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.

For some complex microcontrollers, even the system libraries are written in C instead of assembly because their compilers produce better (and easier to maintain) final code.

Compilers can sometimes use MMX/SIMD instructions automatically by themselves, and if you don't use them you simply can't compare (other answers have already reviewed your assembly code very well).
Just for loops, here is a short list of the loop optimizations a compiler commonly checks for (do you think you could do all of that yourself, within a schedule that was decided for a C# program?). If you write something in assembly, I think you have to consider at least some simple optimizations. The school-book example for arrays is to unroll the loop (its size is known at compile time); a minimal sketch is shown below. Do it and run your test again.
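For illustration only, here is roughly what a hand-unrolled sum over a fixed-size array might look like (the function name and array size are made up for this sketch; a modern compiler will usually do this transformation itself at -O2/-O3, often together with vectorization):

#include <stddef.h>

enum { N = 1024 };   // array size known at compile time (made up for this sketch)

// Sum a fixed-size array, unrolled by 4. Several independent accumulators
// also help hide the latency of the add dependency chain.
int sum_unrolled(const int a[N])
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < N; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}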

These days it's also really uncommon to need assembly language for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you may use PGO, but in assembly you will need deep knowledge of that specific architecture (and you'll have to rethink and redo everything for another architecture). For small tasks the compiler usually does it better, and for complex tasks the work usually isn't repaid (and the compiler may do better anyway).

If you sit down and take a look at your code, you'll probably see that you'll gain more by redesigning your algorithm than by translating it to assembly (read this great post here on SO); there are high-level optimizations (and hints to the compiler) you can apply effectively before you need to resort to assembly language. It's probably worth mentioning that often, using intrinsics, you will get the performance gain you're looking for while the compiler will still be able to perform most of its optimizations.

All this said, even when you can produce assembly code that is 5 to 10 times faster, you should ask your customers whether they would rather pay for a week of your time or buy a $50-faster CPU. Extreme optimization, more often than not (and especially in LOB applications), is simply not required of most of us.

Why is my assembly code much slower than the C implementation

Peter Cordes's comment explains what is happening here: srt1 and srt2 are inlined while srt is not.
Quoting Peter Cordes:

Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline.

(Edited) When should one use inline assembly in C (outside of optimization)?

(Most of this was written for the original version of the question. It was edited after).

You mean purely for performance reasons, so excluding using special instructions in an OS kernel?

What you really ultimately want is machine code that executes efficiently. And the ability to modify some text files and recompile to get different machine code. You can usually get both of those things without needing inline asm, therefore:

https://gcc.gnu.org/wiki/DontUseInlineAsm

GNU C inline assembly is hard to use correctly, but if you do use it correctly it has very low overhead. Still, it blocks many important optimizations like constant propagation.
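As a small illustration of the constant-propagation point (a sketch; the function names are made up): the pure-C helper below can be folded to a compile-time constant when called with constant arguments, while the asm version is a black box to the optimizer, so the add actually has to execute at run time.

// Pure C: the compiler can evaluate add_c(2, 3) at compile time.
static inline int add_c(int a, int b) { return a + b; }

// GNU C inline asm (x86, AT&T syntax): correct, but the compiler can't
// see inside the template, so it can't constant-fold add_asm(2, 3).
static inline int add_asm(int a, int b)
{
    asm("add %1, %0" : "+r"(a) : "r"(b));
    return a;
}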

See https://stackoverflow.com/tags/inline-assembly/info for guides on how to use it efficiently / safely. (e.g. use constraints instead of stupid mov instructions as the first or last instruction in the asm template.)
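To make the "constraints instead of mov" point concrete, here is a hedged sketch using bswap as a stand-in instruction (in real code you'd just use __builtin_bswap32 or an intrinsic):

// Bad (and broken without an "eax" clobber): explicit movs through a fixed
// register add overhead and fight the compiler's register allocator.
//   asm("mov %1, %%eax\n\t" "bswap %%eax\n\t" "mov %%eax, %0"
//       : "=r"(out) : "r"(in));

// Better: one instruction; the "+r" constraint lets the compiler pick any
// register and treats the operand as read/write.
static inline unsigned bswap32(unsigned x)
{
    asm("bswap %0" : "+r"(x));
    return x;
}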


Pretty much always inappropriate, unless you know exactly what you're doing and can't hand-hold the compiler into making asm that's as good from pure C or intrinsics. Manual vectorization with intrinsics certainly still has its place; compilers are still terrible at some things, like auto-vectorizing complex shuffles. GCC/Clang won't auto-vectorize at all for search loops like a pure C implementation of memchr, or any loop where the trip count isn't known before the first iteration.
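As a sketch of the kind of search loop compilers won't auto-vectorize for you, here is a simplified memchr-style inner loop with SSE2 intrinsics (it assumes the buffer is 16-byte aligned and the length is a multiple of 16; a real implementation has to handle the tail and unaligned starts):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

const char *find_byte_sse2(const char *buf, size_t len, char c)
{
    __m128i needle = _mm_set1_epi8(c);                   // broadcast the byte
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if (mask)                                        // any lane matched?
            return buf + i + __builtin_ctz(mask);        // first match (GNU builtin)
    }
    return NULL;
}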

And of course the performance gain on current microarchitectures has to outweigh the cost to maintainability and to optimizing differently for future CPUs. If it's ever appropriate, it's only for small hot loops where your program spends a lot of time, and which are typically CPU-bound. If the code is memory-bound there's usually not much to gain.

Over large scales, compilers are excellent (especially with link-time optimization). Humans can't compete on that scale, not while keeping code maintainable. The only place humans can still compete is in the small scale where you can afford the time to think about every single instruction in a loop that will run many iterations over the course of a program.

The more widely used and performance-sensitive your code is (e.g. a video encoder like x264 or x265), the more reason there is to consider hand-tuned asm. Saving a few cycles across the millions of computers running your code every day starts to be worth the maintenance / testing / portability downsides.


The one notable exception is ARM SIMD (NEON), where compilers are often still bad. This applies especially to 32-bit ARM, where each 128-bit q0..q15 register aliases a pair of 64-bit d registers (d0..d31), so you can avoid shuffling by accessing the two halves as separate registers. Compilers don't model this well, and can easily shoot themselves in the foot when compiling intrinsics that you'd expect to compile efficiently. Compilers are good at producing efficient asm from SIMD intrinsics for x86 (SSE/AVX) and PowerPC (AltiVec), but for some unknown reason are bad at optimizing ARM NEON intrinsics and often produce sub-optimal asm.

Some compilers are not bad, e.g. apparently Apple clang/LLVM for AArch64 does ok more often than it used to. But still, see Arm Neon Intrinsics vs hand assembly - Jake Lee found the intrinsics version of his 4x4 float matmul was 3x slower than his hand-written version using clang, in Dec 2017. Jake is an ARM optimization expert so I'm inclined to believe that's fairly realistic.


or __asm (in the case of VC++)

MSVC-style asm is usually only useful for writing whole loops because having to take inputs via memory operands destroys (some of) the benefit. So amortizing that overhead over a whole loop helps.

For wrapping single instructions, introducing extra store-forwarding latency is just dumb, and there are MSVC intrinsics for almost everything you can't easily express in pure C. See What is the difference between 'asm', '__asm' and '__asm__'? for examples with a single instruction: you get much worse asm from MSVC inline asm than you would from pure C or an intrinsic if you look at the big picture (including the compiler-generated asm outside your asm block).
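For example, instead of wrapping a single bsr in an __asm block (which forces the input through memory), MSVC's intrinsic lets the operation inline normally. A minimal sketch (the wrapper function is made up for this example):

#include <intrin.h>

// MSVC: _BitScanReverse writes the index of the highest set bit and
// returns nonzero if any bit was set; the index is undefined for mask == 0.
unsigned long highest_set_bit(unsigned long mask)
{
    unsigned long index;
    if (_BitScanReverse(&index, mask))
        return index;
    return 0;   // arbitrary fallback for mask == 0 in this sketch
}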


C++ code for testing the Collatz conjecture faster than hand-written assembly - why? shows a concrete example where hand-written asm is faster on current CPUs than anything I was able to get GCC or clang to emit by tweaking C source. They apparently don't know how to optimize for lower-latency LEA when it's part of a loop-carried dependency chain.

(The original question there was a great example of why you shouldn't write by hand in asm unless you know exactly what you're doing and use optimized compiler output as a starting point. But my answer shows that for a long-running hot tight loop, there are significant gains that compilers are missing with just micro-optimizations, even leaving aside algorithmic improvements.)

If you're considering asm, always benchmark it against the best you can get the compiler to emit. Working on a hand-written asm version may give you ideas that you can apply to your C to hand-hold compilers into making better asm. Then you can get the benefit without actually including any non-portable inline asm in your code.

Assembly - Are there any languages other than C and C++ that allow for interaction with Assembly using inline code?

Yes, D, Rust, Delphi, and quite a few other ahead-of-time-compiled languages have some form of inline asm.

Java doesn't, nor do most other languages that are normally JIT-compiled from a portable binary (like Java's .class bytecode, or C#'s CIL). See Code injecting/assembly inlining in Java?

Very high level languages like Python don't even have simple object-representations for numbers, e.g. an integer variable isn't just a 32-bit object, it has type info, and (in Python specifically) can be arbitrary length for large values. So even if a Python implementation did have inline-asm facilities, it would be a challenge to let you do anything to Python objects, except maybe for NumPy arrays which are laid out like C arrays.

It's possible to call native machine-code functions (e.g. libraries compiled from C, or hand-written asm) from most high-level languages - that's usually important for writing some kinds of applications. For example, in Java there's JNI (Java Native Interface). Even node.js JavaScript can call native functions. "Marshalling" args into a form that makes sense to pass to a C function can be expensive, depending on the high-level language and whether you want to let the C / asm function modify an array or just return a value.



Different forms of inline asm in different languages

Often they're not MSVC's inefficient form like you're using (which forces a store/reload for inputs and outputs). Better designs, like Rust's (modeled on GNU C inline asm), can use registers. For example, GNU C's asm("lzcnt %1, %0" : "=r"(leading_zero_count) : "rm"(input)); lets the compiler pick the output register, and pick either a register or a memory addressing mode for the input.

(But even better to use intrinsics like _lzcnt_u32 or __builtin_clz for operations the compiler knows about, only inline asm for instructions the compiler doesn't have intrinsics for, or if you want to micro-optimize a loop in a certain way. https://gcc.gnu.org/wiki/DontUseInlineAsm)

Some (like Delphi) take inputs via a "calling convention" similar to a function call, with args in registers, rather than allowing free mixing of asm and high-level code. So it's more like an asm block with fixed inputs and one output in a specific register (plus side effects), which the compiler can inline the way it would a function.


For syntax like you show to work, either

  • You have to manually save/restore every register you use inside the asm block (really bad for performance unless you're wrapping a big loop - apparently Borland Turbo C++ was like this)
  • Or the compiler has to understand every single instruction to know what registers it might write (MSVC is like this). The design notes / discussion for Rust's inline asm mention this requirement for D or MSVC compilers to implement what's effectively a DSL (Domain Specific Language), and how much extra work that is, especially for portability to new ISAs.

Note that MSVC's specific implementation of inline asm was so brittle and clunky that it doesn't work safely in functions with register args, which meant not supporting it at all for x86-64, or for ARM/AArch64 where the standard calling convention uses register args. Instead, they provide intrinsics for basically every instruction, including privileged ones like invlpg, making it possible to write a kernel (such as Windows) in Visual C++. (Other compilers would expect you to use asm() for such things.) Windows almost certainly has a few parts written in separate .asm files, like interrupt and system-call entry points, and maybe a context-switch function that has to load a new stack pointer, but with good intrinsics support you don't need inline asm if you trust your compiler to make good-enough asm on its own.

When is assembly faster than C?

Here is a real world example: Fixed point multiplies on old compilers.

These don't only come in handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has a 23-bit mantissa, and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range, instead of the close-to-uniform relative precision of float.


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

  • Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems (see the sketch after this list).
  • _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
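
For the first bullet, a minimal sketch of the non-portable but efficient version using the GNU C / Clang unsigned __int128 extension (MSVC would use the _umul128 intrinsic instead; the function name is made up for this sketch):

#include <stdint.h>

// GCC/Clang on 64-bit targets: the compiler emits a single widening multiply
// and keeps only the high 64 bits of the 128-bit product.
static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}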

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
    long long a_long = a;               // cast to 64 bit.
    long long product = a_long * b;     // perform multiplication
    return (int) (product >> 16);       // shift by the fixed point bias
}

The problem with this code is that we do something that can't be expressed directly in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (even though the x86 can do such shifts).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using asm is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32=64-bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics, you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a, b), 16);
}

The performance difference for fixed-point divides is even bigger. I had improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.


Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
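A tiny illustration of the std::bitset approach (the function name is just for this sketch):

#include <bitset>
#include <cstdint>

// Portable C++; with popcnt enabled at compile time (e.g. -mpopcnt),
// compilers typically turn this into a single popcnt instruction.
int popcount32(std::uint32_t x)
{
    return static_cast<int>(std::bitset<32>(x).count());
}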

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.

What are some real-life uses of inline assembly?

Inline assembly (and on a related note, calling external functions written purely in assembly) can be extremely useful or absolutely essential for reasons such as writing device drivers, direct access to hardware or processor capabilities not defined in the language, hardware-supported parallel processing (as opposed to multi-threading) such as CUDA, interfacing with FPGAs, performance, etc.

It is also important because some things are only possible by going "beneath" the level of abstraction provided by the Standard (both C++ and C).

The Standard(s) recognize that some things will be inherently implementation-defined, and allow for that throughout the Standard. One of these allowances (perhaps the lowest-level) is recognition of asm. Well, "sort of" recognition:

In C (N1256), it is found in the Standard under "Common extensions":

J.5.10 The asm keyword

1 The asm keyword may be used to insert assembly language directly into the translator output (6.8). The most common implementation is via a statement of the form:

asm ( character-string-literal );

In C++ (N3337), it has similar caveats:

§7.4/1

An asm declaration has the form

asm-definition:

asm ( string-literal ) ;

The asm declaration is conditionally-supported; its meaning is implementation-defined. [ Note: Typically it is used to pass information through the implementation to an assembler. —end note]

It should be noted that an important development in recent years is that attempting to increase performance by using inline assembly is often counter-productive, unless you know exactly what you are doing. The compiler/optimizer's register usage decisions, its awareness of pipeline and branch-prediction behavior, and so on are almost always sufficient for most uses.

On the other hand, processors in recent years have added CPU-level support for higher-level operations (such as Intel's AES extensions) that can increase performance by several orders of magnitude for specialized applications.
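Notably, these features are usually reached through intrinsics rather than inline asm. As a hedged sketch, one round of AES encryption with the x86 AES-NI intrinsic (compile with AES support enabled, e.g. -maes; this is a single round, not a complete encryption routine):

#include <wmmintrin.h>   // AES-NI intrinsics

// A single aesenc instruction performs one full AES round
// (ShiftRows, SubBytes, MixColumns, then XOR with the round key).
__m128i aes_encrypt_round(__m128i state, __m128i round_key)
{
    return _mm_aesenc_si128(state, round_key);
}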

So:

Legacy feature? Not at all. It is absolutely essential for some requirements.

Educational feature? In an ideal world, only if accompanied by a series of lectures explaining why you'll probably never need it, and, if you ever do need it, how to limit its visible surface area to the rest of your application as much as possible.


