Why Does C++ Code For Testing the Collatz Conjecture Run Faster Than Hand-Written Assembly

C vs Haskell Collatz conjecture speed comparison

The big problem with your Haskell code is that you are dividing, which the C version does not.

Yes, you wrote n % 2 and n / 2, but the compiler replaces those with shifts and bitwise AND. GHC has unfortunately not yet been taught to do that.

If you do the substitution yourself:

module Main where

import System.Environment (getArgs)
import Data.Int (Int32, Int64)
import Data.Bits

main :: IO ()
main = do
    arg <- getArgs
    print $ maxCol 0 (read (head arg) :: Int64)

col :: Int64 -> Int32
col x = col' x 0

col' :: Int64 -> Int32 -> Int32
col' 1 n = n
col' x n
    | x .&. 1 == 0 = col' (x `shiftR` 1) (n + 1)
    | otherwise    = col' (3 * x + 1) (n + 1)

maxCol :: Int32 -> Int64 -> Int32
maxCol maxS 2 = maxS
maxCol maxS n
    | s > maxS  = maxCol s (n - 1)
    | otherwise = maxCol maxS (n - 1)
  where s = col n

with a 64-bit GHC you get comparable speed (0.35s vs C's 0.32s on my box for a limit of 1000000). If you compile using the LLVM backend, you don't even need to replace the % 2 and / 2 with bitwise operations, LLVM does that for you (but it produces slower code, 0.4s, for your original Haskell source, surprisingly - normally, LLVM is not worse than the native code generator at loop optimisation).

With a 32-bit GHC, you won't get comparable speed, since with those, the primitive operations on 64-bit integers are implemented through C calls - there never was enough demand for fast 64-bit operations on 32-bit systems for them to be implemented as primops; the few people working on GHC spent their time doing other, more important, things.

TL;DR: Is Haskell code quick to write and simple to maintain only for computationally simple tasks, and does it lose this characteristic when performance is crucial?

That depends. You must have some idea of what sort of code GHC generates from what sort of input, and you must avoid some performance traps. With a bit of practice, it is quite easy to get within, say, 2× the speed of gcc -O3 for such tasks.

Which is faster *in practice*: decent C code, or decent hand-written assembler?

To me it looks like both answers given so far are correct. The answer depends, among other things, on the particular CPU architecture we're talking about. The more complex the architecture is, the harder it is to write efficient ASM code by hand.

On one end of the spectrum are CISC cores such as x86. They have multiple execution units, long pipelines, variable instruction latencies, etc. In many cases ASM code that looks "clean" or "optimal" to a human isn't in fact optimal for the CPU and can be improved by using instructions or techniques from dark corners of the processor manuals. Compilers "know" about this and can produce decently optimised code. True, in many cases the emitted code can be improved by a skilful human, but with the right compiler and optimisation settings the code is often already very good. Also, with C code at hand you won't need to re-optimise it manually for every new CPU generation (yes, optimisations often depend on the particular CPU family, not only on the instruction set), so writing in C is a way of "future-proofing" your code.

At the other end of the spectrum are simple cores, such as the 8051 or other simple 8-bit controllers. They have much simpler scheduling semantics and smaller instruction sets. Compilers still do a decent job of optimisation here, but it's also much simpler to write decent ASM code by hand (or to fix performance issues in the emitted code).

Why is this Ruby code far faster than the equivalent C++ code?

You're overflowing int, so the loop isn't terminating. Use int64_t instead of int in your C++ code.
You'll probably need to include <cstdint> (or <stdint.h>) for that.
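
A rough sketch of the kind of fix meant here (the helper name and loop body are mine, not the original question's code):

#include <cstdint>

// Hypothetical chain-length helper: with a plain int, 3*n + 1 overflows
// (undefined behaviour) for some starting values, so the loop may never
// reach 1. int64_t is wide enough for the intermediate values involved here.
std::int64_t collatz_length(std::int64_t n)
{
    std::int64_t steps = 0;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3 * n + 1;
        ++steps;
    }
    return steps;
}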

When is assembly faster than C?

Here is a real world example: Fixed point multiplies on old compilers.

These don't only come in handy on devices without floating point; they shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has a 23-bit mantissa, and it's harder to predict the precision loss). That is, uniform absolute precision over the entire range, instead of the close-to-uniform relative precision of float.
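
A tiny demonstration of that difference, assuming a 16.16 fixed-point format (the values and output format are mine):

#include <cmath>
#include <cstdio>

// 16.16 fixed point: the representable step is 2^-16 everywhere in the range.
// float: the step (ULP) is relative, roughly value * 2^-23, so it grows with
// the magnitude of the value.
int main()
{
    const float values[] = { 1.0f, 1024.0f, 1048576.0f };
    for (float v : values) {
        float float_step = std::nextafter(v, 2.0f * v) - v;  // ULP at v
        std::printf("value %10.0f   fixed step %.8f   float step %.8f\n",
                    v, 1.0 / 65536.0, float_step);
    }
    return 0;
}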


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

  • Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems (a sketch of the __int128 route follows this list).
  • _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
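
A minimal sketch of the __int128 route mentioned above (the function name is mine; unsigned __int128 is a GCC/Clang extension, not standard C or C++):

#include <cstdint>

// High 64 bits of a full 64x64 -> 128-bit multiply. On a 64-bit target,
// GCC and Clang compile this to a single widening multiply instruction.
std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b)
{
    return (std::uint64_t) (((unsigned __int128) a * b) >> 64);
}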

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
    long long a_long = a;            // cast to 64 bit.

    long long product = a_long * b;  // perform multiplication

    return (int) (product >> 16);    // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine, although the x86 can do such shifts in hardware.

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, but registers must be preserved across the function calls, and it hurts inlining and loop unrolling as well.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics, you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a, b), 16);
}

The performance difference for fixed-point divides is even bigger. I had improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of lines of asm.


Using Visual C++ 2013 gives the same assembly code for both versions.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or, on x86, if you're only targeting hardware with SSE4.2, _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
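
A quick sketch of both portable options (the wrapper names are mine):

#include <bitset>
#include <cstdint>

// GNU C builtin: GCC and Clang emit a popcnt instruction when the target
// supports it, otherwise a software fallback.
int popcount_builtin(std::uint32_t x)
{
    return __builtin_popcount(x);
}

// Standard C++: std::bitset::count() compiles to the same thing on
// mainstream implementations, with no compiler extension needed.
int popcount_bitset(std::uint32_t x)
{
    return static_cast<int>(std::bitset<32>(x).count());
}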

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
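
For example, a sketch of a byte swap in plain code versus the GNU builtin (the names are mine; __builtin_bswap32 is a GCC/Clang extension):

#include <cstdint>

// Plain shift-and-mask byte swap; recent GCC and Clang typically recognize
// this pattern and compile it to a single bswap (x86) or rev (ARM) instruction.
std::uint32_t byteswap32(std::uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

// The GNU extension states the intent explicitly and doesn't rely on
// pattern recognition.
std::uint32_t byteswap32_builtin(std::uint32_t x)
{
    return __builtin_bswap32(x);
}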


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.
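
For the simple case, a hand-vectorized SSE2 sketch of that loop might look like this (function names are mine; for a loop this trivial the compiler's auto-vectorized output is usually just as good, which is exactly the point):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Scalar version: compilers auto-vectorize this without any help.
void scale_add_scalar(double *dst, const double *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i] * 10.0;
}

// Manual SSE2 version: two doubles per iteration, plus a scalar tail.
void scale_add_sse2(double *dst, const double *src, std::size_t n)
{
    const __m128d ten = _mm_set1_pd(10.0);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d s = _mm_loadu_pd(src + i);
        __m128d d = _mm_loadu_pd(dst + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(d, _mm_mul_pd(s, ten)));
    }
    for (; i < n; ++i)  // handle the leftover element, if any
        dst[i] += src[i] * 10.0;
}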

Is it possible to get better performance than a compiler by fully comprehending modern PC architecture?

Sometimes a human can produce better code, if some of these requirements are met:

  1. The human has specific knowledge of the targeted architecture.
  2. The human knows all the tricks the compiler uses (e.g. a left shift instead of a multiplication).
  3. The human also knows a lot about assembly and processors: pipeline stalls, cache misses, and so on.
  4. The human has a lot of time available, for non-trivial programs.

Like, what if he writes the code in 100% assembly, focusing on the architecture?

The program will be really fast on that CPU, but you would have to rewrite it from scratch for every new CPU (say you tuned it for processor 1, which has a fast shr instruction, but processor 2 has a faster div instruction).
Furthermore, the development time will be significantly longer (up to 20× as long), which means higher costs.

And if it does make a difference, is it worthwhile?

Only in a small set of applications, such as code for a microcontroller, or when you really need raw performance (data processing that can't be offloaded to the GPU).

For more information:
When is assembly faster than C?

BUT:
First try other algorithms, for example the Coppersmith–Winograd algorithm instead of the naive algorithm for matrix multiplication. Only when every other possibility has been exhausted should you resort to assembly; otherwise you can end up in a maintenance nightmare quite quickly.

GHC Optimization: Collatz conjecture

Some problems with your (mutable array) code:

  • You use a fold to find the maximal chain length; for that, the array has to be converted to an association list, which takes time and allocation the C++ version doesn't need.
  • You use even and div for testing parity and dividing by 2, respectively. These are slow. g++ optimises both operations to the faster bit operations (on platforms where that is supposedly faster, at least), but GHC doesn't do these low-level optimisations (yet), so for the time being they have to be done by hand.
  • You use readArray and writeArray. The extra bounds-checking, which isn't done in the C++ code, also takes time; once the other problems are dealt with, it amounts to a significant portion of the running time (ca. 25% on my box), since the algorithm does a lot of reads and writes.

Incorporating that into the implementation, I get

import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits

collatz_array :: ST s (STUArray s Int Int)
collatz_array = do
    let upper = 10000000
    arr <- newArray (0,upper) 0
    unsafeWrite arr 2 1
    let check i
          | upper < i = return arr
          | i .&. 1 == 0 = do
              l <- unsafeRead arr (i `shiftR` 1)
              unsafeWrite arr i (l+1)
              check (i+1)
          | otherwise = do
              let j = (3*i+1) `shiftR` 1
                  find k l
                    | upper < k = find (next k) $! l+1
                    | k < i = do
                        m <- unsafeRead arr k
                        return (m+l)
                    | otherwise = do
                        m <- unsafeRead arr k
                        if m == 0
                          then do
                              n <- find (next k) 1
                              unsafeWrite arr k n
                              return (n+l)
                          else return (m+l)
                    where
                      next h
                        | h .&. 1 == 0 = h `shiftR` 1
                        | otherwise = (3*h+1) `shiftR` 1
              l <- find j 1
              unsafeWrite arr i l
              check (i+1)
    check 3

collatz_max :: ST s (Int,Int)
collatz_max = do
    car <- collatz_array
    (_,upper) <- getBounds car
    let find w m i
          | upper < i = return (w,m)
          | otherwise = do
              l <- unsafeRead car i
              if m < l
                then find i l (i+1)
                else find w m (i+1)
    find 1 0 2

main :: IO ()
main = print (runST collatz_max)

And the timings (both for 10 million):

$ time ./cccoll
8400511 429

real 0m0.210s
user 0m0.200s
sys 0m0.009s
$ time ./stcoll
(8400511,429)

real 0m0.341s
user 0m0.307s
sys 0m0.033s

which doesn't look too bad.

Important note: That code only works on 64-bit GHC (so, in particular, on Windows you need ghc-7.6.1 or later; previous GHCs were 32-bit even on 64-bit Windows), since intermediate chain elements exceed the 32-bit range. On 32-bit systems, one would have to use Integer or a 64-bit integer type (Int64 or Word64) for following the chains, at a drastic performance cost, since the primitive 64-bit operations (arithmetic and shifts) are implemented as foreign calls to C functions in 32-bit GHCs (fast foreign calls, but still much slower than direct machine ops).

Is there a way to optimise the Collatz conjecture into a branchless algorithm?

Sure. You need the loop, of course, but the work inside can be done like this:

n /= (n & -n);       // remove all trailing 0s
while (n > 1) {
    n = 3*n + 1;
    n /= (n & -n);   // remove all trailing 0s
}

It also helps that this technique does all the divisions by 2 at once, instead of requiring a separate iteration for each of them.
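
A self-contained sketch of the same idea (the helper name, starting value and step counter are mine):

#include <cstdint>
#include <cstdio>

// n & -n isolates the lowest set bit, so dividing by it strips every
// trailing zero bit in one operation, e.g. 40 (0b101000) -> 5 (0b101).
static std::uint64_t odd_part(std::uint64_t n)
{
    return n / (n & -n);
}

int main()
{
    std::uint64_t n = 27;      // example starting value
    unsigned odd_steps = 0;    // number of 3n+1 steps taken
    n = odd_part(n);
    while (n > 1) {
        n = 3 * n + 1;
        n = odd_part(n);
        ++odd_steps;
    }
    std::printf("27 reached 1 after %u odd steps\n", odd_steps);
    return 0;
}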


