Using Software Floating Point on X86 Linux

Unless you want to bootstrap your entire toolchain by hand, you could start with the uClibc toolchain (the i386 version, presumably). As far as I know, soft float is not directly supported for "native" compilation on Debian and derivatives, but it can be used via the "embedded" approach of the uClibc toolchain.

Detecting floating point software emulation

The floating-point unit (FPU) on modern x86 natively works in double precision (in fact, its x87 registers are even wider than double), not float; the "32" in a 32-bit CPU describes the integer register width, not the floating-point width. This is not true, however, if your code takes advantage of vectorized SSE instructions, which perform either 4 single-precision or 2 double-precision operations in parallel.

If not, then the main speed hit from switching your app from float to double will be the increased memory bandwidth.

Use of floating point in the Linux kernel

Because...

  • many programs don't use floating point, or don't use it during any given time slice; and
  • saving the FPU registers and other FPU state takes time; therefore

...an OS kernel may simply turn the FPU off. Presto, no state to save and restore, and therefore faster context-switching. (This is all the "mode" meant: simply whether the FPU was enabled.)

If a program attempts an FPU op, the program will trap into the kernel, the kernel will turn the FPU on, restore any saved state that may already exist, and then return to re-execute the FPU op.

At context switch time, it knows to actually go through the state save logic. (And then it may turn the FPU off again.)

By the way, I believe the book's explanation for the reason kernels (and not just Linux) avoid FPU ops is ... not perfectly accurate.1

The kernel can trap into itself and does so for many things. (Timers, page faults, device interrupts, others.) The real reason is that the kernel doesn't particularly need FPU ops and also needs to run on architectures without an FPU at all. Therefore, it simply avoids the complexity and runtime required to manage its own FPU context by not doing ops for which there are always other software solutions.

It's interesting to note how often the FPU state would have to be saved if the kernel wanted to use FP . . . every system call, every interrupt, every switch between kernel threads. Even if there was a need for occasional kernel FP,2 it would probably be faster to do it in software.



1. That is, dead wrong.

2. There are a few cases I know about where kernel software contains a floating point arithmetic implementation. Some architectures implement traditional FPU ops in hardware but leave some complex IEEE FP operations to software. (Think: denormal arithmetic.) When some odd IEEE corner case happens they trap to software which contains a pedantically correct emulation of the ops that can trap.

How does GCC compile the 80 bit wide 10 byte float __float80 on x86_64?

GCC docs for Additional Floating Types:

ISO/IEC TS 18661-3:2015 defines C support for additional floating types _Floatn and _Floatnx

... GCC does not currently support _Float128x on any systems.

Note the distinction: _Float128 is IEEE binary128, i.e. a true 128-bit float with a huge exponent range, while _Float128x would be an extended format with at least that precision and range. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1691.pdf.


__float80 is obviously the x87 10-byte type. In the x86-64 SysV ABI, it's the same as long double; both have 16-byte alignment in that ABI.

__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.


At first I guessed __float128 might be an extended-precision type using SSE2, perhaps a "double double" format with twice the mantissa width but the same exponent limits as 64-bit double (i.e. less exponent range than __float80).

On i386, x86_64, and ..., __float128 is an alias for _Float128

  • float128 and double-double arithmetic
  • Optimize for fast multiplication but slow addition: FMA and doubledouble
  • double-double implementation resilient to FPU rounding mode

Those describe double-double arithmetic, but that guess turns out to be wrong: on x86, __float128 is not a double-double. It is pure software IEEE binary128 (note the __addtf3 libgcc call in the asm), with the full binary128 exponent range.


Godbolt compiler explorer output for gcc7.3 -O3 (same as gcc4.6; apparently these types aren't new):

//long double add_ld(long double x) { return x+x; }  // same as __float80
__float80 add80(__float80 x) { return x+x; }

fld TBYTE PTR [rsp+8] # arg on the stack
fadd st, st(0)
ret # and returned in st(0)


__float128 add128(__float128 x) { return x+x; }

# IDK why not movapd or better movaps, silly compiler
movdqa xmm1, xmm0 # x arg in xmm0
sub rsp, 8 # align the stack
call __addtf3 # args in xmm0, xmm1
add rsp, 8
ret # return value in xmm0, I assume


int size80 = sizeof(__float80); // 16
int sizeld = sizeof(long double); // 16

int size128 = sizeof(__float128); // 16

So gcc calls a libgcc function for __float128 addition, not inlining an increment to the exponent or anything clever like that.

ARM vs x86 for floating point

Below are my Linpack Benchmark results for PCs via Linux, plus Raspberry Pi and Android devices (I have lots more via Windows). These are based on my 1996 C/C++ conversion for PCs, which was approved by Jack Dongarra, the original author, and is obtainable from:

http://www.netlib.no/netlib/benchmark/linpack-pc.c

This is for a matrix of order 100, in double precision. Results below include some at single precision. Dongarra’s historic results for this and supercomputer varieties are in:

http://netlib.org/benchmark/performance.pdf

This is just one benchmark, and others tell a different story. You can obtain lots more from my site, including source codes and MP varieties (free, with no ads):

http://www.roylongbottom.org.uk/

Linux 32/64 Bit Results

Double Precision 100x100 compiled at 32 and 64 bits

                                Opt   No opt
CPU                    MHz   MFLOPS   MFLOPS

Atom N455 32b Ub      1666      196       94
Atom N455 64b Ub      1666      226       89

Core 2 Mob 32b Ub     1830      983      307

Athlon 64 32b Ub      2211      936      231
Athlon 64 64b Ub      2211     1118      221

Core 2 Duo 32b Ub     2400     1288      404
Core 2 Duo 64b Ub     2400     1577      378

Phenom II 32b Ub      3000     1464      411
Phenom II 64b Ub      3000     1887      411
Phenom II 64b Fe      3000     1872      407

Core i7 930 64b Ub    ****     2265      511

Core i7 4820K 32b Ub  $$$1     2534      988
Core i7 4820K 64b Ub  $$$1     3672      900
Core i7 4820K AVX Ub  $$$12    5413      935

Ub = Ubuntu Linux, Fe = Fedora Linux
**** Rated as 2800 MHz but running at up to
3066 MHz using Turbo Boost
$$$1 Rated as 3700 MHz but running at up to
3900 MHz, using Turbo Boost
$$$12 As $$$1, but compiled with GCC 4.8.2, which
produces AVX SIMD instructions.

######################################################

      Android and Raspberry Pi Versions

Double Precision and Single Precision (SP) 100x100

                              v7/v5      v5
CPU          MHz   Android   MFLOPS  MFLOPS

ARM 926EJ     800  2.2          5.7     5.6
ARM v7-A8     800  2.3.5       80.2
ARM v7-A9     800  2.3.4      101.4    10.6
ARM v7-A9   1300a  4.1.2      151.1    17.1
ARM v7-A9    1500  4.0.3      171.4
ARM v7-A9   1500a  4.0.3      155.5    16.9
ARM v7-A9    1400  4.0.4      184.4    19.9
ARM v7-A9    1600  4.0.3      196.5
ARM v7-A15  2000b  4.2.2      459.2    28.8

                              v7 SP    Java
CPU          MHz   Android   MFLOPS  MFLOPS

ARM 926EJ     800  2.2          9.6     2.3
ARM v7-A9     800  2.3.4      129.1    33.3
ARM v7-A9   1300a  4.1.2      201.3    56.4
ARM v7-A9   1500a  4.0.3      204.6    56.9
ARM v7-A9    1400  4.0.4      235.5    57.0
ARM v7-A15  2000b  4.2.2      803.0   143.1

Atom Ax86    1666  2.2.1       15.7
Core 2 Ax86  2400  2.2.1       53.3

Raspberry Pi                     DP      SP
CPU          MHz   Linux     MFLOPS  MFLOPS

ARM 1176      700  3.6.11        42      58
ARM 1176     1000  3.6.11        68      88

                             NEON SP
CPU          MHz   Android    MFLOPS

ARM v7-A9     800  2.3.4       255.8
ARM v7-A9   1300a  4.1.2       376.0
ARM v7-A9   1500a  4.0.3       382.5
ARM v7-A9    1400  4.0.4       454.2
ARM v7-A15  2000b  4.2.2      1334.9

Why am I able to perform floating point operations inside a Linux kernel module?


I thought you couldn't perform floating point operations in the Linux kernel

You can't safely: failure to use kernel_fpu_begin() / kernel_fpu_end() doesn't mean FPU instructions will fault (not on x86 at least).

Instead it will silently corrupt user-space's FPU state. This is bad; don't do that.

The compiler doesn't know what kernel_fpu_begin() means, so it can't check / warn about code that compiles to FPU instructions outside of FPU-begin regions.

There may be a debug mode where the kernel does disable SSE, x87, and MMX instructions outside of kernel_fpu_begin / end regions, but that would be slower and isn't done by default.

It is possible, though: setting CR0::TS = 1 makes x87 instructions fault, so lazy FPU context switching is possible, and there are other bits for SSE and AVX.


There are many ways for buggy kernel code to cause serious problems. This is just one of many. In C, you pretty much always know when you're using floating point (unless a typo results in a 1. constant or something in a context that actually compiles).


Why is the FP architectural state different from integer?

Linux has to save/restore the integer state any time it enters/exits the kernel. All code needs to use integer registers (except, perhaps, a giant straight-line block of FPU computation that ends with a jmp instead of a ret, since ret modifies rsp).

But kernel code avoids FPU generally, so Linux leaves the FPU state unsaved on entry from a system call, only saving before an actual context switch to a different user-space process or on kernel_fpu_begin. Otherwise, it's common to return to the same user-space process on the same core, so FPU state doesn't need to be restored because the kernel didn't touch it. (And this is where corruption would happen if a kernel task actually did modify the FPU state. I think this goes both ways: user-space could also corrupt your FPU state).

The integer state is fairly small: only 16x 64-bit registers, plus RFLAGS and segment regs. FPU state is more than twice as large even without AVX: 8x 80-bit x87 registers, and 16x XMM, 16x YMM, or 32x ZMM registers (plus MXCSR, and the x87 status and control words). Also the MPX bnd0-bnd3 registers are lumped in with "FPU" state. At this point "FPU state" just means all non-integer registers. On my Skylake, dmesg says x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.

See Understanding FPU usage in linux kernel; modern Linux no longer does lazy FPU context switching at context-switch time by default. The only remaining "laziness" is skipping save/restore across plain kernel/user transitions. (But that article explains what lazy switching was.)

Most processes use SSE for copying/zeroing small blocks of memory in compiler-generated code, and most library string/memcpy/memset implementations use SSE/SSE2. Also, hardware-supported optimized save/restore is a thing now (xsaveopt / xrstor), so the "eager" save/restore may actually do less work if some or all FP registers haven't been touched: e.g. saving just the low 128 bits of the YMM registers if they were zeroed with vzeroupper so the CPU knows they're clean, and marking that fact with just one bit in the save format.

With "eager" context switching, FPU instructions stay enabled all the time, so bad kernel code can corrupt them at any time.

Floating-point constant without using floating-point registers (Linux kernel module)

How about using something like this:

union val {
    float fval;
    int ival;
};

static const union val my_val1 = { .fval = 3.8 * 0.98 / 1000.0 };

int *vp = whatever;
*vp = my_val1.ival;

The use of static const ought to be enough to make the compiler fold the floating-point expression at compile time, so no floating-point calculations happen at run time.


