Using Software Floating Point on X86 Linux

Unless you want to bootstrap your entire toolchain by hand, you could start with the uClibc toolchain (the i386 version, presumably). As far as I know, soft float is not directly supported for "native" compilation on Debian and derivatives, but it can be used via the "embedded" approach of the uClibc toolchain.

Detecting floating point software emulation

The floating-point unit (FPU) on modern x86 natively works in double precision (in fact, its x87 registers are even wider than double), not float; the "32" in a 32-bit CPU describes the integer register width, not the floating-point width. This is not true, however, if your code takes advantage of vectorized SSE instructions, which perform either 4 single-precision or 2 double-precision operations in parallel.

If not, then the main speed hit from switching your app from float to double will be the increased memory bandwidth.

Use of floating point in the Linux kernel

Because...

  • many programs don't use floating point, or don't use it during any given time slice; and
  • saving the FPU registers and other FPU state takes time; therefore

...an OS kernel may simply turn the FPU off. Presto, no state to save and restore, and therefore faster context-switching. (This is all the "mode" meant: simply whether the FPU was enabled.)

If a program attempts an FPU op, the program will trap into the kernel, the kernel will turn the FPU on, restore any saved state that may already exist, and then return to re-execute the FPU op.

At context switch time, it knows to actually go through the state save logic. (And then it may turn the FPU off again.)

By the way, I believe the book's explanation for the reason kernels (and not just Linux) avoid FPU ops is ... not perfectly accurate.1

The kernel can trap into itself and does so for many things. (Timers, page faults, device interrupts, others.) The real reason is that the kernel doesn't particularly need FPU ops and also needs to run on architectures without an FPU at all. Therefore, it simply avoids the complexity and runtime required to manage its own FPU context by not doing ops for which there are always other software solutions.

It's interesting to note how often the FPU state would have to be saved if the kernel wanted to use FP . . . every system call, every interrupt, every switch between kernel threads. Even if there was a need for occasional kernel FP,2 it would probably be faster to do it in software.



1. That is, dead wrong.

2. There are a few cases I know about where kernel software contains a floating point arithmetic implementation. Some architectures implement traditional FPU ops in hardware but leave some complex IEEE FP operations to software. (Think: denormal arithmetic.) When some odd IEEE corner case happens they trap to software which contains a pedantically correct emulation of the ops that can trap.

How does GCC compile the 80 bit wide 10 byte float __float80 on x86_64?

GCC docs for Additional Floating Types:

ISO/IEC TS 18661-3:2015 defines C support for additional floating types _Floatn and _Floatnx

... GCC does not currently support _Float128x on any systems.

Note the distinction: _Float128 is IEEE binary128, i.e. a true 128-bit float with a huge exponent range, while _Float128x would be an extended format with at least that precision and range. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1691.pdf.


__float80 is obviously the x87 10-byte type. In the x86-64 SysV ABI, it's the same as long double; both have 16-byte alignment in that ABI.

__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.


At first I guessed __float128 might be an extended-precision type using SSE2, perhaps a "double double" format with twice the mantissa width but the same exponent limits as 64-bit double (i.e. less exponent range than __float80).

On i386, x86_64, and ..., __float128 is an alias for _Float128

  • float128 and double-double arithmetic
  • Optimize for fast multiplication but slow addition: FMA and doubledouble
  • double-double implementation resilient to FPU rounding mode

Those describe double-double arithmetic, but that guess turns out to be wrong: on x86, __float128 is not a double-double. It is pure software IEEE binary128 (note the __addtf3 libgcc call in the asm), with the full binary128 exponent range.


Godbolt compiler explorer output for gcc7.3 -O3 (same as gcc4.6; apparently these types aren't new):

//long double add_ld(long double x) { return x+x; }  // same as __float80
__float80 add80(__float80 x) { return x+x; }

fld TBYTE PTR [rsp+8] # arg on the stack
fadd st, st(0)
ret # and returned in st(0)


__float128 add128(__float128 x) { return x+x; }

# IDK why not movapd or better movaps, silly compiler
movdqa xmm1, xmm0 # x arg in xmm0
sub rsp, 8 # align the stack
call __addtf3 # args in xmm0, xmm1
add rsp, 8
ret # return value in xmm0, I assume


int size80 = sizeof(__float80); // 16
int sizeld = sizeof(long double); // 16

int size128 = sizeof(__float128); // 16

So gcc calls a libgcc function for __float128 addition, not inlining an increment to the exponent or anything clever like that.

ARM vs x86 for floating point

Below are my Linpack Benchmark results for PCs via Linux, plus Raspberry Pi and Android devices (I have lots more via Windows). These are based on my 1996 C/C++ conversion for PCs, which was approved by Jack Dongarra, the original author, and is obtainable from:

http://www.netlib.no/netlib/benchmark/linpack-pc.c

This is for a matrix of order 100, in double precision. Results below include some at single precision. Dongarra’s historic results for this and supercomputer varieties are in:

http://netlib.org/benchmark/performance.pdf

This is just one benchmark, and others tell a different story. You can obtain lots more from my site, including source codes and MP varieties (free, with no ads):

http://www.roylongbottom.org.uk/

Linux 32/64 Bit Results

Double Precision 100x100 compiled at 32 and 64 bits

                                Opt   No opt
CPU                    MHz   MFLOPS   MFLOPS

Atom N455 32b Ub      1666      196       94
Atom N455 64b Ub      1666      226       89

Core 2 Mob 32b Ub     1830      983      307

Athlon 64 32b Ub      2211      936      231
Athlon 64 64b Ub      2211     1118      221

Core 2 Duo 32b Ub     2400     1288      404
Core 2 Duo 64b Ub     2400     1577      378

Phenom II 32b Ub      3000     1464      411
Phenom II 64b Ub      3000     1887      411
Phenom II 64b Fe      3000     1872      407

Core i7 930 64b Ub    ****     2265      511

Core i7 4820K 32b Ub  $$$1     2534      988
Core i7 4820K 64b Ub  $$$1     3672      900
Core i7 4820K AVX Ub  $$$12    5413      935

Ub = Ubuntu Linux, Fe = Fedora Linux
**** Rated as 2800 MHz but running at up to
3066 MHz using Turbo Boost
$$$1 Rated as 3700 MHz but running at up to
3900 MHz, using Turbo Boost
$$$12 As $$$1, but compiled with GCC 4.8.2, which
produces AVX SIMD instructions.

######################################################

      Android and Raspberry Pi Versions

Double Precision and Single Precision (SP) 100x100

                              v7/v5      v5
CPU          MHz   Android   MFLOPS  MFLOPS

ARM 926EJ     800  2.2          5.7     5.6
ARM v7-A8     800  2.3.5       80.2
ARM v7-A9     800  2.3.4      101.4    10.6
ARM v7-A9   1300a  4.1.2      151.1    17.1
ARM v7-A9    1500  4.0.3      171.4
ARM v7-A9   1500a  4.0.3      155.5    16.9
ARM v7-A9    1400  4.0.4      184.4    19.9
ARM v7-A9    1600  4.0.3      196.5
ARM v7-A15  2000b  4.2.2      459.2    28.8

                              v7 SP    Java
CPU          MHz   Android   MFLOPS  MFLOPS

ARM 926EJ     800  2.2          9.6     2.3
ARM v7-A9     800  2.3.4      129.1    33.3
ARM v7-A9   1300a  4.1.2      201.3    56.4
ARM v7-A9   1500a  4.0.3      204.6    56.9
ARM v7-A9    1400  4.0.4      235.5    57.0
ARM v7-A15  2000b  4.2.2      803.0   143.1

Atom Ax86    1666  2.2.1       15.7
Core 2 Ax86  2400  2.2.1       53.3

Raspberry Pi                     DP      SP
CPU          MHz   Linux     MFLOPS  MFLOPS

ARM 1176      700  3.6.11        42      58
ARM 1176     1000  3.6.11        68      88

                             NEON SP
CPU          MHz   Android    MFLOPS

ARM v7-A9     800  2.3.4       255.8
ARM v7-A9   1300a  4.1.2       376.0
ARM v7-A9   1500a  4.0.3       382.5
ARM v7-A9    1400  4.0.4       454.2
ARM v7-A15  2000b  4.2.2      1334.9

Why am I able to perform floating point operations inside a Linux kernel module?


I thought you couldn't perform floating point operations in the Linux kernel

You can't safely: failure to use kernel_fpu_begin() / kernel_fpu_end() doesn't mean FPU instructions will fault (not on x86 at least).

Instead it will silently corrupt user-space's FPU state. This is bad; don't do that.

The compiler doesn't know what kernel_fpu_begin() means, so it can't check / warn about code that compiles to FPU instructions outside of FPU-begin regions.

There may be a debug mode where the kernel does disable SSE, x87, and MMX instructions outside of kernel_fpu_begin / end regions, but that would be slower and isn't done by default.

It is possible, though: setting CR0::TS = 1 makes x87 instructions fault, so lazy FPU context switching is possible, and there are other bits for SSE and AVX.


There are many ways for buggy kernel code to cause serious problems. This is just one of many. In C, you pretty much always know when you're using floating point (unless a typo results in a 1. constant or something in a context that actually compiles).


Why is the FP architectural state different from integer?

Linux has to save/restore the integer state any time it enters/exits the kernel. All code needs to use integer registers (except, perhaps, a giant straight-line block of FPU computation that ends with a jmp instead of a ret, since ret modifies rsp).

But kernel code avoids FPU generally, so Linux leaves the FPU state unsaved on entry from a system call, only saving before an actual context switch to a different user-space process or on kernel_fpu_begin. Otherwise, it's common to return to the same user-space process on the same core, so FPU state doesn't need to be restored because the kernel didn't touch it. (And this is where corruption would happen if a kernel task actually did modify the FPU state. I think this goes both ways: user-space could also corrupt your FPU state).

The integer state is fairly small: only 16x 64-bit registers, plus RFLAGS and segment regs. FPU state is more than twice as large even without AVX: 8x 80-bit x87 registers, and 16x XMM, 16x YMM, or 32x ZMM registers (plus MXCSR, and the x87 status and control words). Also the MPX bnd0-bnd3 registers are lumped in with "FPU" state. At this point "FPU state" just means all non-integer registers. On my Skylake, dmesg says x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.

See Understanding FPU usage in linux kernel; modern Linux no longer does lazy FPU context switching at context-switch time by default. The only remaining "laziness" is skipping save/restore across plain kernel/user transitions. (But that article explains what lazy switching was.)

Most processes use SSE for copying/zeroing small blocks of memory in compiler-generated code, and most library string/memcpy/memset implementations use SSE/SSE2. Also, hardware-supported optimized save/restore is a thing now (xsaveopt / xrstor), so the "eager" save/restore may actually do less work if some or all FP registers haven't been touched: e.g. saving just the low 128 bits of the YMM registers if they were zeroed with vzeroupper so the CPU knows they're clean, and marking that fact with just one bit in the save format.

With "eager" context switching, FPU instructions stay enabled all the time, so bad kernel code can corrupt them at any time.

Floating-point constant without using floating-point registers (Linux kernel module)

How about using something like this:

union val {
    float fval;
    int ival;
};

static const union val my_val1 = { .fval = 3.8 * 0.98 / 1000.0 };

int *vp = whatever;
*vp = my_val1.ival;

The use of static const ought to be enough to make the compiler fold the floating-point expression at compile time, so no floating-point calculations happen at run time.


