Overhead of Supporting Floating Point Arithmetic Inside the Linux Kernel


The usual answer is that if the kernel does not use floating point, it does not have to save the floating-point registers on entry to the kernel or restore them on exit. This shaves several hundred cycles off the cost of all system calls.

I do not know if anyone has tried to compare this savings against the performance improvements that might be available if the kernel could make indiscriminate use of those registers. Note that you can use them in the kernel if you take proper care, and this is done in contexts where tremendous speed benefits are available, e.g. using SSE instructions to accelerate memcpy and the like. (Look for calls to kernel_fpu_begin in the Linux sources.)
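
As a hedged illustration of that pattern (kernel_fpu_begin()/kernel_fpu_end() are the real x86 kernel APIs; the copy16_sse helper is invented for this example, and the header location varies across kernel versions):

#include <asm/fpu/api.h>   /* kernel_fpu_begin()/kernel_fpu_end() on x86 */

/* Copy 16 bytes through an SSE register. The FPU/SIMD state must be
 * claimed first, because the kernel does not save it on entry. */
static void copy16_sse(void *dst, const void *src)
{
    kernel_fpu_begin();                 /* save user FPU/SIMD state */
    asm volatile("movups (%0), %%xmm0\n\t"
                 "movups %%xmm0, (%1)"
                 : : "r"(src), "r"(dst)
                 : "xmm0", "memory");
    kernel_fpu_end();                   /* restore it */
}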

Use of floating point in the Linux kernel

Because...

  • many programs don't use floating point, or don't use it in any given time slice; and
  • saving the FPU registers and other FPU state takes time; therefore

...an OS kernel may simply turn the FPU off. Presto: no state to save and restore, and therefore faster context switching. (This is all the "mode" in question meant: whether the FPU was enabled.)

If a program then attempts an FPU op, it traps into the kernel; the kernel turns the FPU on, restores any saved state that may already exist, and then returns to re-execute the FPU op.

At context-switch time, the kernel therefore knows whether it actually needs to go through the state-save logic. (And it may turn the FPU off again afterwards.)
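
A user-space sketch may make that lazy-switching logic concrete (an illustrative simulation only, not actual kernel code; every name in it is invented):

#include <string.h>

struct task {
    double fpu_regs[8];  /* saved FPU state for this task       */
    int    used_fpu;     /* has this task ever touched the FPU? */
};

static double       hw_fpu[8];         /* models the hardware registers  */
static struct task *fpu_owner = NULL;  /* task whose state is in the FPU */
static int          fpu_enabled = 0;   /* models the CPU's "FPU on" bit  */

/* The trap handler: runs when a task uses the FPU while it is off. */
static void fpu_trap(struct task *t)
{
    if (fpu_owner && fpu_owner != t)   /* lazily save the old owner */
        memcpy(fpu_owner->fpu_regs, hw_fpu, sizeof hw_fpu);
    if (t->used_fpu)                   /* restore t's state, if any */
        memcpy(hw_fpu, t->fpu_regs, sizeof hw_fpu);
    fpu_owner   = t;
    t->used_fpu = 1;
    fpu_enabled = 1;                   /* return and re-execute the op */
}

/* Context switch: no FPU save at all -- just turn the FPU off again. */
static void context_switch(void)
{
    fpu_enabled = 0;
}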

By the way, I believe the book's explanation for the reason kernels (and not just Linux) avoid FPU ops is ... not perfectly accurate.1

The kernel can trap into itself and does so for many things. (Timers, page faults, device interrupts, others.) The real reason is that the kernel doesn't particularly need FPU ops and also needs to run on architectures without an FPU at all. Therefore, it simply avoids the complexity and runtime required to manage its own FPU context by not doing ops for which there are always other software solutions.

It's interesting to note how often the FPU state would have to be saved if the kernel wanted to use FP: every system call, every interrupt, every switch between kernel threads. Even if there were a need for occasional kernel FP,2 it would probably be faster to do it in software.



1. That is, dead wrong.

2. There are a few cases I know about where kernel software contains a floating point arithmetic implementation. Some architectures implement traditional FPU ops in hardware but leave some complex IEEE FP operations to software. (Think: denormal arithmetic.) When an odd IEEE corner case occurs, the hardware traps to software, which contains a pedantically correct emulation of the ops that can trap.

What are coding conventions for using floating-point in Linux device drivers?

Short answer: Kernel code can use floating point if this use is surrounded by kernel_fpu_begin()/kernel_fpu_end(). These functions handle saving and restoring the FPU context. They also call preempt_disable()/preempt_enable(), which means no sleeping, page faults, etc. in the code between those calls. Google the function names for more information.
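
For instance, a minimal sketch of the convention (the scale_by_1_5 helper is invented; on recent x86 kernels the declarations live in <asm/fpu/api.h>, older ones used <asm/i387.h>):

#include <asm/fpu/api.h>

static unsigned int scale_by_1_5(unsigned int x)
{
    unsigned int result;

    kernel_fpu_begin();   /* saves the FPU context, calls preempt_disable() */
    /* No sleeping and no faulting on pageable memory in here. */
    result = (unsigned int)((float)x * 1.5f);
    kernel_fpu_end();     /* restores the context, calls preempt_enable() */

    return result;
}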

"If I understand correctly, whenever a KM is running, it is using a hardware context (or hardware thread or register set -- whatever you want to call it) that has been preempted from some application thread."

No, a kernel module can run in user context as well (e.g. when userspace calls syscalls on a device provided by the KM). It has, however, no relation to the float issue.

"If you write your KM in C, the compiler will correctly ensure that the general-purpose registers are properly saved and restored (much as in an application), but that doesn't automatically happen with floating-point registers."

That is not because of the compiler, but because of the kernel's context-switching code.

Error while compiling Linux kernel 2.6.39.4

Google for "linux kernel float usage". Floating point in the kernel is a special case; if you can avoid using floating-point types, do so.

How to avoid FPU when given float numbers?

If you want to tell gcc to use a software floating-point library, there's apparently a switch for that, albeit perhaps not a turnkey one in the standard environment:

Using software floating point on x86 linux

In fact, this article suggests that the Linux kernel and its modules are already compiled with -msoft-float:

http://www.linuxsmiths.com/blog/?p=253

That said, @PaulR's suggestion seems most sensible. And if you offer an API which does whatever conversions you like, I don't see why it's any uglier than anything else.

How to #include <math.h> in a kernel source file?

In experts' view, it is not a good approach to shuttle this kind of work between kernel space and user space; do the work either fully in kernel space or fully in user space.

One solution, though, is to use the read() and write() calls on a device file provided by a kernel module to pass information between user space and kernel space.

What's the difference between hard and soft floating point numbers?

Hard floats use an on-chip floating point unit. Soft floats emulate one in software. The difference is speed. It's strange to see both used on the same target architecture, since the chip either has an FPU or doesn't. You can enable soft floating point in GCC with -msoft-float. You may want to recompile your libc to use hardware floating point if you use it.
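
As a rough sketch of what the switch does (the helper names vary by target and ABI, but __addsf3 is the usual libgcc routine for single-precision addition):

/* gcc -S soft.c              -> emits an FPU add instruction
 * gcc -msoft-float -S soft.c -> emits a call such as __addsf3 instead
 *                               (on targets where the switch is supported) */
float add(float a, float b)
{
    return a + b;
}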

Performance comparison of FPU with software emulation

A general answer will obviously be very vague, because performance depends on so many factors.

However, based on my understanding, in processors that do not implement floating point (FP) operations in hardware, a software implementation will typically be 10 to 100 times slower (or even worse, if the implementation is bad) than integer operations, which are always implemented in hardware on CPUs.

The exact performance will depend on a number of factors, such as the features of the integer hardware - some CPUs lack an FPU, but have features in their integer arithmetic that help implement a fast software emulation of FP calculations.

The paper mentioned by njuffa (Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture") supports this. For the Intel XScale processor it lists the following latencies (excerpt):

integer addition or subtraction:  1 cycle
integer multiplication: 2-6 cycles
fp addition (emulated): 34 cycles
fp multiplication (emulated): 35 cycles

So this would result in a factor of about 10-30 between integer and FP arithmetic. The paper also mentions that the GNU implementation (the one the GNU compiler uses by default) is about 10 times slower than that, for a total factor of 100-300.

Finally, note that the above is for the case where the FP emulation is compiled into the program by the compiler. Some operating systems (e.g. Linux and Windows CE) also have an FP emulation in the OS kernel. The advantage is that even code compiled without FP emulation (i.e. using FPU instructions) can run on a processor without an FPU - the kernel will transparently emulate unsupported FPU instructions in software. However, this emulation is even slower (about another factor of 10) than a software emulation compiled into the program, because of the additional overhead. Obviously, this case is only relevant on processor architectures where some processors have an FPU and some do not (such as x86 and ARM).

Note: This answer compares the performance of (emulated) FP operations with integer operations on the same processor. Your question might also be read as asking about the performance of (emulated) FP operations compared to hardware FP operations (it is not clear which you meant). However, the result would be about the same, because when FP is implemented in hardware, it is typically (almost) as fast as integer operations.
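
If you want a rough number for your own target, a micro-benchmark along these lines is one way to get it (a sketch; the ratio is only meaningful when the float loop is actually emulated, e.g. built with -msoft-float or run on an FPU-less chip):

#include <stdio.h>
#include <time.h>

#define N 10000000L

int main(void)
{
    volatile long  ai = 1, bi = 3;
    volatile float af = 1.0f, bf = 3.0f;
    clock_t t0, t1, t2;
    long i;

    t0 = clock();
    for (i = 0; i < N; i++) ai += bi;  /* integer adds: always hardware */
    t1 = clock();
    for (i = 0; i < N; i++) af += bf;  /* float adds: emulated or not   */
    t2 = clock();

    printf("int: %ld ticks, float: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    return 0;
}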

80-bit extended precision floating-point in OCaml

The implementation of such a library is possible outside the compiler, thanks to the FFI support of the language.

The library must be split in two parts: the native OCaml source part and the C runtime part.
The OCaml source must contain the datatype declaration, as well as declarations of all the imported functions. For instance, the basic binary operations would be declared as:

(** basic binary operations on long doubles *)
external add : t -> t -> t = "ml_float80_add"
external sub : t -> t -> t = "ml_float80_sub"
external mul : t -> t -> t = "ml_float80_mul"
external div : t -> t -> t = "ml_float80_div"

In the C code, the ml_float80_add function should be defined, as described in the OCaml manual:

CAMLprim value ml_float80_add(value l, value r){
    float80 rlf = Float80_val(l);
    float80 rrf = Float80_val(r);
    float80 llf = rlf + rrf;
    value res = ml_float80_copy(llf);
    return res;
}

Here we convert the OCaml runtime representations to native C values, apply the binary operator to them, and return a new OCaml value. The ml_float80_copy function allocates that runtime representation.

Likewise, the C implementations of the sub, mul and div functions should be defined there too. One can notice the similarity in signature and implementation of these functions, and abstract it away with C macros:

#define FLOAT80_BIN_OP(OPNAME,OP)                         \
CAMLprim value ml_float80_##OPNAME(value l, value r){     \
    float80 rlf = Float80_val(l);                         \
    float80 rrf = Float80_val(r);                         \
    float80 llf = rlf OP rrf;                             \
    value res = ml_float80_copy(llf);                     \
    return res;                                           \
}

FLOAT80_BIN_OP(add,+);
FLOAT80_BIN_OP(sub,-);
FLOAT80_BIN_OP(mul,*);
FLOAT80_BIN_OP(div,/);

The rest of the OCaml and C modules follows the same pattern.

There are many possibilities as to how to encode the float80 C type as an OCaml value. The simplest choice is to use a string and store the raw long double bytes in it:

type t = string

On the C side, we define the functions to convert back and forth between the OCaml value and the C value:

#include <caml/mlvalues.h>
#include <caml/alloc.h>
#include <caml/misc.h>
#include <caml/memory.h>

#define FLOAT80_SIZE 10 /* 10 bytes */

typedef long double float80;

#define Float80_val(x) *((float80 *)String_val(x))

void float80_copy_str(char *r, const char *l){
    int i;
    for (i = 0; i < FLOAT80_SIZE; i++)
        r[i] = l[i];
}

void store_float80_val(value v, float80 f){
    float80_copy_str(String_val(v), (const char *)&f);
}

CAMLprim value ml_float80_copy(value r, value l){
    float80_copy_str(String_val(r), String_val(l));
    return Val_unit;
}

However, that implementation doesn't bring support for the polymorphic comparison built into OCaml (Pervasives.compare) and a few other features. Using that function on the above float80 type will mislead the comparison into believing that the values are strings, and it will do a lexicographic comparison of their contents.

Supporting these special features is simple enough, though. We redefine the OCaml type as abstract and change the C code to create and handle custom blocks for our float80:

#include <caml/mlvalues.h>
#include <caml/alloc.h>
#include <caml/misc.h>
#include <caml/memory.h>
#include <caml/custom.h>
#include <caml/intext.h>

typedef struct {
    struct custom_operations *ops;
    float80 v;
} float80_s;

#define Float80_val(x) *((float80 *)Data_custom_val(x))

inline int comp(const float80 l, const float80 r){
    return l == r ? 0 : (l < r ? -1 : 1);
}

static int float80_compare(value l, value r){
    const float80 rlf = Float80_val(l);
    const float80 rrf = Float80_val(r);
    const int llf = comp(rlf, rrf);
    return llf;
}

/* other features implementation here */

CAMLexport struct custom_operations float80_ops = {
    "float80", custom_finalize_default, float80_compare, float80_hash,
    float80_serialize, float80_deserialize, custom_compare_ext_default
};

CAMLprim value ml_float80_copy(long double ld){
    value res = caml_alloc_custom(&float80_ops, FLOAT80_SIZE, 0, 1);
    Float80_val(res) = ld;
    return res;
}
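
For completeness, here is one plausible way to fill in the "other features" placeholder above (a sketch: float80_hash hashes through a double approximation, which loses the extra precision but is acceptable for hashing; the block serialization helpers come from <caml/intext.h>):

#include <caml/hash.h>

static intnat float80_hash(value v){
    /* Hash a double approximation of the 80-bit value. */
    return caml_hash_mix_double(0, (double)Float80_val(v));
}

static void float80_serialize(value v, uintnat *wsize_32, uintnat *wsize_64){
    caml_serialize_block_1(Data_custom_val(v), FLOAT80_SIZE);
    *wsize_32 = *wsize_64 = FLOAT80_SIZE;
}

static uintnat float80_deserialize(void *dst){
    caml_deserialize_block_1(dst, FLOAT80_SIZE);
    return FLOAT80_SIZE;
}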

We then propose to build the whole thing using ocamlbuild and a small bash script.

Is there any architecture that uses the same register space for scalar integer and floating point operations?

The Motorola 88100 had a single register file (thirty-one 32-bit entries plus a hardwired zero register) used for floating point and integer values. With 32-bit registers and support for double precision, register pairs had to be used to supply values, significantly constraining the number of double precision values that could be kept in registers.

The follow-on 88110 added thirty-two 80-bit extended registers for additional (and larger) floating point values.

Mitch Alsup, who was involved in Motorola's 88k development, has developed his own load-store ISA (at least partially for didactic reasons) which, if I recall correctly, uses a unified register file.

It should also be noted that the Power ISA (descendant from PowerPC) defines an "Embedded Floating Point Facility" which uses GPRs for floating point values. This reduces core implementation cost and context switch overhead.

One benefit of separate register files is that they provide explicit banking, which reduces the register port count needed in a straightforward, modestly superscalar design. For example, with three read ports on each file, any FP operation (even a three-source-operand FMADD) can start in parallel with any GPR-based operation, as can many common pairs of GPR-based operations; a single unified file would need five read ports just to start an FMADD alongside one other two-source operation. Another factor is that the two files' capacities add up and their widths are independent; this has both advantages and disadvantages. In addition, by coupling storage with operations, a highly distinct coprocessor can be implemented in a more straightforward manner. This was more significant for early microprocessors, given chip size limits, but the UltraSPARC T1 shared a floating-point unit among eight cores, and AMD's Bulldozer shared an FP/SIMD unit between two integer "cores".

A unified register file has some calling convention advantages; values can be passed in the same registers regardless of the type of the values. A unified register file also reduces unusable resources by allowing all registers to be used for all operations.


