Fixed-Size Floating Point Types

Nothing like this exists in the C or C++ standards at present. In fact, there isn't even a guarantee that float will be a binary floating-point format at all.

Some compilers guarantee that float is the IEEE-754 32-bit binary format; some do not. In practice, float is the IEEE-754 single type on most non-embedded platforms, though the usual caveats about some compilers evaluating expressions in a wider format apply.

There is a working group discussing adding C language bindings for the 2008 revision of IEEE-754, which could consider recommending that such a typedef be added. If this were added to C, I expect the C++ standard would follow suit... eventually.

Is there a type with fixed size of two bytes for floating points in C?

There is no such thing in the C standard. Some compilers do have __fp16.

You can use Q numbers (fixed point), but these are limited to a fixed range.

If you really need floating point, with an exponent, then you should implement IEEE 754 half precision.

Regular integer arithmetic works on Q numbers. For half precision you have to write your own arithmetic, unless your compiler supports it.

Or go open source.
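As a rough illustration of the Q-number route, here is a minimal Q8.8 sketch in C. The type name q8_8_t and the helper functions are invented for this example and do not come from any library. Addition is plain integer addition; multiplication needs one extra rescaling step, and nothing here checks for overflow.

```c
#include <stdint.h>

/* Q8.8: a 16-bit signed value with 8 integer bits and 8 fractional bits.
   All names below are hypothetical, chosen only for this sketch. */
typedef int16_t q8_8_t;

#define Q8_8_ONE 256  /* 1.0 in Q8.8, i.e. 2^8 */

/* Convert from double, rounding to nearest.
   The caller must keep x inside roughly [-128, 128). */
q8_8_t q8_8_from_double(double x)
{
    return (q8_8_t)(x * Q8_8_ONE + (x >= 0 ? 0.5 : -0.5));
}

double q8_8_to_double(q8_8_t q)
{
    return (double)q / Q8_8_ONE;
}

/* Addition is ordinary integer addition (no overflow check). */
q8_8_t q8_8_add(q8_8_t a, q8_8_t b)
{
    return (q8_8_t)(a + b);
}

/* Multiplication: widen to 32 bits, multiply, then divide out the
   extra factor of 2^8 that the product picked up. */
q8_8_t q8_8_mul(q8_8_t a, q8_8_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return (q8_8_t)(wide / Q8_8_ONE);
}
```

For instance, q8_8_mul(q8_8_from_double(1.5), q8_8_from_double(2.0)) yields the Q8.8 encoding of 3.0. The range limit mentioned above shows up directly: anything outside about ±128 does not fit in this format.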

Where are the fixed width floating types?

The C standard does not define fixed-width floating point types

  • C floating point (FP) goals are designed to embrace variations and many implementations.

  • MISRA FP goals are to restrict variety.

Fixed size FP types do not result in uniform bit encoding nor other consistent FP properties. They have limited usefulness in C - hence they are not part of the C standard or library.


Fall-back

Code could use the typedefs below with a _Static_assert (since C11) or a C99 substitute.

#include <limits.h>  /* CHAR_BIT */

typedef float       float32_t;
typedef double      float64_t;
typedef long double float128_t;

/* Each assertion checks the width of the type behind the matching typedef. */
_Static_assert(sizeof(float32_t)*CHAR_BIT == 32, "float 32");
_Static_assert(sizeof(float64_t)*CHAR_BIT == 64, "float 64");
_Static_assert(sizeof(float128_t)*CHAR_BIT == 128, "float 128");

Further notes

A compliant C implementation may not have all of the 32-, 64-, and 128-bit FP types, and so may be unable to define all of float32_t, float64_t, float128_t.

Two different compliant C implementations may each have a 32-bit FP type, but with different encodings. Compare float32 vs. CCSI, which differ in range, precision and sub-normal support.

Two different compliant C implementations may have 32-bit FP types with the same encoding but different endianness, even if their integer endianness agrees.

Rule 6.3 (advisory): "typedefs that indicate size and signedness should be used in place of the basic numerical types." That goal "helps to clarify the size of the storage" and not much else.

Rule 1.5 (advisory): "Floating-point implementations should comply with a defined floating-point standard." This is particularly difficult to achieve. Even if an implementation uses the same FP encoding as IEEE 754, C allows the operations enough implementation-defined behavior to differ from IEEE 754.

Ideally, in C, an implementation that conforms to IEEE 754 defines __STDC_IEC_559__. Yet proving and maintaining conformity is challenging enough that an implementation may forgo defining __STDC_IEC_559__ as it may only be 99.999% conforming.
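A small sketch of how code might query that macro. The function name implementation_claims_iec559 is made up for this example. Note what it does and does not tell you: a false result only means the implementation makes no Annex F claim, not that its float and double are something other than IEEE 754 formats.

```c
#include <stdbool.h>

/* Returns true when the implementation claims Annex F
   (IEC 60559 / IEEE 754) conformance by defining __STDC_IEC_559__.
   The function name is hypothetical. */
bool implementation_claims_iec559(void)
{
#ifdef __STDC_IEC_559__
    return true;
#else
    return false;
#endif
}
```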

Fixed-width Floating-Point Numbers in C/C++

According to the current C99 draft standard, annex F, that should be double. Of course, this is assuming your compilers meet that part of the standard.

For C++, I've checked the 0x draft and a draft for the 1998 version of the standard, but neither seems to specify anything about representation the way that part of the C99 standard does. The closest thing is a bool in numeric_limits that specifies whether IEEE 754/IEC 559 is used on that platform, as Josh Kelley mentions.

Very few platforms do not support IEEE 754, though - it generally does not pay off to design another floating-point format since IEEE 754 is well-defined and works quite nicely - and if that is supported, then it is a reasonable assumption that double is indeed 64 bits (IEEE 754-1985 calls that format double-precision, after all, so it makes sense).

On the off chance that double isn't double-precision, build in a sanity check so users can report it and you can handle that platform separately. If the platform doesn't support IEEE 754, you're not going to get that representation anyway unless you implement it yourself.
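One possible shape for such a sanity check, assuming a C11 compiler: compare the <float.h> parameters of double against those of IEEE 754 binary64 and fail the build with a readable message if they differ. The macro name DOUBLE_LOOKS_LIKE_BINARY64 is invented here. The check covers radix, precision, exponent range and storage width; it says nothing about byte order or about how operations are rounded.

```c
#include <float.h>
#include <limits.h>

/* True when double has the parameters of IEEE 754 binary64:
   radix 2, 53-bit significand, the binary64 exponent range,
   and 64 bits of storage. The macro name is hypothetical. */
#define DOUBLE_LOOKS_LIKE_BINARY64                         \
    (FLT_RADIX == 2 && DBL_MANT_DIG == 53 &&               \
     DBL_MAX_EXP == 1024 && DBL_MIN_EXP == -1021 &&        \
     sizeof(double) * CHAR_BIT == 64)

/* Stops compilation on a platform where the assumption fails,
   so the problem is reported at build time instead of at run time. */
_Static_assert(DOUBLE_LOOKS_LIKE_BINARY64,
               "double is not IEEE 754 binary64; please report this platform");
```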

Are there float and double types with fixed sizes in C99?

I found the answer in Any guaranteed minimum sizes for types in C?

Quoting Jed Smith (with corrected link to C99 standard):

Yes, the values in float.h and limits.h are system dependent. You should never make assumptions about the width of a type, but the standard does lay down some minimums. See §6.2.5 and §5.2.4.2.1 in the C99 standard.

For example, the standard only says that a char should be large enough to hold every character in the execution character set. It doesn't say how wide it is.

For the floating-point case, the standard hints at the order in which the widths of the types are given:

§6.2.5.10

There are three real floating types, designated as float, double, and long double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

The standard implicitly defines which type is wider than the other, but not specifically how wide they are. "Subset" itself is vague, because a long double can have exactly the same range as a double and still satisfy this clause.

This is pretty typical of how C goes, and a lot is left to each individual environment. You can't assume, you have to ask the compiler.

Are there fixed-width float types in clang?

Answering the question as posed, Clang's documented language extensions do not include analogs of GCC's _Float32 and _Float64 types. Do note, however, that even GCC provides those only on targets that support corresponding types natively.

On the other hand, inasmuch as clang is built on top of LLVM, it is worthwhile to consider LLVM's documentation of FP type representations:

The binary format of half, float, double, and fp128 correspond to the
IEEE-754-2008 specifications for binary16, binary32, binary64, and
binary128 respectively.

In that sense, then, Clang's equivalents of _Float64 and _Float32 are double and float, respectively. (Indeed, the same equivalence holds in GCC for substantially all targets where the explicit-width versions are supported.)

Fixed-sized float/double for portability

Most floating point is IEEE 754, 64 bit or 32 bit. However if the floating point unit in your processor is not compatible, there's no realistic, efficient way of making it compatible, and thus programs will produce slightly different results when run on different machines. (That's actually a good test for a sound program - if results are significantly different because of floating point errors, then you are handling floating point operations badly).

You can, however, load and save the closest representation to IEEE 754 in a binary file, portably. Code is maintained here.
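The idea can be sketched with nothing but frexp and ldexp, so it does not depend on how the host stores a double. This is not the code the answer links to; the function names pack_binary64, unpack_binary64 and store_be64 are made up for this illustration. The sketch handles zero and finite values in the normal binary64 range only: infinities and NaNs are not handled, subnormals are flushed to zero, the sign of -0.0 is lost, and extra host precision is truncated instead of rounded.

```c
#include <math.h>
#include <stdint.h>

/* Build the IEEE 754 binary64 bit pattern for x using arithmetic only.
   Hypothetical helper; see the limitations listed in the text above. */
uint64_t pack_binary64(double x)
{
    uint64_t sign = 0;
    int exp2;
    double frac;

    if (x == 0.0)
        return 0;
    if (x < 0.0) {
        sign = 1;
        x = -x;
    }
    frac = frexp(x, &exp2);   /* x == frac * 2^exp2, 0.5 <= frac < 1 */
    exp2 -= 1;                /* now x == (2*frac) * 2^exp2, 1 <= 2*frac < 2 */
    if (exp2 > 1023)
        return (sign << 63) | 0x7FF0000000000000ULL;  /* too large: infinity */
    if (exp2 < -1022)
        return sign << 63;                            /* too small: zero */

    /* Fraction bits: drop the implicit leading 1, scale by 2^52. */
    uint64_t mantissa = (uint64_t)ldexp(2.0 * frac - 1.0, 52);
    return (sign << 63) | ((uint64_t)(exp2 + 1023) << 52) | mantissa;
}

/* Inverse of pack_binary64 for zero and normal numbers. */
double unpack_binary64(uint64_t bits)
{
    uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFULL;
    int biased = (int)((bits >> 52) & 0x7FF);
    double value;

    if (biased == 0)
        return 0.0;           /* zero; subnormals are flushed in this sketch */
    value = ldexp(1.0 + ldexp((double)mantissa, -52), biased - 1023);
    return (bits >> 63) ? -value : value;
}

/* Write the 64 bits most significant byte first, so the file layout
   does not depend on the host's byte order. */
void store_be64(uint64_t v, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}
```

A writer would call store_be64(pack_binary64(x), buf) and then fwrite the eight bytes; a reader reassembles the 64-bit value from the bytes in the same order and passes it to unpack_binary64.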

Fixed size data types, C++, windows types

Floating Point

While no C++ standard defines the sizes or formats of floating point values, Microsoft has specified that it consistently uses the 4-byte and 8-byte IEEE floating point formats for the float and double types respectively.

Integrals

As for integral types, Microsoft does have compiler-specific defines for fixed-length variables. Fixed-width integral types such as int32_t and uint8_t are also available through the cstdint header, which has been part of standard C++ since C++11 (and, as stdint.h, part of C since C99).

Serialization

This will be terribly unportable and will most likely turn into a maintenance nightmare as your structs get more complicated. What you are effectively doing is defining an error-prone binary serialization format that must be complied with through convention. This problem has already been solved more effectively.

I would highly recommend using a serialization format like protocol buffers or maybe boost::serialization for communication between machines. If your data is hitting the wire, then the performance of serialization/deserialization is going to be an incredibly small fraction of transmission time.

Alignment

Another serious issue that you'll have is how the struct is packed in memory. Your struct will most likely be laid-out in memory differently in a 32-bit process than it is in a 64-bit process.

In a 32-bit process, your struct members will be aligned on word boundaries, and on doubleword boundaries for 64-bit.

For example, this program outputs 20 on 32-bit and 24 on 64-bit platforms:

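#include <iostream>
#include <cstdint>

// Note: mystruct_t is a variable of type mystruct, not a typedef.
struct mystruct {
    uint32_t y;
    double z;
    uint8_t c;
    float v;
} mystruct_t;

int main() {
    // Size of the object, including any padding the compiler inserted.
    std::cout << sizeof(mystruct_t);
}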
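To see where the padding actually goes, a C sketch along the following lines can print the offset of each member. The helper names mystruct_padding and print_mystruct_layout are invented for this example, and the numbers printed depend on the compiler and target, so none are given here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same members, in the same order, as the struct above. */
struct mystruct {
    uint32_t y;
    double z;
    uint8_t c;
    float v;
};

/* Padding = total size minus the sum of the member sizes.
   Hypothetical helper for this sketch. */
size_t mystruct_padding(void)
{
    return sizeof(struct mystruct)
         - (sizeof(uint32_t) + sizeof(double) + sizeof(uint8_t) + sizeof(float));
}

/* Print each member's offset; gaps between consecutive members
   (and after the last one) are padding inserted for alignment. */
void print_mystruct_layout(void)
{
    printf("y at %zu, z at %zu, c at %zu, v at %zu, size %zu, padding %zu\n",
           offsetof(struct mystruct, y),
           offsetof(struct mystruct, z),
           offsetof(struct mystruct, c),
           offsetof(struct mystruct, v),
           sizeof(struct mystruct),
           mystruct_padding());
}
```

Running print_mystruct_layout() in a 32-bit and a 64-bit build of the same source makes the difference visible member by member, which is exactly why raw struct bytes are a poor wire format.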

