Union for uint32_t and uint8_t[4] Undefined Behavior

Will a C union of uint32_t and uint8_t[4] always map the same way on little endian architectures?

TL;DR: Yes, the code is fine.

As noted, it contains implementation-defined behavior depending on endianness, but other than that, the behavior is well-defined and the code is portable (between little endian machines).


Detailed answer:

One thing that's important is that the order of allocation of an array is guaranteed, C11 6.2.5/20:

An array type describes a contiguously allocated nonempty set of objects with a particular member object type, called the element type.

This means that the array of 4 uint8_t is guaranteed to follow the allocation order of the uint32_t, which on a little endian system means least significant byte first.

In theory, the compiler is however free to toss in any padding at the end of a union (C11 6.7.2.1/17), but that shouldn't affect the data representation. If you want to pedantically protect against this - or more relevantly, you wish to protect against an issue in case more members are added later - you can add a compile-time assert:

typedef union {
    uint32_t double_word;
    uint8_t octets[4];
} u;

_Static_assert(sizeof(u) == sizeof(uint32_t), "union u: Padding detected");

As for the representation of the exact-width integer types, intN_t is guaranteed to use 2's complement, and both intN_t and uintN_t are guaranteed to have no padding bits (C11 7.20.1.1).

Finally, whether "type punning" through a union is allowed or is undefined behavior is specified a bit vaguely in C11 6.5.2.3:

A postfix expression followed by the . operator and an identifier designates a member of a structure or union object. The value is that of the named member,95) and is an lvalue if the first expression is an lvalue.

Where the (non-normative) note 95 provides clarification:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.

And since we already ruled out padding bits, trap representations are not an issue.
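The points above can be put together in a minimal sketch, assuming a C11 compiler. The union u and its static assertion come from the answer; the helpers is_little_endian and first_octet are hypothetical names introduced only for illustration.

```c
#include <assert.h>   /* static_assert is the <assert.h> spelling of _Static_assert */
#include <stdint.h>

typedef union {
    uint32_t double_word;
    uint8_t  octets[4];
} u;

/* Guard against padding, as suggested above. */
static_assert(sizeof(u) == sizeof(uint32_t), "union u: Padding detected");

/* Hypothetical helper: returns 1 when the least significant byte of a
   uint32_t is stored at the lowest address. */
static int is_little_endian(void)
{
    u probe;
    probe.double_word = 1u;
    return probe.octets[0] == 1u;
}

/* Hypothetical helper: stores a 32-bit value and reads back the octet at
   the lowest address through the other union member. */
static uint8_t first_octet(uint32_t value)
{
    u pun;
    pun.double_word = value;
    return pun.octets[0];
}
```

On a little endian machine first_octet(0x11223344) gives 0x44; on a big endian machine it gives 0x11. That difference is the implementation-defined part of the code.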

Use a uint32_t to store four separate uint8_t values

Your initial revision had a correct albeit roundabout approach for option 2, which was

// a, b, c, and d are initialized and of type uint8_t
uint32_t x = ...;
x = (x & 0xFFFFFF00) | (uint32_t) a;
x = (x & 0xFFFF00FF) | (uint32_t) b << 8;
x = (x & 0xFF00FFFF) | (uint32_t) c << 16;
x = (x & 0x00FFFFFF) | (uint32_t) d << 24;

This revision for option 2 is wrong:

uint32_t x = ...;
x |= (uint32_t) a;
x |= (uint32_t) b << 8;
x |= (uint32_t) c << 16;
x |= (uint32_t) d << 24;

Even when x is initialized it's still wrong, because you're not setting the 8-bit ranges, you're ORing them: any bit already set in x stays set.

The correct approach would be

// a, b, c, and d are initialized and of type uint8_t
uint32_t x = (uint32_t) a;
x |= (uint32_t) b << 8;
x |= (uint32_t) c << 16;
x |= (uint32_t) d << 24;

Or more succinctly

// a, b, c, and d are initialized and of type uint8_t
uint32_t x =
      (uint32_t) a
    | (uint32_t) b << 8
    | (uint32_t) c << 16
    | (uint32_t) d << 24;
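Wrapped into functions, the shift-based packing might look like the sketch below. The names pack4 and unpack are hypothetical. The result does not depend on the machine's byte order because only values are shifted and memory is never reinterpreted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs four octets into one word, with a in bits 0..7
   and d in bits 24..31. */
static uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return  (uint32_t) a
         | ((uint32_t) b << 8)
         | ((uint32_t) c << 16)
         | ((uint32_t) d << 24);
}

/* Hypothetical helper: extracts octet number index (0 = least significant). */
static uint8_t unpack(uint32_t x, unsigned index)
{
    assert(index < 4u);
    return (uint8_t) (x >> (8u * index));
}
```

For example, pack4(0x11, 0x22, 0x33, 0x44) yields 0x44332211, and unpack(0x44332211, 3) yields 0x44.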

The issue with option 1 is that it assumes the endianness of uint32_t to be LSB first and is therefore not a portable solution.


After receiving clarification about the question you're asking, your initial revision (the first code block in this answer) is the correct approach. It leaves the remaining 24 bits untouched while setting a particular 8 bit range to the uint8_t value on the RHS.
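The mask-then-OR pattern from the first code block can be generalized to any of the four positions. This is only a sketch; set_octet and or_octet are hypothetical names, and or_octet exists solely to show what the OR-only revision does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: replaces octet number index (0 = least significant)
   and leaves the other 24 bits untouched. */
static uint32_t set_octet(uint32_t x, unsigned index, uint8_t value)
{
    assert(index < 4u);
    const unsigned shift = 8u * index;
    const uint32_t mask  = (uint32_t) 0xFFu << shift;
    return (x & ~mask) | ((uint32_t) value << shift);
}

/* The OR-only variant: bits already set in the target octet survive. */
static uint32_t or_octet(uint32_t x, unsigned index, uint8_t value)
{
    assert(index < 4u);
    return x | ((uint32_t) value << (8u * index));
}
```

With x = 0xAABBCCDD, set_octet(x, 1, 0x11) gives 0xAABB11DD, whereas or_octet(x, 1, 0x11) gives 0xAABBDDDD because 0xCC | 0x11 is 0xDD.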

How to use union in C++ correctly?

The behaviour of your program is undefined because you read from an inactive member of the union.

How to use union in C++ correctly?

In general: by only reading from the active member of the union, which is the one that was assigned last. Exceptions do exist. For example, reading from an inactive member that has the same type as the active member is also allowed. There are no such exceptions that would apply to your program.

Since you want to pun the type into an array of bytes, there is another, well-defined way: reinterpret_cast:

struct
{
    ...
} bits;

static_assert(std::is_same_v<uint8_t, unsigned char>);
uint8_t* bytes = reinterpret_cast<uint8_t*>(&bits);

Note that reading through this bytes pointer is allowed specifically because unsigned char (along with a few other types) is special.
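For comparison, the same kind of inspection can be sketched in C, where a pointer to unsigned char may likewise be used to read the bytes of any object. byte_at is a hypothetical helper and not part of the question's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns byte number index of the object
   representation of *object, which is size bytes long. */
static unsigned char byte_at(const void *object, size_t size, size_t index)
{
    assert(index < size);
    return ((const unsigned char *) object)[index];
}
```

Which value ends up in which byte still depends on the implementation's layout; only the access itself is well-defined.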


Now, assuming that you use a language extension that defines the behaviour of union type punning or use the reinterpret_cast shown above:

Can anybody give me an advice why I have got 53 instead of 154 value in the msg.bytes[0]?

Because of this:

reg_addr | msb | bit field
7654321  | 0   | bit index in order of significance
0011010  | 1   | bit value

0b00110101 == 53

It is unclear why you had expected otherwise. Relying on the order of bit fields is not portable.
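A portable way to fix the bit positions is to build the byte with shifts instead of bit fields. The sketch below assumes a 7-bit reg_addr and a 1-bit msb flag as in the diagram; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* reg_addr in bits 7..1, msb flag in bit 0 (the layout in the diagram). */
static uint8_t pack_addr_then_flag(uint8_t reg_addr, uint8_t msb)
{
    assert(reg_addr < 0x80u && msb < 2u);
    return (uint8_t) ((reg_addr << 1) | msb);
}

/* msb flag in bit 7, reg_addr in bits 6..0 (the opposite field order). */
static uint8_t pack_flag_then_addr(uint8_t reg_addr, uint8_t msb)
{
    assert(reg_addr < 0x80u && msb < 2u);
    return (uint8_t) ((msb << 7) | reg_addr);
}
```

With reg_addr = 0x1A (binary 0011010) and msb = 1, the first function returns 53 and the second returns 154, so the two numbers in the question correspond to the two possible field orders.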

Purpose of Unions in C and C++

The purpose of unions is rather obvious, but for some reason people miss it quite often.

The purpose of union is to save memory by using the same memory region for storing different objects at different times. That's it.

It is like a room in a hotel. Different people live in it for non-overlapping periods of time. These people never meet, and generally don't know anything about each other. By properly managing the time-sharing of the rooms (i.e. by making sure different people don't get assigned to one room at the same time), a relatively small hotel can provide accommodations to a relatively large number of people, which is what hotels are for.

That's exactly what union does. If you know that several objects in your program hold values with non-overlapping value-lifetimes, then you can "merge" these objects into a union and thus save memory. Just like a hotel room has at most one "active" tenant at each moment of time, a union has at most one "active" member at each moment of program time. Only the "active" member can be read. By writing into other member you switch the "active" status to that other member.

For some reason, this original purpose of the union got "overridden" with something completely different: writing one member of a union and then inspecting it through another member. This kind of memory reinterpretation (aka "type punning") is not a valid use of unions. It generally leads to undefined behavior, though C89/90 describes it as producing implementation-defined behavior.

EDIT: Using unions for the purposes of type punning (i.e. writing one member and then reading another) was given a more detailed definition in one of the Technical Corrigenda to the C99 standard (see DR#257 and DR#283). However, keep in mind that formally this does not protect you from running into undefined behavior by attempting to read a trap representation.
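The "one active tenant" idea is usually implemented by storing a tag next to the union that records which member is active. The following is a minimal sketch; every name in it (slot, slot_kind and the helper functions) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum slot_kind { SLOT_INT, SLOT_FLOAT };

struct slot {
    enum slot_kind kind;   /* records which member is currently active */
    union {
        int32_t i;
        float   f;
    } as;
};

static struct slot slot_from_int(int32_t v)
{
    struct slot s;
    s.kind = SLOT_INT;
    s.as.i = v;
    return s;
}

static struct slot slot_from_float(float v)
{
    struct slot s;
    s.kind = SLOT_FLOAT;
    s.as.f = v;
    return s;
}

/* Reads only the member that the tag says is active. */
static double slot_to_double(const struct slot *s)
{
    switch (s->kind) {
    case SLOT_INT:   return (double) s->as.i;
    case SLOT_FLOAT: return (double) s->as.f;
    }
    assert(0 && "unknown slot kind");
    return 0.0;
}
```

The int32_t and the float share storage, and the tag guarantees that the program never reads a member other than the one written last.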

This dereferencing pointer in C works fine, but it looks wrong

The well-defined method for type punning is a memcpy between variables of the different types, i.e.

memcpy(&value, &aux, sizeof(float));

With optimization enabled and the variables involved residing in automatic storage (i.e. regular function variables), this typically translates into zero additional instructions, as the compiler will internally perform a single static assignment.
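As a self-contained sketch of the memcpy approach: the helper names float_to_bits and bits_to_float are hypothetical, and the code assumes that float is 32 bits wide, which the static assertion checks.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>
#include <string.h>   /* memcpy */

static_assert(sizeof(float) == sizeof(uint32_t), "float is not 32 bits wide");

/* Copies the object representation of a float into an integer. */
static uint32_t float_to_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Copies an integer bit pattern back into a float. */
static float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

On a platform that uses IEEE 754 single precision, float_to_bits(1.0f) yields 0x3F800000. The round trip bits_to_float(float_to_bits(x)) returns x regardless of the floating-point format.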

EDIT: Since C99 the other well-defined method is using a union for type punning:

union {
    uint8_t u8[sizeof(float)/sizeof(uint8_t)];
    uint32_t u32[sizeof(float)/sizeof(uint32_t)];
    float flt;
} float_type_punner;

build int32_t from 4 uint8_t values

typedef union
{
    uint32_t u32;
    int32_t i32;
    float f;
    uint16_t u16[2];
    int16_t i16[2];
    uint8_t u8[4];
    int8_t i8[4];
    char c[4];
} any32;

I keep that in my back pocket for all of my embedded system projects. Aside from needing to understand the endianness of your system, you can build 32-bit values rather easily from 8-bit pieces. This is very useful if you are shuttling out bytes on a serial line, I2C, or SPI. It's also useful if you are working with 8.24 (or 16.16 or 24.8) fixed-point math. I generally supplement this with some #defines to help with any endian headaches:

//!\todo add 16-bit boundary endian-ness options
#if (__LITTLE_ENDIAN)
#define FP_824_INTEGER (3)
#define FP_824_FRAC_HI (2)
#define FP_824_FRAC_MID (1)
#define FP_824_FRAC_LOW (0)
#elif (__BIG_ENDIAN)
#define FP_824_INTEGER (0)
#define FP_824_FRAC_HI (1)
#define FP_824_FRAC_MID (2)
#define FP_824_FRAC_LOW (3)
#else
#error undefined endian implementation
#endif
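The macro names __LITTLE_ENDIAN and __BIG_ENDIAN are not part of standard C, so how they behave depends on the toolchain. As a hedged variant, the sketch below assumes GCC or Clang, which predefine __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__ and __ORDER_BIG_ENDIAN__. The type any32_min is a trimmed-down stand-in for any32, and the two helper functions are hypothetical.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* Assumption: GCC or Clang, which predefine __BYTE_ORDER__. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define FP_824_INTEGER  (3)
#define FP_824_FRAC_HI  (2)
#define FP_824_FRAC_MID (1)
#define FP_824_FRAC_LOW (0)
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define FP_824_INTEGER  (0)
#define FP_824_FRAC_HI  (1)
#define FP_824_FRAC_MID (2)
#define FP_824_FRAC_LOW (3)
#else
#error undefined endian implementation
#endif

typedef union {
    uint32_t u32;
    uint8_t  u8[4];
} any32_min;

static_assert(sizeof(any32_min) == 4, "any32_min: padding detected");

/* Builds an 8.24 fixed-point word from its four bytes. */
static uint32_t fp824_from_bytes(uint8_t integer, uint8_t hi,
                                 uint8_t mid, uint8_t low)
{
    any32_min v;
    v.u8[FP_824_INTEGER]  = integer;
    v.u8[FP_824_FRAC_HI]  = hi;
    v.u8[FP_824_FRAC_MID] = mid;
    v.u8[FP_824_FRAC_LOW] = low;
    return v.u32;
}

/* Returns the integer part (top 8 bits) of an 8.24 fixed-point word. */
static uint8_t fp824_integer_part(uint32_t raw)
{
    any32_min v;
    v.u32 = raw;
    return v.u8[FP_824_INTEGER];
}
```

Because the index macros follow the byte order, fp824_from_bytes(0x12, 0x34, 0x56, 0x78) yields 0x12345678 on both little and big endian targets.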

Does this type aliasing using union invoke undefined behavior?

Common initial subsequence has a ridiculously specific definition. int and struct foo{int x;} do not have a common initial subsequence.

struct foo{int x;} and struct bar{int y;} do have a common initial subsequence.

Reading memory through an unrelated type is not the same as reading from a union alternative. That text doesn't do anything there.

You can do (std::uint8_t const*)&addr.value and treat it as a 4 byte array, assuming your platform has uint8_t. The byte values you get are implementation defined.

You cannot, under the standard, read from parts[i] however (when value exists).

Compilers are free to define behaviour that the standard leaves undefined, except during compile-time constexpr evaluation.
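C has a comparable common-initial-sequence rule for structures inside a union (C11 6.5.2.3/6). The sketch below reuses the struct foo and struct bar from the text; the union name foo_or_bar and the function are hypothetical, and the code is C rather than C++.

```c
#include <assert.h>   /* static_assert */

struct foo { int x; };
struct bar { int y; };

/* Both structures start with an int, so they share a common initial
   sequence. */
union foo_or_bar {
    struct foo f;
    struct bar b;
};

static_assert(sizeof(struct foo) == sizeof(struct bar),
              "foo and bar are expected to have the same size");

/* Writes through one structure and inspects the common initial part
   through the other. */
static int read_y_after_writing_x(int value)
{
    union foo_or_bar u;
    u.f.x = value;
    return u.b.y;
}
```

A plain int member next to struct foo would not qualify, since the rule is stated only for structures that are members of the same union.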

Copying differently sized data from a union to a byte array

In short, the answer is no, it's not safe.

A simple change made to a compiler setting or environmental differences between platforms could result in the data being interpreted incorrectly by the receiving application.

As you know, the memory size of a union is at least that of its largest member. In simplistic terms, the union defined as:

union{
    uint8_t byte;
    uint16_t ushort;
    uint8_t bytes[2];
}data;

will take up a minimum of 2 bytes, but the order in which the bytes are packed is determined by compiler settings (such as whether multi-byte values are stored least or most significant byte first) and by the microprocessor architecture.

For example:

The two bytes for byte and ushort could be packed as follows:

byte   | byte |      |
byte   |      | byte |
ushort | MSB  | LSB  |
ushort | LSB  | MSB  |

As you can see, byte may coincide with either half of ushort (every union member starts at the same address, so which half it is depends on the byte order), and the data stored for ushort may appear reversed, with the most significant byte first in one example and the least significant byte first in the other.

Each of the above examples may be determined by the compiler and its settings.

To make matters worse, some microprocessors rearrange bytes depending on their architecture, for example when handling a uint32_t.

Instead of uint32_t being stored as |byte0|byte1|byte2|byte3| it may be stored as |byte1|byte0|byte3|byte2|.

If your union changed to

union{
    uint8_t byte;
    uint16_t ushort;
    struct{
        uint8_t byte[3];
    }multiByte;
};

Matters become even more complex, as you will now most probably have data alignment issues. On a 16-bit processor the 3 bytes within the multiByte structure will be placed on a 16-bit boundary, causing the union to take up 4 bytes rather than 3 by default.
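The effect can be checked rather than guessed. The sketch below gives the union the hypothetical tag packet and assumes a C11 compiler for _Alignof; packet_padding is likewise a made-up helper. The actual sizes are implementation-specific, so the code only asserts properties that hold everywhere.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>
#include <stdint.h>

union packet {
    uint8_t  byte;
    uint16_t ushort;
    struct {
        uint8_t byte[3];
    } multiByte;
};

/* The union must be able to hold its largest member... */
static_assert(sizeof(union packet) >= 3, "union packet is too small");
/* ...and its size must be a multiple of the alignment of uint16_t. */
static_assert(sizeof(union packet) % _Alignof(uint16_t) == 0,
              "union packet is not aligned for uint16_t");

/* Hypothetical helper: number of bytes the union occupies beyond its
   multiByte member. */
static size_t packet_padding(void)
{
    return sizeof(union packet) - sizeof(((union packet *) 0)->multiByte);
}
```

On a typical target where uint16_t is 2-byte aligned, sizeof(union packet) is 4 and packet_padding() returns 1, but neither value is guaranteed by the standard.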

So if you are reliant on compiler settings or architecture to ensure that your data packing is consistent then the project isn't supportable long term and it may not be portable without change.

Hence for safety it is best to be pedantic and process the data to ensure that it is in the correct order both for transmission and on reception.
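One way to be pedantic about the order is to serialize each value with shifts, so that the bytes on the wire do not depend on the compiler or the architecture. This is a minimal sketch assuming most-significant-byte-first transmission, which is an arbitrary choice for illustration; put_u16_be and get_u16_be are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Writes value into out[0..1], most significant byte first. */
static void put_u16_be(uint8_t out[2], uint16_t value)
{
    assert(out != 0);
    out[0] = (uint8_t) (value >> 8);
    out[1] = (uint8_t) (value & 0xFFu);
}

/* Reads a value that was written most significant byte first. */
static uint16_t get_u16_be(const uint8_t in[2])
{
    assert(in != 0);
    return (uint16_t) (((uint16_t) in[0] << 8) | in[1]);
}
```

The sender calls put_u16_be before transmission and the receiver calls get_u16_be on reception. Both sides then agree on the value even if their native byte orders differ.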


