Does the 'offsetof' Macro from <stddef.h> Invoke Undefined Behaviour

Does the 'offsetof' macro from stddef.h invoke undefined behaviour?

Where the language standard says "undefined behavior", any given compiler can define the behavior. Implementation code in the standard library typically relies on that. So there are two questions:

(1) Is the code UB with respect to the C++ standard?

That's a really hard question, because it's a well-known almost-defect that the C++98/03 standard never says outright in normative text that dereferencing a null pointer is UB in general. It is only implied by the exception for typeid, where it's not UB.

What you can say decidedly is that it's UB to use offsetof with a non-POD type.

(2) Is the code UB with respect to the compiler that it's written for?

No, of course not.

A compiler vendor's code for a given compiler can use any feature of that compiler.

Portability of using stddef.h's offsetof rather than rolling your own

To answer #2: yes, gcc-4* (I'm currently looking at v4.3.4, released 4 Aug 2009, but it should hold true for all gcc-4 releases to date). The following definition is used in their stddef.h:

#define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)

where __builtin_offsetof is a compiler builtin like sizeof (that is, it's not implemented as a macro or run-time function). Compiling the code:

#include <stddef.h>

struct testcase {
    char array[256];
};

int main (void) {
    char buffer[offsetof(struct testcase, array[0])];
    return 0;
}

would result in an error using the expansion of the macro that you provided ("size of array ‘buffer’ is not an integral constant-expression") but would work when using the macro provided in stddef.h. Builds using gcc-3 used a macro similar to yours. I suppose that the gcc developers had many of the same concerns regarding undefined behavior, etc. that have been expressed here, and created the compiler builtin as a safer alternative to attempting to generate the equivalent operation in C code.

Additional information:

  • A mailing list thread from the Linux kernel developer's list
  • GCC's documentation on offsetof
  • A sort-of-related question on this site

Regarding your other questions: I think R's answer and his subsequent comments do a good job of outlining the relevant sections of the standard as far as question #1 is concerned. As for your third question, I have not heard of a modern C compiler that does not have stddef.h. I certainly wouldn't consider any compiler lacking such a basic standard header as "production". Likewise, if their offsetof implementation didn't work, then the compiler still has work to do before it could be considered "production", just like if other things in stddef.h (like NULL) didn't work. A C compiler released prior to C's standardization might not have these things, but the ANSI C standard is over 20 years old so it's extremely unlikely that you'll encounter one of these.

The whole premise of this problem raises a question: If these people are convinced that they can't trust the version of offsetof that the compiler provides, then what can they trust? Do they trust that NULL is defined correctly? Do they trust that long int is no smaller than a regular int? Do they trust that memcpy works like it's supposed to? Do they roll their own versions of the rest of the C standard library functionality? One of the big reasons for having language standards is so that you can trust the compiler to do these things correctly. It seems silly to trust the compiler for everything else except offsetof.

Update: (in response to your comments)

I think my co-workers behave like yours do :-) Some of our older code still has custom macros defining NULL, VOID, and other things like that since "different compilers may implement them differently" (sigh). Some of this code was written back before C was standardized, and many older developers are still in that mindset even though the C standard clearly says otherwise.

Here's one thing you can do to both prove them wrong and make everyone happy at the same time:

#include <stddef.h>

#ifndef offsetof
#define offsetof(tp, member) (((char*) &((tp*)0)->member) - (char*)0)
#endif

In reality, they'll be using the version provided in stddef.h. The custom version will always be there, however, in case you run into a hypothetical compiler that doesn't define it.

Based on similar conversations that I've had over the years, I think the belief that offsetof isn't part of standard C comes from two places. First, it's a rarely used feature. Developers don't see it very often, so they forget that it even exists. Second, offsetof is not mentioned at all in Kernighan and Ritchie's seminal book "The C Programming Language" (even the most recent edition). The first edition of the book was the unofficial standard before C was standardized, and I often hear people mistakenly referring to that book as THE standard for the language. It's much easier to read than the official standard, so I don't know if I blame them for making it their first point of reference. Regardless of what they believe, however, the standard is clear that offsetof is part of ANSI C (see R's answer for a link).


Here's another way of looking at question #1. The ANSI C standard gives the following definition in section 4.1.5:

     offsetof( type,  member-designator)

which expands to an integral constant expression that has type size_t,
the value of which is the offset in bytes, to the structure member
(designated by member-designator ), from the beginning of its
structure (designated by type ).

Using the offsetof macro does not invoke undefined behavior. In fact, the behavior is all that the standard actually defines. It's up to the compiler writer to define the offsetof macro such that its behavior follows the standard. Whether it's implemented using a macro, a compiler builtin, or something else, ensuring that it behaves as expected requires the implementor to deeply understand the inner workings of the compiler and how it will interpret the code. The compiler may implement it using a macro like the idiomatic version you provided, but only because they know how the compiler will handle the non-standard code.

On the other hand, the macro expansion you provided indeed invokes undefined behavior. Since you don't know enough about the compiler to predict how it will process the code, you can't guarantee that particular implementation of offsetof will always work. Many people define their own version like that and don't run into problems, but that doesn't mean that the code is correct. Even if that's the way that a particular compiler happens to define offsetof, writing that code yourself invokes UB while using the provided offsetof macro does not.

Rolling your own macro for offsetof can't be done without invoking undefined behavior (ANSI C section A.6.2 "Undefined behavior", 27th bullet point). Using stddef.h's version of offsetof will always produce the behavior defined in the standard (assuming a standards-compliant compiler). I would advise against defining a custom version since it can cause portability problems, but if others can't be persuaded then the #ifndef offsetof snippet provided above may be an acceptable compromise.

How does the C offsetof macro work?

It has no advantages and should not be used, since it invokes undefined behavior (and uses the wrong type - int instead of size_t).

The C standard defines an offsetof macro in stddef.h which actually works, for cases where you need the offset of an element in a structure, such as:

#include <stddef.h>

/* Field type tags; the names are illustrative. */
enum { INT, CHARPTR };

struct foo {
    int a;
    int b;
    char *c;
};

struct struct_desc {
    const char *name;
    int type;
    size_t off;
};

static const struct struct_desc foo_desc[] = {
    { "a", INT, offsetof(struct foo, a) },
    { "b", INT, offsetof(struct foo, b) },
    { "c", CHARPTR, offsetof(struct foo, c) },
};

which would let you programmatically fill the fields of a struct foo by name, e.g. when reading a JSON file.
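To make the "fill by name" idea concrete, here is a minimal sketch of a setter driven by such a table. The helper foo_set_int and the enum tags are my own names for illustration, not part of any library, and only the int fields are handled.

```c
#include <stddef.h>
#include <string.h>

enum field_type { INT, CHARPTR };   /* hypothetical type tags */

struct foo {
    int a;
    int b;
    char *c;
};

struct struct_desc {
    const char *name;
    int type;
    size_t off;
};

static const struct struct_desc foo_desc[] = {
    { "a", INT,     offsetof(struct foo, a) },
    { "b", INT,     offsetof(struct foo, b) },
    { "c", CHARPTR, offsetof(struct foo, c) },
};

/* Hypothetical helper: store value into the INT field of *obj called name.
   Returns 0 on success, -1 if no INT field has that name. */
int foo_set_int(struct foo *obj, const char *name, int value)
{
    size_t n = sizeof foo_desc / sizeof foo_desc[0];
    for (size_t i = 0; i < n; ++i) {
        if (foo_desc[i].type == INT && strcmp(foo_desc[i].name, name) == 0) {
            /* Do the arithmetic in bytes, then convert back to the member's type. */
            int *field = (int *)((char *)obj + foo_desc[i].off);
            *field = value;
            return 0;
        }
    }
    return -1;
}
```

A parser would call something like foo_set_int(&f, "b", 7) for each key/value pair it reads. The type tag is checked before the cast so that an int is never written into the char * field.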

Why does this implementation of offsetof() work?

At no point in the above code is anything dereferenced. A dereference occurs when * or -> is used on an address value to find the referenced value. The only use of * above is in a type declaration for the purpose of casting.

The -> operator is used above, but it's not used to access the value. Instead it's used to grab the address of the value. Here is a non-macro code sample that should make it a bit clearer:

SomeType *pSomeType = GetTheValue();
int* pMember = &(pSomeType->SomeIntMember);

The second line does not actually cause a dereference (implementation dependent). It simply returns the address of SomeIntMember within the pSomeType value.

What you see is a lot of casting between arbitrary types and char pointers. The reason for char is that the character types are the only types whose size the C89 standard fixes: sizeof(char) is 1 by definition. Because the size is one, arithmetic on char pointers counts in bytes, and the above code can do the evil magic of calculating the true offset of the value.
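The same byte arithmetic can be tried on a real object instead of a null pointer, which sidesteps the questionable part entirely. This is only a sketch; struct sample and value_offset_by_subtraction are names I made up for the example.

```c
#include <stddef.h>

struct sample {          /* hypothetical type for illustration */
    char tag;
    int value;
    double weight;
};

/* Distance in bytes from the start of *s to s->value,
   computed on an existing object rather than on address 0. */
size_t value_offset_by_subtraction(const struct sample *s)
{
    const char *base = (const char *)s;            /* first byte of the object */
    const char *member = (const char *)&s->value;  /* first byte of the member */
    return (size_t)(member - base);                /* char pointers differ in bytes */
}
```

For any struct sample object, the result should equal offsetof(struct sample, value), since both describe the same distance in bytes.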

Is accessing members through offsetof well defined?

As far as I can tell, it is well-defined behavior. But only because you access the data through a char type. If you had used some other pointer type to access the struct, it would have been a "strict aliasing violation".

Strictly speaking, it is not well-defined to access an array out-of-bounds, but it is well-defined to use a character type pointer to grab any byte out of a struct. By using offsetof you guarantee that this byte is not a padding byte (which could have meant that you would get an indeterminate value).

Note however, that casting away the const qualifier does result in poorly-defined behavior.

EDIT

Similarly, the cast (char**)ptr is an invalid pointer conversion - this alone is undefined behavior as it violates strict aliasing. The variable ptr itself was declared as a char*, so you can't lie to the compiler and say "hey, this is actually a char**", because it is not. This is regardless of what ptr points at.

I believe that the correct code with no poorly-defined behavior would be this:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char* a;
    const char* b;
} A;

int main() {
    A test[3] = {
        {.a = "Hello", .b = "there."},
        {.a = "How are", .b = "you?"},
        {.a = "I\'m", .b = "fine."}};

    for (size_t i = 0; i < 3; ++i) {
        const char* ptr = (const char*) &test[i];
        ptr += offsetof(A, b);

        /* Extract the const char* from the address that ptr points at,
           and store it inside ptr itself: */
        memmove(&ptr, ptr, sizeof(const char*));
        printf("%s\n", ptr);
    }
}

Using offsetof to access struct member

Yes, this is perfectly well defined, and is exactly how offsetof is intended to be used. You do the pointer arithmetic on a pointer to character type, so that it is done in bytes, and then cast back to the actual type of the member.

You can see for instance 6.3.2.3 p7 (all references are to C17 draft N2176):

When a pointer to
an object is converted to a pointer to a character type, the result points to the lowest addressed byte
of the object. Successive increments of the result, up to the size of the object, yield pointers to the
remaining bytes of the object.

So (char *)&x is a pointer to x converted to a pointer to char, therefore it points to the lowest addressed byte of x. When we add offsetof(struct X, b) (say it's 4) then we have a pointer to byte 4 of x. Now offsetof(struct X, b) is defined to return

the
offset in bytes, to the structure member, from the beginning of its
structure [7.19p3]

so 4 is in fact the offset from the beginning of x to x.b. Hence byte 4 of x is the lowest byte of x.b, and that's what ptr points to; in other words, ptr is a pointer to x.b, but of type char *. When we cast it back to int *, we have a pointer to x.b which is of the type int *, exactly the same as we would get from the expression &x.b. So dereferencing this pointer accesses x.b.
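Spelled out as code, the steps look roughly like this. It is a sketch under the assumption that struct X has two int members; the member a and both function names are mine, not from the question.

```c
#include <stddef.h>

struct X {
    int a;   /* assumed first member, for illustration */
    int b;
};

/* Store value into x->b by way of a byte pointer and offsetof. */
void set_b_via_offset(struct X *x, int value)
{
    char *ptr = (char *)x;           /* lowest addressed byte of *x */
    ptr += offsetof(struct X, b);    /* lowest addressed byte of x->b */
    *(int *)ptr = value;             /* lvalue of type int designating x->b */
}

/* Returns nonzero if the pointer built from offsetof equals &x->b. */
int points_to_b(struct X *x)
{
    return (int *)((char *)x + offsetof(struct X, b)) == &x->b;
}
```

After set_b_via_offset(&x, 42), reading x.b directly should give 42 and x.a should be untouched.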


A question arose in the comments about this last step: when ptr is cast back to int *, how do we know we indeed have a pointer to the int x.b? This is a bit less explicit in the standard but I think it is the obvious intent.

However, I think we can also derive it indirectly. Hopefully we agree that ptr above is a pointer to the lowest addressed byte of x.b. Now by the same passage of 6.3.2.3 p7 quoted above, taking a pointer to x.b and converting it to char *, as in (char *)&x.b, would also yield a pointer to the lowest addressed byte of x.b. As they are pointers of the same type which point to the same byte, they are the same pointer: ptr == (char *)&x.b.

Then we look at the preceding sentences of 6.3.2.3 p7:

A pointer to an object type may be converted to a pointer to a different object type. If the resulting
pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise,
when converted back again, the result shall compare equal to the original pointer.

There are no problems with alignment here, because char has the weakest alignment requirement (6.2.8 p6). So converting (char *)&x.b back to int * must recover a pointer to x.b, i.e. (int *)(char *)&x.b == &x.b.

But ptr is the same pointer as (char *)&x.b, so we may substitute them in this equality: (int *)ptr == &x.b.

Obviously *&x.b produces an lvalue designating x.b (6.5.3.2 p4), hence so does *(int *)ptr.


There is no problem with strict aliasing (6.5p7). First, determine the effective type of x.b using 6.5p6:

The effective type of an object for an access to its stored value is the declared type of the object, if
any. [Then explanations on what to do if it doesn't have a declared type.]

Well, x.b does have a declared type, which is int. So its effective type is int.

Now to see if the access is legal under strict aliasing, see 6.5p7:

An object shall have its stored value accessed only by an lvalue expression that has one of the
following types:

— a type compatible with the effective type of the object,

[more options not relevant here]

We are accessing x.b through the lvalue expression *(int *)ptr, which has type int. And int is compatible with int per 6.2.7p1:

Two types have compatible type if their types are the same. [Then other conditions under which they may also be compatible].


An example of this same technique that maybe is more familiar is indexing into an array by bytes. If we have

int arr[100];
*(int *)((char *)arr + (17 * sizeof(int))) = 42;

then this is equivalent to arr[17] = 42;.

This is how generic routines like qsort and bsearch are implemented. If we try to qsort an array of int, then within qsort all the pointer arithmetic is done in bytes, on pointers to character type with the offsets manually scaled by the object size passed as an argument (which here would be sizeof(int)). When qsort needs to compare two objects, it casts them to const void * and passes them as arguments to the comparator function, which casts them back to const int * to do the comparison.

This all works fine and is clearly an intended feature of the language. So I think we needn't doubt that the use of offsetof in the current question is similarly an intended feature.
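As a rough illustration of that pattern (not the actual library source of qsort or bsearch), here is a generic linear search that steps through an array in bytes and hands each element to a comparator. The names linear_find and compare_int are hypothetical.

```c
#include <stddef.h>

/* Generic search in the style of bsearch, but linear: returns a pointer to
   the first element equal to *key, or NULL if there is none. */
void *linear_find(const void *key, const void *base, size_t nmemb, size_t size,
                  int (*compar)(const void *, const void *))
{
    const char *p = base;                  /* all arithmetic is done in bytes */
    for (size_t i = 0; i < nmemb; ++i) {
        const void *elem = p + i * size;   /* offset scaled by the element size */
        if (compar(key, elem) == 0)
            return (void *)elem;           /* same const-dropping return as bsearch */
    }
    return NULL;
}

/* Comparator for int elements: converts back to the real element type. */
int compare_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

Called as linear_find(&key, arr, 5, sizeof arr[0], compare_int) on an int array, the returned pointer is the address of the matching element, converted back to int * by the caller.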

Is this valid ANSI C++ code? Trying to generate offsets of structure members at compile-time

You're not dereferencing anything invalid here. All that macro does is tell the compiler that a structure of type p_type exists in memory at the address NULL. It then takes the address of p_member, which is a member of this fictitious structure. So, no dereferencing anywhere.

In fact, this is exactly what the offsetof macro, defined in stddef.h does.

EDIT:

As some of the comments say, this may not work well with C++ and inheritance; I've only used offsetof with POD structures in C.

GCC 4.4.3 offsetof constant expression bug. How should I work around this?

Standards

In the C++98 standard, there's some information in

C.2.4.1 Macro offsetof(type, memberdesignator) [diff.offsetof]

The macro offsetof, defined in <cstddef>, accepts a restricted set of type arguments in this International
Standard. §18.1 describes the change.

(C.2.4.1 showed up with offsetof in the contents, so I went there first.) And:

§18.1 Types 18 Language support library

¶5 The macro offsetof accepts a restricted set of type arguments in this International Standard. type
shall be a POD structure or a POD union (clause 9). The result of applying the offsetof macro to a field that
is a static data member or a function member is undefined.

For comparison, the C99 standard says:

 offsetof(type, member-designator)

which expands to an integer constant expression that has type size_t, the value of
which is the offset in bytes, to the structure member (designated by member-designator),
from the beginning of its structure (designated by type). The type and member designator
shall be such that given

static type t;

then the expression &(t.member-designator) evaluates to an address constant. (If the
specified member is a bit-field, the behavior is undefined.)
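To illustrate the C99 wording, here is a small sketch with types I invented for the purpose. Because &(t.in.y) and &(t.arr[2]) are address constants for a static t, nested and subscripted member designators should be acceptable, and the result can be used where an integer constant expression is required.

```c
#include <stddef.h>

struct inner {              /* hypothetical types for illustration */
    char flag;
    int y;
};

struct outer {
    long id;
    struct inner in;
    int arr[4];
};

/* An enumeration constant must be an integer constant expression,
   so this only compiles if offsetof yields one. */
enum { IN_Y_OFFSET = offsetof(struct outer, in.y) };

size_t nested_offset(void)
{
    return offsetof(struct outer, in.y);     /* nested member designator */
}

size_t element_offset(void)
{
    return offsetof(struct outer, arr[2]);   /* subscripted member designator */
}
```

The nested offset should equal offsetof(struct outer, in) + offsetof(struct inner, y), and the element offset should equal offsetof(struct outer, arr) + 2 * sizeof(int).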


Your code

Your code meets the requirements of both the C++ and C standards, it seems to me.

When I use G++ 4.1.2 and GCC 4.5.1 on RedHat (RHEL 5), this code compiles without complaint with the -Wall -Wextra options:

#include <cstddef>

struct SomeType {
    int m_member;
};

static const int memberOffset = offsetof(SomeType, m_member);

It also compiles without complaint with #include <stddef.h> and with the GCC compilers (if I use struct SomeType in the macro invocation).

I wonder - I got errors until I included <cstddef>...did you include that? I also added the type int to the declaration, of course.

Assuming that you haven't made any bloopers in your code, it seems to me that you probably have found a bug in the <cstddef> (or <stddef.h>) header on your platform. You should not be getting the error, and the Linux-based G++ appears to confirm that.

Workarounds?

You will need to review how offsetof() is defined in your system headers. You will then probably redefine it in such a way as not to run into the problem.

You might be able to use something like this, assuming you identify your broken system somehow and execute #define BROKEN_OFFSETOF_MACRO (or add -DBROKEN_OFFSETOF_MACRO to the command line).

#include <cstddef>

#ifdef BROKEN_OFFSETOF_MACRO
#undef offsetof
#define offsetof(type, member) ((size_t)((char *)&(*(type *)0).member - \
                                         (char *)&(*(type *)0)))
#endif /* BROKEN_OFFSETOF_MACRO */

struct SomeType {
    int m_member;
};

static const int memberOffset = offsetof(SomeType, m_member);

The size_t cast is present since the difference between two addresses is a ptrdiff_t and the offsetof() macro is defined to return size_t. The macro is nothing other than ugly, but that's why it is normally hidden in a system header where you don't have to look at it in all its ghastliness. But when all else fails, you must do whatever is necessary.

I know that once, circa 1990, I encountered a C compiler that would not allow 0 but it would allow 1024 instead. The distributed <stddef.h> header, of course, used 0, so I 'fixed' it by changing the 0 to 1024 (twice) for the duration (until I got a better compiler on a better machine).


