Difference Between Packed VS Normal Data Type

Difference between packed vs normal data type

This information is from here

float4 has an alignment of 16 bytes. This means that the memory address of such a type (e.g. 0x12345670) will be divisible by 16 (aka the last hexadecimal digit is 0).

packed_float4 on the other hand has an alignment of 4 bytes. Last digit of the address will be 0, 4, 8 or c

This does matter when you create custom structs. Say you want a struct with 2 normal floats and 1 float4/packed_float4:

struct A{
    float x, y;
    float4 z;
}

struct B{
    float x, y;
    packed_float4 z;
}

For A: The alignment of float4 has to be 16 and since float4 has to be after the normal floats, there is going to be 8 bytes of empty space between y and z. Here is what A looks like in memory:

 Address | 0x200 | 0x204 | 0x208 | 0x20c | 0x210 | 0x214 | 0x218 | 0x21c |
 Content |   x   |   y   |   -   |   -   |   z1  |   z2  |   z3  |   z4  |
                                             ^Has to be 16 byte aligned

For B: Alignment of packed_float4 is 4, the same as float, so it can follow right after the floats in any case:

 Address | 0x200 | 0x204 | 0x208 | 0x20c | 0x210 | 0x214 |
 Content |   x   |   y   |   z1  |   z2  |   z3  |   z4  |

As you can see, A takes up 32 bytes whereas B only uses 24 bytes. When you have an array of those structs, A will take up 8 more bytes for every element. So for passing around a lot of data, the latter is preferred.

The reason you need float4 at all is because the GPU can't handle 4 byte aligned packed_float4s, you won't be able to return packed_float4 in a shader. This is because of performance I assume.

One last thing: When you declare the Swift version of a struct:

struct S {
    let x, y: Float
    let z : (Float, Float, Float, Float)
}

This struct will be equal to B in Metal and not A. A tuple is like a packed_floatN.

All of this also applies to other vector types such as packed_float3, packed_short2, ect.

What is packed and unpacked and extended packed data

Well, I've just been searching for the answer to the same question, and also with no success. So I can only be guessing.

Intel introduced packed and scalar instructions already in their MMX technology. For example, they introduced a function

__m64 _mm_add_pi8 (__m64 a, __m64 b)

At that time there was no such a thing as "extended packed". The only data type was __m64 and all operations worked on integers.
With SSE there came 128-bit registers and operations on floating point numbers. However, SSE2 included a superset of MMX operations on integers performed in 128-bit registers. For example,

__m128i _mm_add_epi8 (__m128i a, __m128i b)

Here for the first time we see the "ep" (extended packed") part of the function name. Why it was introduced? I believe this was a solution to the problem of the name _mm_add_pi8 being already taken by the MMX instruction listed above. The interface of SSE/AVX is in the C language, where there's no polymorphism of function names.

With AVX, Intel chose a different strategy, and started to add the register length just after the opening "_mm" letters, c.f.:

__m256i _mm256_add_epi8 (__m256i a, __m256i b)
__m512i _mm512_add_epi8 (__m512i a, __m512i b)

Why they here chose "ep" and not "p" is a mystery, irrelevant for programmers. Actually, they seem to use "p" for operations on floats and doubles and "ep" for integers.

__m128d _mm_add_pd (__m128d a, __m128d b); // "d": function operates on doubles
__m256 _mm256_add_ps (__m256 a, __m256 b); // "s": function operates on floats

Perhaps this goes back to the transition from MMX to SSE, where "ep" was introduced for operations on integers (no floats were handled by MMX) and an attempt to make AVX mnemonics as close to the SSE ones as possible

Thus, basically, from the perspective of a programmer, there's no difference between "ep" ("extended packed") and "p" ("packed"), for we are already aware of the register length that we target in our code.

As for the next part of the question, "unpacking" belongs to a completely different category of notions than "scalar" and "packed". This is rather a colloquial term for a particular data rearrangement or shuffle, like rotation or shift.

The reason for using "epi" in the name of intrinsics like _mm256_unpackhi_epi16 is that it is a truly vector (not scalar) function on a vector of 16-bit integer elements. Notice that here "unpack" belongs to the part of the function name that describe its action (like mul, add, or permute), whereas "s" / "p" / "ep" (scalar, packed, extended packed) belong to the part describing the operation mode (scalar for "s", vector for "p" or "ep").

(There are no scalar-integer instructions that operate between two XMM registers, but "si" does appear in the intrinsic name for movd eax, xmm0: _mm_cvtsi128_si32. There are a few similar intrinsics.)

What is the difference between #pragma pack and attribute((aligned))

The #pragma pack(byte-alignment) effect each member of the struct as specified by the byte-alignment input, or on their natural alignment boundary, whichever is less.

The __attribute__((aligned(byte-alignment))) affect the minimum alignment of the variable (or struct field if specified within the struct)

I believe the following are equivalent

#define L1_CACHE_LINE 2

struct A
{
    u_int32_t   a   __attribute__ ( (aligned(L1_CACHE_LINE)) );
    u_int32_t   b   __attribute__ ( (aligned(L1_CACHE_LINE)) );
    u_int16_t   c   __attribute__ ( (aligned(L1_CACHE_LINE)) );       
    u_int16_t   d   __attribute__ ( (aligned(L1_CACHE_LINE)) );      
    u_int32_t   e   __attribute__ ( (aligned(L1_CACHE_LINE)) );     
};

#pragma pack(L1_CACHE_LINE)
struct A
{
    u_int32_t   a;  
    u_int32_t   b;  
    u_int16_t   c;  
    u_int16_t   d;  
    u_int32_t   e;  
};
#pragma pack()

where is A a __attritube__((aligned(L1_CACHE_LINE))) will insure u_int32_t a inside struct A will aligned with 2 byte but will not align the other variable in the same manner.

Reference:

http://publib.boulder.ibm.com/infocenter/macxhelp/v6v81/index.jsp?topic=%2Fcom.ibm.vacpp6m.doc%2Fcompiler%2Fref%2Frnpgpack.htm
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/attributes-variables.html

What is a packed structure in C?

When structures are defined, the compiler is allowed to add paddings (spaces without actual data) so that members fall in address boundaries that are easier to access for the CPU.

For example, on a 32-bit CPU, 32-bit members should start at addresses that are multiple of 4 bytes in order to be efficiently accessed (read and written). The following structure definition adds a 16-bit padding between both members, so that the second member falls in a proper address boundary:

struct S {
    int16_t member1;
    int32_t member2;
};

The structure in memory of the above structure in a 32-bit architecture is (~ = padding):

+---------+---------+
| m1 |~~~~|   m2    |
+---------+---------+

When a structure is packed, these paddings are not inserted. The compiler has to generate more code (which runs slower) to extract the non-aligned data members, and also to write to them.

The same structure, when packed, will appear in memory as something like:

+---------+---------+
| m1 |   m2    |~~~~
+---------+---------+

Differences in Python's packed binary data size among platforms?

Thanks @jasonharper:

You have to start your struct format string with one of the standard byte order/size/alignment indicators (usually < or >) in order to get any sort of cross-platform compatibility

Structure padding and packing

Padding aligns structure members to "natural" address boundaries - say, int members would have offsets, which are mod(4) == 0 on 32-bit platform. Padding is on by default. It inserts the following "gaps" into your first structure:

struct mystruct_A {
    char a;
    char gap_0[3]; /* inserted by compiler: for alignment of b */
    int b;
    char c;
    char gap_1[3]; /* -"-: for alignment of the whole struct in an array */
} x;

Packing, on the other hand prevents compiler from doing padding - this has to be explicitly requested - under GCC it's __attribute__((__packed__)), so the following:

struct __attribute__((__packed__)) mystruct_A {
    char a;
    int b;
    char c;
};

would produce structure of size 6 on a 32-bit architecture.

A note though - unaligned memory access is slower on architectures that allow it (like x86 and amd64), and is explicitly prohibited on strict alignment architectures like SPARC.

Is there any speed penalty to use packed vertex structure?

Based on this Metal Shading Language Specification

You cannot use the stage_in attribute to declare members of the structure that are packed
vectors, matrices, structures, bitfields, references or pointers to a type, or arrays of scalars,
vectors, or matrices.
MSL functions and arguments have these additional restrictions:
The return type of a vertex or fragment function cannot include an
element that is a packed vector type, matrix type, a structure type,
a reference, or a pointer to a type.
You can use an array index to access components of a packed vector data type. However, you
cannot use the .xyzw or .rgba selection syntax to access components of a packed vector data
type.

Is there any speed penalty to use packed vertex structure?

It's described very well in this answer, in short you would benefit from using it, in terms of speed, especially when passing around a lot of data.

To pack or not to pack a structure containing just an array

A. Is the packed attribute necessary and sufficient condition for
sizeof(Digest) to always return the correct size (= 512 bits or 64
bytes)?

It is sufficient.

B. Is digest->bits[i] a safe operation on all architectures while we
keep the packed attribute?

I think that you do not understand __attribute__((packed)). Below is what is does actually.

When packed is used in a structure declaration, it will compress its
fields such, such that, sizeof(structure) == sizeof(first_member) +
... + sizeof(last_member).

Here is the url to the resource of the above statment Effects of __attribute__((packed)) on nested array of structures?

EDIT:

Of course it is safe. Packing defines layout in memory but don't worry because accessing specific data type is handled by the compiler even if data is misaligned.

C. Can we simplify the representation while keeping the container
opaque?

Yes, you can just define a simple buffer uint32_t bits[LENGTH]; and it will work in this same manner for you.

D. Is there a run-time penalty to pay if we keep the representation?

Generally speaking yes. Packing enforces compiler to do not perform padding in data structure between members. Padding in data structure makes physical object larger however the access to the singular fields is faster, because it is just read operation do not require read, mask and rotation for instance.

Please check below this very simple program showing the effect of packing on struct size.

#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1) 
typedef struct _aaa_t {
  uint16_t a;
  uint8_t b;
  uint8_t c;
  uint8_t d;
} aaa_t;
#pragma pack(pop)

typedef struct _bbb_t {
  uint16_t a;
  uint8_t b;
  uint8_t c;
  uint8_t d;
} bbb_t;

int main(void) {
    aaa_t a;
    bbb_t b;
    printf("%d\n", sizeof(a));
    printf("%d\n", sizeof(b));
    printf("%p\n", &(a.a));
    printf("%p\n", &(a.b));
    printf("%p\n", &(a.c));
    printf("%p\n", &(a.d));
    printf("%p\n", &(b.a));
    printf("%p\n", &(b.b));
    printf("%p\n", &(b.c));
    printf("%p\n", &(b.d));
    return 0;
}

Program output:

5
6
0xbf9ea115
0xbf9ea117
0xbf9ea118
0xbf9ea119
0xbf9ea11a
0xbf9ea11c
0xbf9ea11d
0xbf9ea11e

Explanation:

Packed:
     ____________ _______ _______ _______ _______
    |            |       |       |       |       |
    | 0xbf9ea115 | msb_a | lsb_a | lsb_b | lsb_c |
    |____________|_______|_______|_______|_______|
    |            |       |
    | 0xbf9ea119 | lsb_d |
    |____________|_______|

Not Packed:
     ____________ _______ _______ _______ _______
    |            |       |       |       |       |
    | 0xbf9ea11a | msb_a | lsb_a | lsb_b | lsb_c |
    |____________|_______|_______|_______|_______|
    |            |       |       |
    | 0xbf9ea11e | lsb_c |  pad  |
    |____________|_______|_______|

Compiler does that in order to generate code which accesses data types faster than code without padding and memory alignment optimizations.

You can run my code under this link demo program

Difference Between Packed VS Normal Data Type