
# Difference Between Packed VS Normal Data Type

## Difference between packed vs normal data type

This information is from here

`float4` has an alignment of `16` bytes. This means that the memory address of such a type (e.g. `0x12345670`) will be divisible by `16` (aka the last hexadecimal digit is `0`).

`packed_float4`, on the other hand, has an alignment of `4` bytes. The last digit of the address will be `0`, `4`, `8`, or `c`.

This does matter when you create custom structs. Say you want a struct with 2 normal `float`s and 1 `float4`/`packed_float4`:

```metal
struct A {
    float x, y;
    float4 z;
};

struct B {
    float x, y;
    packed_float4 z;
};
```

For `A`: The alignment of `float4` has to be `16` and since `float4` has to be after the normal `float`s, there is going to be `8` bytes of empty space between `y` and `z`. Here is what `A` looks like in memory:

```
Address | 0x200 | 0x204 | 0x208 | 0x20c | 0x210 | 0x214 | 0x218 | 0x21c |
Content |   x   |   y   |   -   |   -   |   z1  |   z2  |   z3  |   z4  |
                                            ^ has to be 16-byte aligned
```

For `B`: Alignment of `packed_float4` is `4`, the same as `float`, so it can follow right after the `float`s in any case:

```
Address | 0x200 | 0x204 | 0x208 | 0x20c | 0x210 | 0x214 |
Content |   x   |   y   |   z1  |   z2  |   z3  |   z4  |
```

As you can see, `A` takes up `32` bytes whereas `B` only uses `24` bytes. When you have an array of those structs, `A` will take up `8` more bytes for every element. So for passing around a lot of data, the latter is preferred.

The reason you need `float4` at all is that the GPU can't handle `4`-byte-aligned `packed_float4`s everywhere: you won't be able to return a `packed_float4` from a shader. I assume this restriction exists for performance reasons.

One last thing: When you declare the Swift version of a struct:

```swift
struct S {
    let x, y: Float
    let z: (Float, Float, Float, Float)
}
```

This struct will match `B` in Metal, not `A`: a tuple behaves like a `packed_floatN`.

All of this also applies to the other packed vector types, such as `packed_float3`, `packed_short2`, etc.

## What is packed and unpacked and extended packed data

Well, I've just been searching for the answer to the same question, and also with no success. So I can only be guessing.

Intel introduced packed and scalar instructions already in their MMX technology. For example, they introduced a function

```c
__m64 _mm_add_pi8 (__m64 a, __m64 b)
```

At that time there was no such thing as "extended packed". The only data type was `__m64`, and all operations worked on integers.
With SSE came 128-bit registers and operations on floating-point numbers. SSE2 then included a superset of the MMX integer operations, performed in 128-bit registers. For example,

```c
__m128i _mm_add_epi8 (__m128i a, __m128i b)
```

Here for the first time we see the "ep" ("extended packed") part of the function name. Why was it introduced? I believe it was a solution to the problem of the name `_mm_add_pi8` already being taken by the MMX intrinsic listed above. The interface of SSE/AVX is in the C language, where there's no polymorphism of function names.

With AVX, Intel chose a different strategy, and started to add the register length just after the opening "_mm" letters, c.f.:

```c
__m256i _mm256_add_epi8 (__m256i a, __m256i b)
__m512i _mm512_add_epi8 (__m512i a, __m512i b)
```

Why they chose "ep" here and not "p" is a mystery, irrelevant for programmers. In practice, they seem to use "p" for operations on floats and doubles and "ep" for integers:

```c
__m128d _mm_add_pd (__m128d a, __m128d b);  // "d": function operates on doubles
__m256  _mm256_add_ps (__m256 a, __m256 b); // "s": function operates on floats
```

Perhaps this goes back to the transition from MMX to SSE, where "ep" was introduced for operations on integers (MMX handled no floats), combined with an attempt to keep the AVX mnemonics as close to the SSE ones as possible.

Thus, basically, from the perspective of a programmer, there's no difference between "ep" ("extended packed") and "p" ("packed"), for we are already aware of the register length that we target in our code.

As for the next part of the question, "unpacking" belongs to a completely different category of notions than "scalar" and "packed". This is rather a colloquial term for a particular data rearrangement or shuffle, like rotation or shift.

The reason for using "epi" in the name of intrinsics like `_mm256_unpackhi_epi16` is that it is a truly vector (not scalar) function on a vector of 16-bit integer elements. Notice that here "unpack" belongs to the part of the function name that describes its action (like mul, add, or permute), whereas "s" / "p" / "ep" (scalar, packed, extended packed) belong to the part describing the operation mode (scalar for "s", vector for "p" or "ep").

(There are no scalar-integer instructions that operate between two XMM registers, but "si" does appear in the intrinsic name for `movd eax, xmm0`: `_mm_cvtsi128_si32`. There are a few similar intrinsics.)

## What is the difference between #pragma pack and __attribute__((aligned))

`#pragma pack(byte-alignment)` affects each member of the struct: each is aligned to the specified byte-alignment or to its natural alignment boundary, whichever is less.

The `__attribute__((aligned(byte-alignment)))` affect the minimum alignment of the variable (or struct field if specified within the struct)

I believe the following are equivalent:

```c
#define L1_CACHE_LINE 2

struct A {
    u_int32_t a __attribute__((aligned(L1_CACHE_LINE)));
    u_int32_t b __attribute__((aligned(L1_CACHE_LINE)));
    u_int16_t c __attribute__((aligned(L1_CACHE_LINE)));
    u_int16_t d __attribute__((aligned(L1_CACHE_LINE)));
    u_int32_t e __attribute__((aligned(L1_CACHE_LINE)));
};

#pragma pack(L1_CACHE_LINE)
struct A {
    u_int32_t a;
    u_int32_t b;
    u_int16_t c;
    u_int16_t d;
    u_int32_t e;
};
#pragma pack()
```

Whereas `a __attribute__((aligned(L1_CACHE_LINE)))` ensures that `u_int32_t a` inside `struct A` is aligned to a 2-byte boundary, it does not align the other members in the same manner.

Reference:

1. http://publib.boulder.ibm.com/infocenter/macxhelp/v6v81/index.jsp?topic=%2Fcom.ibm.vacpp6m.doc%2Fcompiler%2Fref%2Frnpgpack.htm
2. http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/attributes-variables.html

## What is a packed structure in C?

When structures are defined, the compiler is allowed to add paddings (spaces without actual data) so that members fall in address boundaries that are easier to access for the CPU.

For example, on a 32-bit CPU, 32-bit members should start at addresses that are multiple of 4 bytes in order to be efficiently accessed (read and written). The following structure definition adds a 16-bit padding between both members, so that the second member falls in a proper address boundary:

```c
struct S {
    int16_t member1;
    int32_t member2;
};
```

The structure in memory of the above structure in a 32-bit architecture is (~ = padding):

```
+---------+---------+
| m1 |~~~~|   m2    |
+---------+---------+
```

When a structure is packed, these paddings are not inserted. The compiler has to generate more code (which runs slower) to extract the non-aligned data members, and also to write to them.

The same structure, when packed, will appear in memory as something like:

```
+---------+---------+
| m1 |   m2    |~~~~
+---------+---------+
```

## Differences in Python's packed binary data size among platforms?

Thanks @jasonharper:

You have to start your `struct` format string with one of the standard byte order/size/alignment indicators (usually `<` or `>`) in order to get any sort of cross-platform compatibility.

## Structure padding and packing

Padding aligns structure members to "natural" address boundaries - say, `int` members would have offsets that are `mod(4) == 0` on a 32-bit platform. Padding is on by default. It inserts the following "gaps" into your first structure:

```c
struct mystruct_A {
    char a;
    char gap_0[3]; /* inserted by compiler: for alignment of b */
    int b;
    char c;
    char gap_1[3]; /* -"-: for alignment of the whole struct in an array */
} x;
```

Packing, on the other hand, prevents the compiler from adding padding - this has to be explicitly requested - under GCC it's `__attribute__((__packed__))`, so the following:

```c
struct __attribute__((__packed__)) mystruct_A {
    char a;
    int b;
    char c;
};
```

would produce a structure of size `6` on a 32-bit architecture.

A note though - unaligned memory access is slower on architectures that allow it (like x86 and amd64), and is explicitly prohibited on strict alignment architectures like SPARC.

## Is there any speed penalty to use **packed** vertex structure?

This is based on the Metal Shading Language Specification:

- You cannot use the `stage_in` attribute to declare members of the structure that are packed vectors, matrices, structures, bitfields, references or pointers to a type, or arrays of scalars, vectors, or matrices.

- MSL functions and arguments have these additional restrictions: the return type of a vertex or fragment function cannot include an element that is a packed vector type, matrix type, a structure type, a reference, or a pointer to a type.

- You can use an array index to access components of a packed vector data type. However, you cannot use the `.xyzw` or `.rgba` selection syntax to access components of a packed vector data type.

Is there any speed penalty to use packed vertex structure?

It's described very well in the first answer above; in short, you would benefit from using packed types, in terms of speed, especially when passing around a lot of data.

## To pack or not to pack a structure containing just an array

A. Is the packed attribute a necessary and sufficient condition for `sizeof(Digest)` to always return the correct size (= 512 bits or 64 bytes)?

It is sufficient.

B. Is `digest->bits[i]` a safe operation on all architectures while we keep the packed attribute?

I think that you do not understand `__attribute__((packed))`. Below is what it actually does.

When packed is used in a structure declaration, it will compress its fields such that sizeof(structure) == sizeof(first_member) + ... + sizeof(last_member).

Here is the URL to the source of the above statement: Effects of __attribute__((packed)) on nested array of structures?

EDIT:

Of course it is safe. Packing defines the layout in memory, but don't worry: access to a specific data type is handled by the compiler even if the data is misaligned.

C. Can we simplify the representation while keeping the container opaque?

Yes, you can just define a simple buffer `uint32_t bits[LENGTH];` and it will work in the same manner for you.

D. Is there a run-time penalty to pay if we keep the representation?

Generally speaking, yes. Packing forces the compiler not to insert padding between the members of a data structure. Padding makes the physical object larger, but access to the individual fields is faster, because an aligned access is a plain read and does not require read, mask, and shift operations.

The very simple program below shows the effect of packing on struct size and member addresses.

```c
#include <stdio.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct _aaa_t {
    uint16_t a;
    uint8_t b;
    uint8_t c;
    uint8_t d;
} aaa_t;
#pragma pack(pop)

typedef struct _bbb_t {
    uint16_t a;
    uint8_t b;
    uint8_t c;
    uint8_t d;
} bbb_t;

int main(void) {
    aaa_t a;
    bbb_t b;
    printf("%zu\n", sizeof(a)); /* %zu: sizeof yields a size_t */
    printf("%zu\n", sizeof(b));
    printf("%p\n", (void *)&a.a);
    printf("%p\n", (void *)&a.b);
    printf("%p\n", (void *)&a.c);
    printf("%p\n", (void *)&a.d);
    printf("%p\n", (void *)&b.a);
    printf("%p\n", (void *)&b.b);
    printf("%p\n", (void *)&b.c);
    printf("%p\n", (void *)&b.d);
    return 0;
}
```

Program output:

```
5
6
0xbf9ea115
0xbf9ea117
0xbf9ea118
0xbf9ea119
0xbf9ea11a
0xbf9ea11c
0xbf9ea11d
0xbf9ea11e
```

Explanation:

```
Packed:
 ____________ _______ _______ _______ _______
|            |       |       |       |       |
| 0xbf9ea115 | msb_a | lsb_a | lsb_b | lsb_c |
|____________|_______|_______|_______|_______|
|            |       |
| 0xbf9ea119 | lsb_d |
|____________|_______|

Not Packed:
 ____________ _______ _______ _______ _______
|            |       |       |       |       |
| 0xbf9ea11a | msb_a | lsb_a | lsb_b | lsb_c |
|____________|_______|_______|_______|_______|
|            |       |       |
| 0xbf9ea11e | lsb_d |  pad  |
|____________|_______|_______|
```

The compiler does this in order to generate code that accesses data types faster than code without padding and memory-alignment optimizations.

