Casting Double Array to a Struct of Doubles

Is there a way to convert a double array to a struct array?

Use struct and num2cell:

data = [1,2;3,4];
S = struct ('data', num2cell(data));

Can a struct of doubles be typecast to an array of doubles in C?

This leads to undefined behaviour. The standard does not fully prescribe the layout of a struct; for instance, the compiler may insert padding between or after members.
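A minimal sketch of how the cast can be avoided, written in C++ (the same memcpy idea applies in C). The names `Vec3`, `to_array` and `to_array_memcpy` are hypothetical and chosen only for illustration. The first function copies member by member and is always well defined; the second copies bytes, and compiles only when the compiler inserted no padding.

```cpp
#include <array>
#include <cstring>
#include <type_traits>

// Hypothetical struct of doubles, used only for illustration.
struct Vec3 {
    double x, y, z;
};

// Member-by-member copy: well defined regardless of any padding.
inline std::array<double, 3> to_array(const Vec3& v) {
    std::array<double, 3> out = {{v.x, v.y, v.z}};
    return out;
}

// Byte-wise copy: only meaningful if the struct has no padding,
// which the static_asserts verify at compile time.
inline std::array<double, 3> to_array_memcpy(const Vec3& v) {
    static_assert(std::is_trivially_copyable<Vec3>::value,
                  "Vec3 must be trivially copyable");
    static_assert(sizeof(Vec3) == 3 * sizeof(double),
                  "Vec3 has unexpected padding");
    std::array<double, 3> out;
    std::memcpy(out.data(), &v, sizeof(Vec3));
    return out;
}
```

If the second static_assert fires on some platform, the member-by-member version remains usable there.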

Is it legal to cast array of wrapper structs containing POD to the array of POD type it contains?

No, this code is not valid, for several reasons. First, casting a pointer to the first member of the struct to the struct itself violates the strict aliasing rule. You can fix this by making Wrapper a child class of Data.

The second issue is more problematic: you are trying to treat an array (a vector in this case) polymorphically. sizeof(Data) differs from sizeof(Wrapper), so indexing an array of Wrapper elements as if it were an array of Data elements uses the wrong stride, and every element after the first ends up pointing at the wrong place in the array.
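The stride problem can be sidestepped by copying instead of reinterpreting. The sketch below assumes hypothetical definitions of `Data` and `Wrapper` (the originals are not shown here) and a hypothetical helper `unwrap`; it reads each Data through its real Wrapper object, so the differing element sizes do not matter.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical POD payload; the original definition is not shown.
struct Data {
    double value;
};

// Hypothetical wrapper with an extra member, so sizeof(Wrapper) > sizeof(Data).
struct Wrapper {
    Data data;
    int tag;
};

// Element-wise copy: each Data is accessed as a member of its Wrapper,
// so no pointer cast and no assumption about array stride is needed.
inline std::vector<Data> unwrap(const std::vector<Wrapper>& in) {
    std::vector<Data> out;
    out.reserve(in.size());
    std::transform(in.begin(), in.end(), std::back_inserter(out),
                   [](const Wrapper& w) { return w.data; });
    return out;
}
```

This costs one copy of the payload, which is the price of staying within defined behaviour.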

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double-float conversion?

Loosely inspired by Intel's 4x3 transposition example and based on @PeterCordes's solution, here is an AVX1 solution, which should reach a throughput of 8 structs per 8 cycles (the bottleneck is still p5):

#include <immintrin.h>
#include <stddef.h>

struct f2u { 
  float O1, O2;
  unsigned int Offset;
};
static const unsigned uiDefaultOffset = 123;

void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m256 voffset = _mm256_castsi256_ps(_mm256_set1_epi32(uiDefaultOffset));

    // 8 structs per iteration
    ptrdiff_t i=0;
    for(; i<len-7; i+=8)
    {
        // destination address for next 8 structs as float*:
        float* dst_f = reinterpret_cast<float*>(dst + i);

        // 4*vcvtpd2ps    --->  4*(p1,p5,p23)
        __m128 inA3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));
        __m128 inB3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));
        __m128 inA7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i+4]));
        __m128 inB7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i+4]));

        // 2*vinsertf128  --->  2*p5
        __m256 A76543210 = _mm256_set_m128(inA7654,inA3210);
        __m256 B76543210 = _mm256_set_m128(inB7654,inB3210);

        // 2*vpermilps    --->  2*p5
        __m256 A56741230 = _mm256_shuffle_ps(A76543210,A76543210,_MM_SHUFFLE(1,2,3,0));
        __m256 B67452301 = _mm256_shuffle_ps(B76543210,B76543210,_MM_SHUFFLE(2,3,0,1));

        // 6*vblendps     ---> 6*p015 (does not need to use p5)
        __m256 outA1__B0A0 = _mm256_blend_ps(A56741230,B67452301,2+16*2);
        __m256 outA1ccB0A0 = _mm256_blend_ps(outA1__B0A0,voffset,4+16*4);

        __m256 outB2A2__B1 = _mm256_blend_ps(B67452301,A56741230,4+16*4);
        __m256 outB2A2ccB1 = _mm256_blend_ps(outB2A2__B1,voffset,2+16*2);

        __m256 outccB3__cc = _mm256_blend_ps(voffset,B67452301,4+16*4);
        __m256 outccB3A3cc = _mm256_blend_ps(outccB3__cc,A56741230,2+16*2);

        // 3* vmovups     ---> 3*(p237,p4)
        _mm_storeu_ps(dst_f+ 0,_mm256_castps256_ps128(outA1ccB0A0));
        _mm_storeu_ps(dst_f+ 4,_mm256_castps256_ps128(outB2A2ccB1));
        _mm_storeu_ps(dst_f+ 8,_mm256_castps256_ps128(outccB3A3cc));
        // 3*vextractf128 ---> 3*(p23,p4)
        _mm_storeu_ps(dst_f+12,_mm256_extractf128_ps(outA1ccB0A0,1));
        _mm_storeu_ps(dst_f+16,_mm256_extractf128_ps(outB2A2ccB1,1));
        _mm_storeu_ps(dst_f+20,_mm256_extractf128_ps(outccB3A3cc,1));
    }

    // scalar cleanup for the remaining elements when len is not a multiple of 8
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

Godbolt link, with minimal test-code at the end: https://godbolt.org/z/0kTO2b

For some reason, gcc does not like to generate a vcvtpd2ps that converts directly from memory to a register. This works better with aligned loads (having input and output aligned is likely beneficial anyway). And clang apparently wants to outsmart me with one of the vextractf128 instructions at the end.
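When experimenting with variants of the vectorized loop, a plain scalar version is handy as a reference to compare against. The following is only a sketch: `cvt_interleave_scalar` and `same_structs` are hypothetical helper names, the offset is passed as a parameter rather than read from a global, and the struct definition is repeated so the block stands alone.

```cpp
#include <cstddef>

struct f2u {
    float O1, O2;
    unsigned int Offset;
};

// Hypothetical scalar reference: converts and interleaves one struct per iteration.
inline void cvt_interleave_scalar(f2u* dst, const double* pA, const double* pB,
                                  std::ptrdiff_t len, unsigned offset) {
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = offset;
    }
}

// Hypothetical helper: member-wise comparison of two output buffers.
inline bool same_structs(const f2u* x, const f2u* y, std::ptrdiff_t len) {
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        if (x[i].O1 != y[i].O1 || x[i].O2 != y[i].O2 || x[i].Offset != y[i].Offset)
            return false;
    }
    return true;
}
```

Running both versions on the same input, with a length that is not a multiple of 8, also exercises the scalar cleanup path of the SIMD function.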

Casting a managed array to an array of structs without copying

There is actually a cheat, but it is an ugly, totally unsafe cheat:

[StructLayout(LayoutKind.Sequential)]
//[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct DataStructure
{
    public int Id;
    public double Value;
}

[StructLayout(LayoutKind.Explicit)]
public struct DataStructureConverter
{
    [FieldOffset(0)]
    public int[] IntArray;

    [FieldOffset(0)]
    public DataStructure[] DataStructureArray;
}

and then you can convert it without problems:

var myarray = new int[8];
myarray[0] = 1;
myarray[3] = 2;
//myarray[4] = 2;

DataStructure[] ds = new DataStructureConverter { IntArray = myarray }.DataStructureArray;

int i1 = ds[0].Id;
int i2 = ds[1].Id;

Note that whether you need Pack = 4 depends on the size DataStructure must have to match your data: if it must be 12 bytes, use Pack = 4; if it must be 16 bytes, you don't need anything (see explanation (1) later).

I'll add that this technique is undocumented and totally unsafe. It even has a problem: ds.Length isn't the length of the DataStructure[] but is the length of the int[] (so in the example given it is 8, not 2)

The "technique" is the same I described here and originally described here.

explanation (1)

Since sizeof(double) is 8 bytes, Value is normally aligned on an 8-byte boundary, which leaves a 4-byte gap between Id (sizeof(int) == 4) and Value. So normally sizeof(DataStructure) == 16. Depending on how the source data was laid out, this gap may be absent, hence the Pack = 4, which forces alignment on a 4-byte boundary.
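The same alignment arithmetic can be observed in C++, shown here only as an analogy to the C# layout above. The sketch assumes a platform where int is 4 bytes and double is 8 bytes, and relies on `#pragma pack`, a compiler extension supported by GCC, Clang and MSVC. The struct names `Natural` and `Packed` are hypothetical.

```cpp
#include <cstddef>

// Natural layout: the compiler may pad after `id` so that `value`
// starts on its preferred boundary (typically 16 bytes total on x86-64).
struct Natural {
    int id;
    double value;
};

// Packed to 4-byte alignment: `value` follows `id` directly, with no gap.
#pragma pack(push, 4)
struct Packed {
    int id;
    double value;
};
#pragma pack(pop)

constexpr std::size_t natural_size = sizeof(Natural);
constexpr std::size_t packed_size = sizeof(Packed);
constexpr std::size_t packed_value_offset = offsetof(Packed, value);
```

Under the stated assumptions, `packed_size` is 12 and `packed_value_offset` is 4, matching the Pack = 4 case; `natural_size` depends on the platform's alignment of double.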