Allocating Shared Memory

Allocating a user defined struct in shared memory with boost::interprocess

Just don't do manual allocation. If you want a contiguous allocation of aSize elements of type char, that's what C++ has std::vector for.

Best of all, std::vector already knows how to use another allocator, so there is really no reason not to use it:

template <typename Alloc>
struct MyType {
    explicit MyType(size_t aSize, Alloc alloc = {}) : mData(aSize, alloc) {}

  private:
    std::vector<char, Alloc> mData;
};

Now to play well with standard library construct/scoped allocators, you might want to define the allocator_type nested type:

    using allocator_type = Alloc; // typename Alloc::template rebind<char>::other;

That's all. Just use it as any standard library type that has an allocator:

int main() {
    using namespace Shared;

    Shared::remove("MySharedMem");
    auto memory = Segment(create_only, "MySharedMem", 65536);

    using A = Alloc<char>;
    A alloc(memory.get_segment_manager());

    auto* data = memory.find_or_construct<MyType<A>>("data")(1024, alloc);

    return data ? 0 : 255;
}

I created a few convenience typedefs in the Shared namespace for maintainability. Here's the full sample:

Full Sample

Live On Coliru ¹

#include <boost/interprocess/managed_shared_memory.hpp>
#include <vector>

template <typename Alloc>
struct MyType {
    using allocator_type = typename Alloc::template rebind<char>::other;

    explicit MyType(size_t aSize, Alloc alloc = {}) : mData(aSize, alloc) {}

  private:
    std::vector<char, Alloc> mData;
};

namespace Shared {
    namespace bip = boost::interprocess;

    using Segment = bip::managed_shared_memory;
    using Manager = Segment::segment_manager;
    template <typename T>
    using Alloc = bip::allocator<T, Manager>;

    void remove(char const* name) { bip::shared_memory_object::remove(name); }

    using bip::create_only;
}

int main() {
    using namespace Shared;

    Shared::remove("MySharedMem");
    auto memory = Segment(create_only, "MySharedMem", 65536);

    using A = Alloc<char>;
    A alloc(memory.get_segment_manager());

    auto* data = memory.find_or_construct<MyType<A>>("data")(1024, alloc);

    return data ? 0 : 255;
}

¹ The Coliru version uses a managed mapped file, because shared memory is not supported there.

Using both dynamically-allocated and statically-allocated shared memory

The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is sized at compile time, and each declaration is an actual allocation. Enabling ptxas info during compilation illustrates this:

  ptxas info    : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]

Here, we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).

Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access a shared memory declaration made elsewhere.

Then comes the dynamically allocated shared memory, whose reserved size is declared at kernel launch.

Here is an example of various uses in a couple of functions. Note the pointer value of each shared memory region.

#include <cstdio>

__device__ void dev1()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;

    if (threadIdx.x == 0)
        printf("dev1 : %p\n", (void*)a);
}

__device__ void dev2()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;

    if (threadIdx.x == 0)
        printf("dev2 : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];
    extern __shared__ int b[];

    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;

    dev1();
    __syncthreads();
    dev2();
    __syncthreads();

    res[threadIdx.x] = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];

    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;

    cudaMalloc(&dres, 32 * sizeof(int));
    cudaMalloc(&dres2, 32 * sizeof(int));

    kernel<<<1, 32, 32 * sizeof(int)>>>(dres, dres2);

    int hres[32];
    int hres2[32];

    cudaMemcpy(hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);

    for (int k = 0; k < 32; ++k)
    {
        printf("%d -- %d\n", hres[k], hres2[k]);
    }
    return 0;
}

Compiling this code, the ptxas info reports 384 bytes of smem: one allocation for the kernel's a array, a second for dev1's a array, and a third for dev2's a array, totalling 3 * 32 * sizeof(int) = 384 bytes.

When running the kernel with dynamic shared memory equal to 32 * sizeof(int), the pointer b starts right after these three arrays.

EDIT:
The PTX file generated by this code holds the declarations of the statically-defined shared memory,

.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];

except for the entry point, where it is defined in the body of the function:

// _ZZ6kernelPiS_E1a has been demoted

The shared state space is defined in the PTX documentation here:

The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.

It says nothing, though, about the runtime layout. The programming guide has a word on it here, with no further detail on mixing the two.

During PTX compilation, the compiler may know the amount of shared memory that is statically allocated, but there might be some supplemental magic. Looking at the SASS, the first instructions use SR_LMEMHIOFF:

IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;

and calling functions in reverse order assigns different addresses to the statically-allocated shared memory (it looks very much like a form of stack allocation).

I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when every method may be called (when one of the methods is not used and function pointers are involved, the address of b does not change, and the unallocated shared memory region is never accessed).

Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.

dynamic allocation of shared memory for array of data and pointers to data in CUDA

You can use two different pointers, with different types, to point to the same block of shared memory:

extern __shared__ char dataShared[];
double ** columns = (double **) dataShared; //Here you can store pointers to columns
double * realData = (double *) (dataShared + N * sizeof(double *)); //N is the number of columns

Here you use one block of shared memory, but use different offsets for 2 regions of it (offset 0 for the pointers to columns and offset N * sizeof(double *) for your actual data).


