Cudamalloc of a Structure and an Element of Same Structure

cudaMalloc of a structure and an element of same structure

I would suggest that you put some effort into compiling and running your codes with proper cuda error checking. Learning to interpret the compiler output and runtime output will make you a better, smarter, more efficient coder. I also suggest reviewing the writeup I previously pointed you at here. It deals with this exact topic, and includes linked worked examples. This question is a duplicate of that one.

There are various errors:

StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));

The above line of code creates an allocation in host memory for a structure of size StructA, and sets the pointer d_A pointing to the start of that allocation. Nothing wrong at the moment.

cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );

The above line of code creates an allocation in device memory of the size of StructA, and sets the pointer d_A pointing to the start of that allocation. This has effectively wiped out the previous pointer and allocation. (The previous host allocation is still somewhere, but you can't access it. It's basically lost.) Surely that was not your intent.

int *h_A = d_a->a;

Now that d_A (I assume you meant d_A, not d_a) has been assigned as a device memory pointer, the -> operation will dereference that pointer to locate the element a. This is illegal in host code and will throw an error (seg fault).

cudaMalloc( &(d_A->a), row*col*sizeof(int) );

This line of code has a similar issue. We cannot cudaMalloc a pointer that lives in device memory. cudaMalloc creates pointers that live in host memory but reference a location in device memory. This operation &(d_A->a) is dereferencing a device pointer, which is illegal in host code.

A proper code would be something like this:

$ cat t363.cu
#include <stdio.h>

typedef struct {
  int *a;
  int foo;
} StructA;

__global__ void kernel(StructA *data){

  printf("The value is %d\n", *(data->a + 2));
}

int main()
{
  int  numS = 1; // defined at runtime

  //allocate host memory for the structure storage
  StructA *h_A = (StructA*)malloc(numS * sizeof(StructA));
  //allocate host memory for the storage pointed to by the embedded pointer
  h_A->a = (int *)malloc(10*sizeof(int));
  // initialize data pointed to by the embedded pointer
  for (int i = 0; i <10; i++) *(h_A->a+i) = i;
  StructA *d_A;  // pointer for device structure storage
  //allocate device memory for the structure storage
  cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
  // create a pointer for cudaMalloc to use for embedded pointer device storage
  int *temp;
  //allocate device storage for the embedded pointer storage
  cudaMalloc((void **)&temp, 10*sizeof(int));
  //copy this newly created *pointer* to it's proper location in the device copy of the structure
  cudaMemcpy(&(d_A->a), &temp, sizeof(int *), cudaMemcpyHostToDevice);
  //copy the data pointed to by the embedded pointer from the host to the device
  cudaMemcpy(temp, h_A->a, 10*sizeof(int), cudaMemcpyHostToDevice);

  kernel<<<1, 1>>>(d_A); // Passing pointer to StructA in device
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t363 t363.cu
$ cuda-memcheck ./t363
========= CUDA-MEMCHECK
The value is 2
========= ERROR SUMMARY: 0 errors
$

You'll note that I haven't worked out the case where you are dealing with an array of StructA (i.e. numS > 1), that will require a loop. I'll leave it to you to work through the logic I've presented here and in my previous linked answer to see if you can work out the details of that loop. Furthermore, for the sake of clarity/brevity I've dispensed with the usual cuda error checking but please use it in your codes. Finally, this process (sometimes called a "deep copy operation") is somewhat tedious in ordinary CUDA if you haven't concluded that yet. Previous recommendations along these lines are to "flatten" such structures (so that they don't contiain pointers), but you can also explore cudaMallocManaged i.e. Unified Memory in CUDA 6.

How to use cudaMalloc / cudaMemcpy for a pointer to a structure containing pointers?

You have to be aware where your memory resides. malloc allocates host memory, cudaMalloc allocates memory on the device and returns a pointer to that memory back. However, this pointer is only valid in device functions.

What you want could be achived as followed:

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

int main void() {
    int rows, cols, numMat = 2; // These are actually determined at run-time
    Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));

    // ... Successfully read from file into "data" ...
    Matrix* h_data = (Matrix*)malloc(numMat * sizeof(Matrix));
    memcpy(h_data, data, numMat * sizeof(Matrix);

    for (int i=0; i<numMat; i++){

        cudaMalloc(&(h_data[i].elements), rows*cols*sizeof(float));
        cudaMemcpy(h_data[i].elements, data[i].elements,  rows*cols*sizeof(float)), cudaMemcpyHostToDevice);

     }// matrix data is now on the gpu, now copy the "meta" data to gpu
     Matrix* d_data;
     cudaMalloc(&d_data, numMat*sizeof(Matrix)); 
     cudaMemcpy(d_data, h_data, numMat*sizeof(Matrix));
     // ... Do other things ...
}

To make things clear:
Matrix* data contains the data on the host.
Matrix* h_data contains a pointer to the device memory in elements which can be passed to the kernels as parameters. The memory is on the GPU.
Matrix* d_data is completly on the GPU and can be used like data on the host.

in your kernel code you kann now access the matrix values, e.g.,

__global__ void doThings(Matrix* matrices)
{
      matrices[i].elements[0] = 42;
}

CUDA Memory Allocation for AoS inside a SoA

Finally after some further research i was able to find a solution, effectively as sugested the solution is to create host memory versions of each level of the structure. Following the full working example:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
    
/// <summary>
    /// Struct to define item
    /// </summary>
    /// <returns></returns>
struct item
{
    int id;
    float weight;
    float value;
    int node;
    int taken;
};

/// <summary>
/// Struct to define a node
/// </summary>
/// <returns></returns>
struct node
{
    int id;
    double x;
    double y;
    int item_qty;
    item* items;
};

/// <summary>
/// Struct to define a tour
/// </summary>
/// <returns></returns>
struct tour
{
    int id;
    int node_qty;
    node* nodes;
};

/// <summary>
/// Struct to define population
/// </summary>
/// <returns></returns>
struct population
{
    int id;
    int tour_qty;
    tour* tours;
};

static void HandleError(cudaError_t err, const char* file, int line)
{
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        getchar();
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

void printStructure(population* pop, int pop_size, int tour_size);

__global__ void populationTest(population* population, int population_size)
{
    for (int p = 0; p < population_size; ++p)
    {
        printf(" > population[%d].id: %d\n", p, population[p].id);
        printf(" > population[%d].tour_qty: %d\n", p, population[p].tour_qty);
        if (population[p].tour_qty > 0)
        {
            for (int t = 0; t < population[p].tour_qty; ++t)
            {
                printf(" > population[%d].tours[%d].node_qty: %d\n", p, t, population[p].tours[t].node_qty);
                if (population[p].tours[t].node_qty > 0)
                {
                    for (int n = 0; n < population[p].tours[t].node_qty; ++n)
                    {
                        printf(" > population[%d].tours[%d].nodes[%d].id: %d\n", p, t, n, population[p].tours[t].nodes[n].id);
                        printf(" > population[%d].tours[%d].nodes[%d].x: %lf\n", p, t, n, population[p].tours[t].nodes[n].x);
                        printf(" > population[%d].tours[%d].nodes[%d].y: %lf\n", p, t, n, population[p].tours[t].nodes[n].y);
                        printf(" > population[%d].tours[%d].nodes[%d].item_qty: %d\n", p, t, n, population[p].tours[t].nodes[n].item_qty);
                        if (population[p].tours[t].nodes[n].item_qty > 0)
                        {
                            for (int i = 0; i < population[p].tours[t].nodes[n].item_qty; ++i)
                            {
                                printf(" > population[%d].tours[%d].nodes[%d].items[%d].id: %d\n", p, t, n, i, population[p].tours[t].nodes[n].items[i].id);
                                printf(" > population[%d].tours[%d].nodes[%d].items[%d].node: %d\n", p, t, n, i, population[p].tours[t].nodes[n].items[i].node);
                                printf(" > population[%d].tours[%d].nodes[%d].items[%d].taken: %d\n", p, t, n, i, population[p].tours[t].nodes[n].items[i].taken);
                                printf(" > population[%d].tours[%d].nodes[%d].items[%d].value: %f\n", p, t, n, i, population[p].tours[t].nodes[n].items[i].value);
                                printf(" > population[%d].tours[%d].nodes[%d].items[%d].weight: %f\n", p, t, n, i, population[p].tours[t].nodes[n].items[i].weight);
                            }
                        }
                    }
                }
            }
        }
    }
    printf("\n\n");
}

int main()
{
    // Get user defined values
    int population_size = 1;
    int tour_size = 10; 
    int node_size = 5;
    int item_size = 4;

#pragma region ALLOCATE CPU MEMORY

    // Declare pointers
    population* host_population;
    tour* host_tour;
    node* host_node;
    item* host_item;

    // Allocate host memory for population
    host_population = (population*)malloc(sizeof(population) * population_size);
    for (int p = 0; p < population_size; p++)
    {
        host_population[p].tours = (tour*)malloc(sizeof(tour) * tour_size);
    }

    // Allocate host memory for tour
    host_tour = (tour*)malloc(sizeof(tour) * tour_size);
    for (int t = 0; t < tour_size; t++)
    {
        host_tour[t].nodes = (node*)malloc(sizeof(node) * node_size);
    }

    // Allocate host memory for node
    host_node = (node*)malloc(sizeof(node) * node_size);
    for (int n = 0; n < node_size; n++)
    {
        host_node[n].items = (item*)malloc(sizeof(item) * item_size);
    }

    // Allocate memory for item
    //host_item = (item*)malloc(sizeof(item) * item_size);

    //Test for AoSoA
    host_item = (item*)malloc(sizeof(item) * item_size * node_size);

#pragma endregion

#pragma region FILL CPU DATA

    //Fill the full structure with information, for test purposes these values are going to be taken

    // 1. Item Data
    int item_id[4] = { 1,2,3,4 };
    float item_value[4] = { 300,50,30,40 };
    float item_weight[4] = { 400,200,40,2 };
    int item_node[4] = { 3,4,5,2 };

    // 2. Node Data
    int node_id[5] = { 1,2,3,4,5 };
    double node_x[5] = { 0,6,14,11,7 };
    double node_y[5] = { 0,-5,5,13,5 };
    int node_item[5] = { 0,1,1,1,1 };

    // 3. Tour Data
    int tour_id[10] = { 1,2,3,4,5,6,7,8,9,10 };

    // 4. Population Data
    int population_id = 1;

    for (int i = 0; i < item_size; i++)
    {
        host_item[i].id = item_id[i];
        host_item[i].value = item_value[i];
        host_item[i].taken = rand() % 2;
        host_item[i].node = item_node[i];
        host_item[i].weight = item_weight[i];
    }

    for (int n = 0; n < node_size; n++)
    {
        host_node[n].id = node_id[n];
        host_node[n].x = node_x[n];
        host_node[n].y = node_y[n];
        host_node[n].item_qty = node_item[n];
        for (int i = 0; i < item_size; i++)
        {
            if (host_node[n].id == host_item[i].node)
            {
                memcpy(host_node[n].items, &host_item[i], sizeof(item) * node_item[n]);
            }
        }
    }

    for (int t = 0; t < tour_size; t++)
    {
        host_tour[t].id = tour_id[t];
        host_tour[t].node_qty = node_size;
        memcpy(host_tour[t].nodes, host_node, sizeof(node) * node_size);
    }

    for (int p = 0; p < population_size; p++)
    {
        host_population[p].id = population_id;
        host_population[p].tour_qty = tour_size;
        memcpy(host_population[p].tours, host_tour, sizeof(tour) * tour_size);
    }

    printStructure(host_population, population_size, tour_size);

#pragma endregion

#pragma region ALLOCATE GPU MEMORY

    // Define pointers for device structs
    population* device_population;
    tour* device_tour;
    node* device_node;
    item* device_item;

    // Allocate device memory for population
    HANDLE_ERROR(cudaMalloc((void**)&device_population, sizeof(population) * size_t(population_size)));
    
    // Allocate device memory for tour
    HANDLE_ERROR(cudaMalloc((void**)&device_tour, sizeof(tour) * size_t(tour_size)));
    
    // Allocate device memory for node
    HANDLE_ERROR(cudaMalloc((void**)&device_node, sizeof(node) * size_t(node_size)));
    
    // Allocate device memory for item
    HANDLE_ERROR(cudaMalloc((void**)&device_item, sizeof(item) * size_t(item_size)));

    // Copy host item struct with device pointers to device
    HANDLE_ERROR(cudaMemcpy(device_item, host_item, sizeof(item) * size_t(item_size), cudaMemcpyHostToDevice));

    // Offset pointers
    for (int n = 0; n < node_size; ++n)
    {
        for (int i = 0; i < item_size; ++i)
        {
            if (host_node[n].id == host_item[i].node)
            {
                host_node[n].items = device_item + i;
            }
        }
    }

    // Copy host node struct with device pointers to device
    HANDLE_ERROR(cudaMemcpy(device_node, host_node, sizeof(node) * size_t(node_size), cudaMemcpyHostToDevice));

    for (int t = 0; t < tour_size; ++t)
    {
        host_tour[t].nodes = device_node;
    }

    // Copy host tour struct with device pointers to device
    HANDLE_ERROR(cudaMemcpy(device_tour, host_tour, sizeof(tour) * size_t(tour_size), cudaMemcpyHostToDevice));

    for (int p = 0; p < population_size; ++p)
    {
        host_population[p].tours = device_tour;
    }

    host_population->tour_qty = tour_size;

    HANDLE_ERROR(cudaMemcpy(device_population, host_population, sizeof(population) * size_t(population_size), cudaMemcpyHostToDevice));

   populationTest << <1, 1 >> > (device_population, population_size);
   HANDLE_ERROR(cudaDeviceSynchronize());

#pragma endregion

    return 0;
}

/// <summary>
/// Function to print the tree struct of population
/// </summary>
/// <param name="p"></param>
/// <param name="pop_size"></param>
/// <param name="tour_size"></param>
void printStructure(population* pop, int pop_size, int tour_size)
{
    for (int p = 0; p < pop_size; ++p)
    {
        printf(" > population[%d].id: %d\n", p, pop[p].id);
        for (int t = 0; t < tour_size; ++t)
        {
            printf(" > population[%d].tours[%d].id: %d\n", p, t, pop[p].tours[t].id);
            printf(" > population[%d].tours[%d].node_qty: %d\n", p, t, pop[p].tours[t].node_qty);
            if (pop[p].tours[t].node_qty > 0)
            {
                for (int n = 0; n < pop[p].tours[t].node_qty; ++n)
                {
                    printf(" > population[%d].tours[%d].nodes[%d].id: %d\n", p, t, n, pop[p].tours[t].nodes[n].id);
                    printf(" > population[%d].tours[%d].nodes[%d].x: %lf\n", p, t, n, pop[p].tours[t].nodes[n].x);
                    printf(" > population[%d].tours[%d].nodes[%d].y: %lf\n", p, t, n, pop[p].tours[t].nodes[n].y);
                    printf(" > population[%d].tours[%d].nodes[%d].item_qty: %d\n", p, t, n, pop[p].tours[t].nodes[n].item_qty);
                    if (pop[p].tours[t].nodes[n].item_qty > 0)
                    {
                        for (int i = 0; i < pop[p].tours[t].nodes[n].item_qty; ++i)
                        {
                            printf(" > population[%d].tours[%d].nodes[%d].items[%d].id: %d\n", p, t, n, i, pop[p].tours[t].nodes[n].items[i].id);
                            printf(" > population[%d].tours[%d].nodes[%d].items[%d].node: %d\n", p, t, n, i, pop[p].tours[t].nodes[n].items[i].node);
                            printf(" > population[%d].tours[%d].nodes[%d].items[%d].taken: %d\n", p, t, n, i, pop[p].tours[t].nodes[n].items[i].taken);
                            printf(" > population[%d].tours[%d].nodes[%d].items[%d].value: %f\n", p, t, n, i, pop[p].tours[t].nodes[n].items[i].value);
                            printf(" > population[%d].tours[%d].nodes[%d].items[%d].weight: %f\n", p, t, n, i, pop[p].tours[t].nodes[n].items[i].weight);
                        }
                    }
                }
            }
        }
    }
    printf("\n\n");
}

malloc for struct and pointer in C

No, you're not allocating memory for y->x twice.

Instead, you're allocating memory for the structure (which includes a pointer) plus something for that pointer to point to.

Think of it this way:

         1          2
        +-----+    +------+
y------>|  x------>|  *x  |
        |  n  |    +------+
        +-----+

You actually need the two allocations (1 and 2) to store everything you need.

Additionally, your type should be struct Vector *y since it's a pointer, and you should never cast the return value from malloc in C.

It can hide certain problems you don't want hidden, and C is perfectly capable of implicitly converting the void* return value to any other pointer.

And, of course, you probably want to encapsulate the creation of these vectors to make management of them easier, such as with having the following in a header file vector.h:

struct Vector {
    double *data;    // Use readable names rather than x/n.
    size_t size;
};

struct Vector *newVector(size_t sz);
void delVector(struct Vector *vector);
//void setVectorItem(struct Vector *vector, size_t idx, double val);
//double getVectorItem(struct Vector *vector, size_t idx);

Then, in vector.c, you have the actual functions for managing the vectors:

#include "vector.h"

// Atomically allocate a two-layer object. Either both layers
// are allocated or neither is, simplifying memory checking.

struct Vector *newVector(size_t sz) {
    // First, the vector layer.

    struct Vector *vector = malloc(sizeof (struct Vector));
    if (vector == NULL)
        return NULL;

    // Then data layer, freeing vector layer if fail.

    vector->data = malloc(sz * sizeof (double));
    if (vector->data == NULL) {
        free(vector);
        return NULL;
    }

    // Here, both layers worked. Set size and return.

    vector->size = sz;
    return vector;
}

void delVector(struct Vector *vector) {
    // Can safely assume vector is NULL or fully built.

    if (vector != NULL) {
        free(vector->data);
        free(vector);
    }
}

By encapsulating the vector management like that, you ensure that vectors are either fully built or not built at all - there's no chance of them being half-built.

It also allows you to totally change the underlying data structures in future without affecting clients. For example:

if you wanted to make them sparse arrays to trade off space for speed.
if you wanted the data saved to persistent storage whenever changed.
if you wished to ensure all vector elements were initialised to zero.
if you wanted to separate the vector size from the vector capacity for efficiency⁽¹⁾.

You could also add more functionality such as safely setting or getting vector values (see commented code in the header), as the need arises.

For example, you could (as one option) silently ignore setting values outside the valid range and return zero if getting those values. Or you could raise an error of some description, or attempt to automatically expand the vector under the covers⁽¹⁾.

In terms of using the vectors, a simple example is something like the following (very basic) main.c

#include "vector.h"

#include <stdio.h>

int main(void) {
    Vector myvec = newVector(42);
    myvec.data[0] = 2.718281828459;
    delVector(myvec);
}

⁽¹⁾ That potential for an expandable vector bears further explanation.

Many vector implementations separate capacity from size. The former is how many elements you can use before a re-allocation is needed, the latter is the actual vector size (always <= the capacity).

When expanding, you want to generally expand in such a way that you're not doing it a lot, since it can be an expensive operation. For example, you could add 5% more than was strictly necessary so that, in a loop continuously adding one element, it doesn't have to re-allocate for every single item.

Copying a struct containing pointers to CUDA device

Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.

Don't forget that you can pass structures by value to kernels. This code works:

// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
    in.arr[threadIdx.x] *= 2;
}

Doing so means you only have to copy the array to the device, not the structure:

int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;

// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);

// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);

// 3. Point to device pointer in host struct.
h_a.arr = d_arr;

// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);

// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

// 6. Point to host pointer in host struct 
//    (or do something else with it if this is not needed)
h_a.arr = h_arr;

Variable gets lost after allocating array of structs in cuda

The problem is in the function cudaGraphMalloc where you are trying to allocate device memory to the members of outGraph which has already been allocated on the device. In process of doing so, you are de-referencing a device pointer on host which is illegal.

To allocate device memory to members of struct type variable which exists on the device, we first have to create a temporary host variable of that struct type, then allocate device memory to its members, and then copy it to the struct which exists on the device.

I have answered a similar question here. Please take a look at it.

The fixed code may look like this:

#include <algorithm>
#include <cuda_runtime.h>
#include <cuda.h>

// A point, part of some elements
struct Node {

    float* position;

};

struct Graph {
    unsigned int nNode;
    Node* node;
    unsigned int nBoundary;
    unsigned int* boundary;
};
Graph* cudaGraphMalloc(const Graph* inGraph);
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void testKernel(Graph* graph, unsigned int * d_res) {
    d_res[0] = graph->nBoundary;

};
int main()
{

    // Generate some fake data on the CPU
    Graph graph;
    graph.node = (Node*)malloc(2 * sizeof(Node));
    graph.boundary = (unsigned int*)malloc(3 * sizeof(unsigned int));
    for (int i = 0; i < 3; i++) {
        graph.boundary[i] = i + 10;
    }
    graph.nBoundary = 3;
    graph.nNode = 2;
    for (int i = 0; i < 2; i++) {
        // They can have different sizes in the original code
        graph.node[i].position = (float*)malloc(3 * sizeof(float));
        graph.node[i].position[0] = 45;
        graph.node[i].position[1] = 1;
        graph.node[i].position[2] = 2;
    }

    // allocate GPU memory
    Graph * d_graph = cudaGraphMalloc(&graph);
    // some dummy variables to test on GPU.
    unsigned int * d_res, *h_res;
    cudaMalloc((void **)&d_res, sizeof(unsigned int));
    h_res = (unsigned int*)malloc(sizeof(unsigned int));

    //Run kernel
    testKernel << <1, 1 >> >(d_graph, d_res);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaMemcpy(h_res, d_res, sizeof(unsigned int), cudaMemcpyDeviceToHost));

    printf("%u\n", graph.nBoundary);
    printf("%u\n", h_res[0]);

    return 0;
}

Graph* cudaGraphMalloc(const Graph* inGraph) 
{
    //Create auxiliary Graph variable on host
    Graph temp;

    //copy constants
    temp.nNode = inGraph->nNode;
    temp.nBoundary = inGraph->nBoundary;

    // copy boundary
    gpuErrchk(cudaMalloc((void**)&(temp.boundary), inGraph->nBoundary * sizeof(unsigned int)));
    gpuErrchk(cudaMemcpy(temp.boundary, inGraph->boundary, inGraph->nBoundary * sizeof(unsigned int), cudaMemcpyHostToDevice));

    //Create nodes 
    size_t nodeBytesTotal = temp.nNode * sizeof(Node);
    gpuErrchk(cudaMalloc((void**)&(temp.node), nodeBytesTotal));

    for (int i = 0; i < temp.nNode; i++)
    {
        //Create auxiliary node on host
        Node auxNodeHost;

        //Allocate device memory to position member of auxillary node
        size_t nodeBytes = 3 * sizeof(float);
        gpuErrchk(cudaMalloc((void**)&(auxNodeHost.position), nodeBytes));
        gpuErrchk(cudaMemcpy(auxNodeHost.position, inGraph->node[i].position, nodeBytes, cudaMemcpyHostToDevice));

        //Copy auxillary host node to device
        Node* dPtr = temp.node + i;
        gpuErrchk(cudaMemcpy(dPtr, &auxNodeHost, sizeof(Node), cudaMemcpyHostToDevice));
    }

    Graph* outGraph;
    gpuErrchk(cudaMalloc((void**)&outGraph, sizeof(Graph)));
    gpuErrchk(cudaMemcpy(outGraph, &temp, sizeof(Graph), cudaMemcpyHostToDevice));

    return outGraph;
}

Be advised that you will have to keep the host copies of internal device pointers (i.e. the auxiliary host variables). This is because you will have to free the device memory later and since you will only have a device copy of Graph in the main code, you won't be able to access its members from the host to call cudaFree on them. In this case the variable Node auxNodeHost (created in each iteration) and Graph temp are those variables.

The above code does not do that and is just for demonstration purpose.

Tested on Windows 10, Visual Studio 2015, CUDA 9.2, NVIDIA Driver 397.44.

Simple operation on Structure in CUDA : Segmentation fault

There are several invalid memory access in the provided code.

Accessing device memory (allocated using cudaMalloc) from host like d_data->a will cause undefined behavior (segmentation fault etc.).
cudaMemcpy takes pointers as arguments, not address of pointer. So cudaMemcpy(&d_data, &h_data... should be replaced with cudaMemcpy(d_data, h_data....

Allocating a device object with a device pointer as a member is a bit tricky. It can be achieved as follows:

Allocate a temporary host object (MyStruct temp).
Allocate device memory to the member we want on device (cudaMalloc(&temp.a, bytes)).
Allocate device object (cudaMalloc(&d_data, sizeof(MyStruct)).
Copy temporary host object to the device object (cudaMemcpy(d_data, &temp, sizeof(MyStruct), cudaMemcpyHostToDevice)).

Keep in mind that when you modify the contents of d_data->a on the device, temp.a will also be modified because they are actually pointing to same memory location on device.

Your final main function will look something like this:

int main(){
    MyStruct *h_data, *d_data, *out_data;

    size_t structSize = sizeof(MyStruct);
    size_t intSize = sizeof(int);

    h_data = (MyStruct *) malloc(structSize * 1);
    h_data->b = 32;
    h_data->a = (int *)malloc(intSize * h_data->b);

    out_data = (MyStruct *) malloc(structSize * 1);
    out_data->b = 32;
    out_data->a = (int *)malloc(intSize * out_data->b);

    for(int i = 0; i<32; i++){
        h_data->a[i] = i;   
    }

    //Create temporary MyStruct object on host and allocate memory to its member "a" on device
    MyStruct temp;
    temp.b = h_data->b;
    checkCuda(cudaMalloc(&temp.a, 32 * sizeof(int)));

Cudamalloc of a Structure and an Element of Same Structure