Reset CUDA Context After Exception

How can I reset the CUDA error state to success with the Driver API after a trap instruction?

This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function.

There are two types of CUDA runtime errors: "sticky" and "non-sticky". "Non-sticky" errors are those that do not corrupt the context. For example, a cudaMalloc request asking for more than the available memory will fail, but it will not corrupt the context. Such an error is "non-sticky".

Errors that involve unexpected termination of a CUDA kernel (including your trap example, as well as in-kernel assert() failures and runtime-detected execution errors such as out-of-bounds accesses) are "sticky". You cannot clear "sticky" errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset() (which eliminates all device allocations and wipes out the context).

The corresponding driver API function is cuDevicePrimaryCtxReset().
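A minimal runtime-API sketch of the sticky behavior (the exact error strings vary by CUDA version, and whether a post-reset allocation fully recovers can depend on the nature of the fault):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Corrupt the context with a trap instruction.
__global__ void bad_kernel() { asm("trap;"); }

int main() {
    bad_kernel<<<1, 1>>>();

    // The sticky error surfaces at the next API call...
    cudaError_t err = cudaDeviceSynchronize();
    printf("after trap  : %s\n", cudaGetErrorString(err));

    // ...and cudaGetLastError() cannot clear it: subsequent
    // API calls keep failing.
    cudaGetLastError();
    int *p = nullptr;
    printf("while sticky: %s\n", cudaGetErrorString(cudaMalloc(&p, 4)));

    // Only a reset (cuDevicePrimaryCtxReset() in the driver API) clears
    // the sticky state; all previous device allocations are gone.
    cudaDeviceReset();
    printf("after reset : %s\n", cudaGetErrorString(cudaMalloc(&p, 4)));
    if (p) cudaFree(p);
    return 0;
}
```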

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

States of memory data after CUDA exceptions

The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.

This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.

Non-sticky errors are cleared automatically after they are returned by a CUDA API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the cudaMemcpy call that transfers variableA from device to host, so the results of that cudaMemcpy operation are undefined and unreliable; it is as if the cudaMemcpy operation also failed in some unspecified way.

Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.

An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.
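A sketch of that non-sticky case (the requested size just needs to exceed available device memory; the over-sized request here is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int *p = nullptr;

    // Deliberately over-allocate: this returns cudaErrorMemoryAllocation,
    // a non-sticky error that does not corrupt the context.
    cudaError_t err = cudaMalloc(&p, (size_t)1 << 48);  // ~256 TB
    printf("huge alloc : %s\n", cudaGetErrorString(err));

    // The error was cleared when it was returned; a valid request
    // on the same context completes normally.
    err = cudaMalloc(&p, 1024);
    printf("small alloc: %s\n", cudaGetErrorString(err));
    cudaFree(p);
    return 0;
}
```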

This distinction between sticky and non-sticky errors is called out in many of the documented error code descriptions, for example:

non-sticky, non-cuda-context-corrupting:

cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.

sticky, cuda-context-corrupting:

cudaErrorMisalignedAddress = 74
The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
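A sketch that provokes this particular error (whether the fault is reported exactly as cudaErrorMisalignedAddress can depend on the GPU architecture and CUDA version):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Load a 4-byte int through a pointer offset by one byte.
__global__ void misaligned_read(int *p, int *out) {
    int *bad = (int *)((char *)p + 1);  // not 4-byte aligned
    *out = *bad;
}

int main() {
    int *d = nullptr, *out = nullptr;
    cudaMalloc(&d, 16);
    cudaMalloc(&out, 4);
    misaligned_read<<<1, 1>>>(d, out);
    cudaError_t err = cudaDeviceSynchronize();
    // A context-corrupting (sticky) error: the allocations
    // above are now invalid.
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```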


What is the role of cudaDeviceReset() in CUDA

The role of cudaDeviceReset() is documented here.

It is used to destroy a CUDA context, which means that all device allocations are removed.

I agree that it appears to have a synchronizing effect. However, since the documentation states:

Note that this function will reset the device immediately.

I believe it is unsafe to rely on this behavior. Furthermore, the documentation also calls out the hazard of using this function in a multi-threaded app. Therefore, safe coding would dictate:

  1. Use device synchronization (e.g. cudaDeviceSynchronize(), or cudaMemcpy(), etc.)

  2. Retrieve whatever data your application would like to preserve that may be in a device allocation, or that a recently running kernel may have updated (in device memory).

  3. Make sure that any host threads that may also have device activity associated with them are also terminated.

  4. Make sure that any C++ objects that may have device activity in their destructors are properly destroyed or out of scope.

  5. Call cudaDeviceReset() as part of application shut-down.

Note that calling cudaDeviceReset() as part of application shut-down should not be considered mandatory. Many applications will work fine without such an idiom.
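The steps above might be sketched as follows (a minimal single-threaded example; in a real application, steps 3 and 4 involve your own worker threads and C++ objects):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    float *d_data = nullptr, h_data[32];
    cudaMalloc(&d_data, sizeof(h_data));
    work<<<1, 32>>>(d_data);

    // 1. Synchronize so no device activity is outstanding.
    cudaDeviceSynchronize();

    // 2. Retrieve any results still in device memory.
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);

    // 3./4. In a larger app: join any host threads with device activity,
    //       and destroy any C++ objects whose destructors touch the device.
    cudaFree(d_data);

    // 5. Optional: reset the device as the last CUDA call before exit.
    cudaDeviceReset();
    printf("last element: %f\n", h_data[31]);
    return 0;
}
```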

This answer may also be of interest.

Triggering a runtime error within a CUDA kernel

Option 1: Assertions

All presently supported GPUs include an in-kernel assertion mechanism, described here.

Directly from the documentation:

#include <assert.h>

__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This will have no effect
    assert(is_one);

    // This will halt kernel execution
    assert(should_be_one);
}

int main(int argc, char* argv[])
{
    testAssert<<<1,1>>>();
    cudaDeviceSynchronize();

    return 0;
}

There is a dedicated CUDA runtime error, cudaErrorAssert, which will be reported by any kernel that fires an assertion during execution. As with all other device-side runtime errors, the context is corrupted by the error, and a new context will need to be created (by calling cudaDeviceReset()).

Note that it is (unfortunately) not supported on macOS because of driver limitations.
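A host-side check for the dedicated error code might look like this (a sketch; what the application does after the reset is up to you):

```cuda
#include <assert.h>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void testAssert() { assert(0); }  // always fires

int main() {
    testAssert<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();

    if (err == cudaErrorAssert) {
        // The context is corrupted: device allocations are lost and
        // further API calls will keep returning this sticky error.
        fprintf(stderr, "kernel assertion fired: %s\n",
                cudaGetErrorString(err));
        cudaDeviceReset();  // establish a fresh context
    }
    return 0;
}
```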

Option 2: Illegal Instruction

You can use inline PTX and asm("trap;") to trigger an illegal instruction.

Here is some code demonstrating that:

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cstdio>
#include <cstdlib>

__global__ void kernel(int i) {
    if(i > 0) {
        asm("trap;");
    }

    ::printf("%d\n", i);
}

inline void error_check(cudaError_t err, const char* file, int line) {
    if(err != cudaSuccess) {
        ::fprintf(stderr, "CUDA ERROR at %s[%d] : %s\n", file, line, cudaGetErrorString(err));
        abort();
    }
}
#define CUDA_CHECK(err) do { error_check(err, __FILE__, __LINE__); } while(0)

int main() {
    kernel<<<1, 1>>>(0);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    kernel<<<1, 1>>>(1);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
}

which outputs:

0

CUDA ERROR at ...kernel.cu[31] : an illegal instruction was encountered


