Concurrent Writes in the Same Global Memory Location

Concurrent writes in the same global memory location

In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.

If this is a problem, then a better approach (from a correctness point of view) would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be: the more zeros there are, the more attractive the reduction will be. CUDA includes the warp-level __any() and __all() vote operators (now __any_sync() and __all_sync()), which can be built into a very efficient boolean reduction in a few lines of code, as sketched below.
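As a rough illustration of the shared-memory-flag approach, here is a minimal sketch; the predicate (data[idx] == 0), the kernel name, and all parameters are placeholders for whatever condition actually triggers the write in your kernel:

__global__ void flagIfAnyZero(const int* data, int n, int* globalFlag)
{
    __shared__ int blockFlag;

    if (threadIdx.x == 0)
        blockFlag = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] == 0)
        atomicOr(&blockFlag, 1);      // shared-memory flag, set atomically
    __syncthreads();

    // Exactly one thread per block performs the global write.
    if (threadIdx.x == 0 && blockFlag)
        *globalFlag = 1;
}

A warp-level vote with __any_sync() could replace the shared flag and atomic for the boolean-reduction variant mentioned above.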

Is writing a struct to the same global memory location atomic in CUDA?


Is it possible that the final result Point in that location has x value of thread A and y value of thread B?

Yes. To avoid such a scenario you need to write the Point as a single atomic value (i.e., reinterpret the Point as a double or a 64-bit integer and use an atomic set).
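A minimal sketch of that idea, assuming Point is two 32-bit floats (the struct layout, the helper name, and the alignment attribute are assumptions made for the example):

#include <cstring>   // for memcpy, which is also available in device code

struct __align__(8) Point { float x, y; };

__device__ void atomicSetPoint(Point* dst, Point p)
{
    unsigned long long bits;
    memcpy(&bits, &p, sizeof(bits));   // reinterpret the whole Point as one 64-bit value
    atomicExch(reinterpret_cast<unsigned long long*>(dst), bits);
}

The __align__(8) qualifier matters: 64-bit atomics require a naturally aligned address, which a plain struct of two floats does not guarantee on its own.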

Several threads writing the same value in the same global memory location

I have several threads in my kernel that could write the very same value in the same global memory location.

Contrary to some of the comments, this is safe. By safe, I mean that the value written will show up in that global memory location. There is no possibility that a write will not occur, and there is no possibility of some other spurious data corruption. Tom's answer here is applicable to this.

If different values are being written, then one of the values will end up in that location, but which value is undefined.

All blocks read the same global memory section. What is the fastest method?

Once a list element is accessed by the first warp of an SM, the second warp in the same SM gets it from cache and it is broadcast to all SIMT lanes. But a warp on another SM may not have it in L1 cache, so it fetches it from L2 into L1 first.

__constant__ memory behaves similarly, but it requires the same address to be accessed by all threads; its latency is closer to register access. __constant__ memory is like the instruction cache: you get more performance when all threads do the same thing.

For example, if you have a Gaussian filter whose coefficient list is iterated over identically by all threads, it is better to use constant memory. Using shared memory does not have much advantage, since the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or when it needs random access.
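A hedged sketch of that pattern (the kernel name, radius, and border clamping are made up for the example): a 1D filter whose coefficients live in constant memory, so every thread reads the same address on every iteration.

#define KERNEL_RADIUS 8

// Filter coefficients, filled from the host with cudaMemcpyToSymbol().
__constant__ float d_coeffs[2 * KERNEL_RADIUS + 1];

__global__ void gaussianFilter1D(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
        int j = min(max(i + k, 0), n - 1);   // clamp at the borders
        // Every thread reads the same d_coeffs element in the same iteration,
        // which is exactly the broadcast pattern constant memory favors.
        acc += d_coeffs[k + KERNEL_RADIUS] * in[j];
    }
    out[i] = acc;
}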

You can also combine constant memory and shared memory: fetch half of the list from constant memory and the other half from shared memory. This should let 1024 threads hide the latency of one memory type behind the other.
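A sketch of that split, with illustrative sizes and symbol names (the half-and-half division is an assumption, not from the original answer):

#define FILTER_LEN 4096
#define HALF_LEN (FILTER_LEN / 2)

__constant__ float c_coeffs[HALF_LEN];      // first half lives in constant memory
__device__   float g_coeffsTail[HALF_LEN];  // second half, staged into shared memory

__global__ void applyFilterSplit(const float* in, float* out, int n)
{
    __shared__ float s_coeffs[HALF_LEN];

    // Cooperatively stage the second half of the coefficients in shared memory.
    for (int k = threadIdx.x; k < HALF_LEN; k += blockDim.x)
        s_coeffs[k] = g_coeffsTail[k];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + FILTER_LEN > n) return;   // only positions where the full window fits

    float acc = 0.0f;
    for (int k = 0; k < HALF_LEN; ++k) {
        // Alternate the two memory types so the latency of one overlaps the other.
        acc += c_coeffs[k] * in[i + k];
        acc += s_coeffs[k] * in[i + HALF_LEN + k];
    }
    out[i] = acc;
}

The host would fill both halves with cudaMemcpyToSymbol() before launching.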

If the list is small enough, you can use registers directly (the indices have to be known at compile time). But this increases register pressure and may decrease occupancy, so be careful.

Some older CUDA architectures (in the case of the FMA operation) required one operand to be fetched from constant memory and the other from a register to achieve better performance in compute-bottlenecked algorithms.

In a test with 12000 floats as a filter applied to all threads' inputs, the shared-memory version with 128 threads per block completed the work in 330 milliseconds, while the constant-memory version completed it in 260 milliseconds. L1 access was the real bottleneck in both versions, so the real constant-memory performance is even better, as long as all threads access similar indices.

Does OpenCL allow concurrent writes to same memory address?

Did you try to use the cl_khr_global_int32_base_atomics extension and the atom_inc intrinsic function? I would first store the data in an int32 instead of a uchar as a proof of concept, then optimize the memory footprint of the data structures.

cuda: write to same global memory location by several threads

CUDA provides compiler intrinsics for atomic operations. See the CUDA C Programming Guide for additional details on what atomic operations are available for each compute capability.
counters is a pointer to an array of integers of size gridDim.x. Each thread increments the array element indexed by its blockIdx.x.

__global__ void CountThreadsInBlock(int* counters)
{
    // NOTE: Assumes a 1D launch.
    int i = blockIdx.x;
    atomicAdd(&counters[i], 1);
}
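A minimal host-side sketch of launching this kernel (the launch configuration values are illustrative); each counters[b] ends up equal to the block size, because every thread in block b performs exactly one atomicAdd on it:

int main()
{
    const int blocks = 4;
    const int threadsPerBlock = 256;

    int* d_counters;
    cudaMalloc(&d_counters, blocks * sizeof(int));
    cudaMemset(d_counters, 0, blocks * sizeof(int));

    CountThreadsInBlock<<<blocks, threadsPerBlock>>>(d_counters);

    int h_counters[blocks];
    cudaMemcpy(h_counters, d_counters, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    // Each h_counters[b] is now 256: one atomicAdd per thread in block b.
    cudaFree(d_counters);
    return 0;
}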

Writing to the same memory location, is this possible?

Corruption

Irrespective of the system [concurrent or truly parallel], the state of the memory depends on the implementation of the memory device. Generally speaking, memory reads and writes are not atomic, which means multiple concurrent accesses to the same memory address may return inconsistent results [i.e., data corruption].

Imagine two concurrent requests, one write and one read, of a simple integer value. Let's say an integer is 4 bytes, a read takes 2 ns to execute, and a write takes 4 ns to execute:

  • t0, Initial value of underlying 4 byte tuple, [0, 0, 0, 0]
  • t1, Write op begins, write first byte [255, 0, 0, 0]
  • t2, Write op continues, write second byte [255, 255, 0, 0]
  • t2, Read op begins, reads first 2 bytes [255, 255, -, -]
  • t3, Write op continues, write third byte [255, 255, 255, 0]
  • t3, Read op ends, reads last 2 bytes [255, 255, 255, 0]
  • t4, Write op ends, write fourth byte [255, 255, 255, 255]

The value returned by the read is neither the original nor the new value. The value is completely corrupted.

And what it means to you!

Admittedly, that is an incredibly simplified and contrived example, but what possible effect could this have in your scenario? In my opinion, the most vulnerable piece of your diagnostics system is the list of diagnostics data.

If your list is of fixed size, say an array of references to objects, at best you may lose whole objects as array elements are overwritten by competing threads; at worst you segfault if an element contains a corrupted object reference [a la the corruption scenario above].

If your list is dynamic, then it is possible the underlying data structure becomes corrupted [if it is an array, as in .NET's List<>, when it is re-allocated; or if it is a linked list, your next/prev references may become lost or corrupted].

As an aside

Why isn't memory access atomic? For the same reason base collection implementations are not atomic - it would be too restrictive and introduce overhead, effectively penalizing simple scenarios. Therefore it is left to the consumer [us!] to synchronize our own memory accesses.

What happens if concurrent processes write to a global variable the same value?

What happens if concurrent processes write to a global variable the
same value?

The results of a data race are undefined.

Run the Go data race detector.

References:

Wikipedia: Race condition

Benign Data Races: What Could Possibly Go Wrong?

The Go Blog: Introducing the Go Race Detector

Go: Data Race Detector


Go 1.8 Release Notes

Concurrent Map Misuse

In Go 1.6, the runtime added lightweight, best-effort detection of
concurrent misuse of maps. This release improves that detector with
support for detecting programs that concurrently write to and iterate
over a map.

As always, if one goroutine is writing to a map, no other goroutine
should be reading (which includes iterating) or writing the map
concurrently. If the runtime detects this condition, it prints a
diagnosis and crashes the program. The best way to find out more about
the problem is to run the program under the race detector, which will
more reliably identify the race and give more detail.


For example,

package main

import "time"

var linksToVisit = map[string]bool{}

func main() {
	someLink := "someLink"
	go func() {
		for {
			linksToVisit[someLink] = true
		}
	}()
	go func() {
		for {
			linksToVisit[someLink] = true
		}
	}()
	time.Sleep(100 * time.Millisecond)
}

Output:

$ go run racer.go
fatal error: concurrent map writes
$

$ go run -race racer.go

==================
WARNING: DATA RACE
Write at 0x00c000078060 by goroutine 6:
runtime.mapassign_faststr()
/home/peter/go/src/runtime/map_faststr.go:190 +0x0
main.main.func2()
/home/peter/gopath/src/racer.go:16 +0x6a

Previous write at 0x00c000078060 by goroutine 5:
runtime.mapassign_faststr()
/home/peter/go/src/runtime/map_faststr.go:190 +0x0
main.main.func1()
/home/peter/gopath/src/racer.go:11 +0x6a

Goroutine 6 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:14 +0x88

Goroutine 5 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:9 +0x5b
==================

fatal error: concurrent map writes

$


