__ldg causes slower execution time in certain situations

The reason the version using __ldg is slower is that the NVCC compiler is not able to perform loop-unrolling optimizations correctly in this particular scenario. The issue was submitted to NVIDIA with ID 1605303. The most recent response from the NVIDIA team is as follows:

Although we have not communicated this to you, we have done prior investigations into your issue. The solution to your issue is improving our loop-unrolling heuristic in the back-end compiler - the compiler embedded inside of ptxas. We evaluated the possibility of resolving this issue in CUDA 8.0, but an initial solution that fixed your problem caused unacceptable regressions. Because of other constraints, we were not able to develop a proper solution in time for CUDA 8.0.

We are actively working to resolve your issue in a future CUDA release (the one following CUDA 8.0). We will ensure that we keep you informed of our progress moving forward.
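
For reference, the pattern in question looks roughly like this (a generic sketch, not the actual kernel from the bug report; the names are made up for illustration):

__global__ void dot_rows(const double *a, const double *b, double *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double accum = 0.0;
        for (int k = 0; k < n; ++k) {
            // __ldg() pulls the operands through the read-only data cache;
            // this is the variant that tripped the ptxas loop-unrolling heuristic.
            accum += __ldg(&a[(size_t)row * n + k]) * __ldg(&b[k]);
            // Plain-load variant: accum += a[(size_t)row * n + k] * b[k];
        }
        out[row] = accum;
    }
}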

CUDA __device__ function as class member: Inlining and performance?

The GPU compiler will aggressively inline functions for performance reasons. When a function is inlined, there should be no particular impact on performance.

If a function cannot be inlined, then the usual performance overhead would occur, involving the creation of a stack frame and a call to the function - just as you would observe on a CPU call to a non-inlined function.

If you have concerns about a specific example, you can create a short test code and look at the generated assembly language (SASS) by using cuobjdump -sass myexe and determine whether or not the function was inlined.

There are no general restrictions on inlining of __device__ functions that are class members/methods.
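
As a quick illustration (a made-up test case, not from any particular question), something like the following can be compiled and then inspected with cuobjdump -sass; if no call instructions (CAL/CALL) appear around the arithmetic, the member function was inlined:

// Hypothetical test case: a __device__ member function called from a kernel.
// Build with:    nvcc -arch=sm_70 -o test test.cu
// Inspect with:  cuobjdump -sass test
struct Scaler {
    double factor;
    __device__ double apply(double x) const { return x * factor; }
};

__global__ void scale_all(const double *in, double *out, int n, Scaler s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s.apply(in[i]);  // expected to inline down to a single multiply
}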

CUDA Matrix Multiply on Fortran is slower than C

First of all, I suggest that performance questions include complete codes. I generally need to be able to run stuff, and you can save me some typing. Sure, you can leave stuff out. Sure, I can probably figure out what it is. But I'm less likely to help you that way, and I suspect I'm not alone in that view. My advice: Make it easy for others to help you. I've given examples of what would be useful below.

On to the question:

The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.

TL;DR: Your claim ("that should not be a problem") is evidently not supportable. The difference between a 1D allocation and a 2D allocation matters, not only from a storage perspective but also from an index-calculation perspective. If you're sensitive to the length of this answer, skip to note D at the bottom of this post.

Details:

When we have a loop like this:

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

If the compiler can discover that the indexing calculation(s) can simply be "incremented" as the loop proceeds, this will be fairly simple. OTOH, if the indexing must be resolved by redoing a 2D calculation (e.g. the row*width+column computation referenced previously) at each loop iteration, this will be more complicated. An important factor here is also whether the array width is usable/used by the compiler at compile time.
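
To illustrate the two shapes (a rough sketch with hypothetical helper functions, not the poster's code or actual compiler output):

// Width known at compile time: the indexing strength-reduces to fixed offsets
// that can be folded directly into the load instructions of the unrolled loop.
__device__ double dot_static(const double *row, const double *col)
{
    const int width = 16384;               // compile-time constant, like SIZE
    double accum = 0.0;
    for (int k = 0; k < width; ++k)
        accum += row[k] * col[(size_t)k * width];
    return accum;
}

// Width known only at run time (e.g. carried in array metadata): every
// iteration needs integer arithmetic (the extra IADD3/IMAD seen in the SASS)
// to rebuild the address before the load can issue.
__device__ double dot_dynamic(const double *mat, const double *matT,
                              int tx, int ty, int width)
{
    double accum = 0.0;
    for (int k = 0; k < width; ++k)
        accum += mat[(size_t)tx * width + k] * matT[(size_t)k * width + ty];
    return accum;
}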

The more complicated case appears to be what is happening here, and I suggest it accounts for the difference in timing. We'll consider two proof points for this.

  1. SASS analysis

If we compare the SASS between realizations, we see some important differences. In order to do this, we'll need full examples:

CUDA Fortran:

$ cat t11.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!

Contains

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

  integer :: tx, ty, k
  REAL(DP) :: accum
  REAL(DP), dimension(:,:) :: d_mat, d_matT, d_matSym

  tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

  if (tx <= SIZE_ .and. ty <= SIZE_) then
    accum = 0.0
    do k = 1, SIZE_
      accum = accum + d_mat(tx,k) * d_matT(k,ty)
    end do
    d_matSym(tx,ty) = accum
  end if
end subroutine matrixMultiply

end module mmk

PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk

!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_,SIZE_),d_matT(SIZE_,SIZE_),d_matSym(SIZE_,SIZE_))
!
! invoke the kernel
!

!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11.cuf -o t11
$ nvprof ./t11
==38508== NVPROF is profiling process 38508, command: ./t11
==38508== Profiling application: ./t11
==38508== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 12.7168s 2 6.35838s 6.35229s 6.36447s mmk_matrixmultiply_
0.00% 9.2480us 6 1.5410us 1.3120us 2.1760us [CUDA memcpy HtoD]
API calls: 97.08% 12.7260s 9 1.41400s 10.596us 6.36561s cudaFree
2.85% 374.24ms 9 41.583ms 3.5780us 369.55ms cudaMalloc
0.03% 4.5190ms 4 1.1297ms 1.0551ms 1.2608ms cuDeviceTotalMem
0.02% 2.5060ms 404 6.2020us 111ns 1.2057ms cuDeviceGetAttribute
0.01% 1.1501ms 4 287.52us 33.014us 1.0465ms cuDeviceGetName
0.00% 120.00us 6 20.000us 5.1530us 62.523us cudaMemcpy
0.00% 99.593us 2 49.796us 44.275us 55.318us cudaLaunchKernel
0.00% 43.414us 1 43.414us 43.414us 43.414us cudaDeviceSynchronize
0.00% 11.563us 4 2.8900us 1.4400us 5.5870us cuDeviceGetPCIBusId
0.00% 1.7030us 8 212ns 124ns 580ns cuDeviceGet
0.00% 901ns 3 300ns 163ns 534ns cuDeviceGetCount
0.00% 855ns 4 213ns 183ns 263ns cuDeviceGetUuid
$

NVIDIA HPC SDK, 2021.2, Tesla V100

We see that the execution duration is about 6.3 seconds, in line with your report.

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global indices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
    //Call
    dim3 grid_dim(SIZE/32, SIZE/32, 1);
    dim3 blk_dim(32, 32, 1);
    double *d_mat, *d_matT, *d_matSym;
    cudaMalloc(&d_mat, SIZE*SIZE*sizeof(double));
    cudaMalloc(&d_matT, SIZE*SIZE*sizeof(double));
    cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
    matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
    matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
    cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 11.0379s 2 5.51893s 5.50503s 5.53282s matrixMultiply(double*, double*, double*)
API calls: 97.80% 11.0379s 1 11.0379s 11.0379s 11.0379s cudaDeviceSynchronize
2.15% 242.74ms 3 80.914ms 1.9747ms 238.78ms cudaMalloc
0.04% 4.3105ms 4 1.0776ms 1.0728ms 1.0890ms cuDeviceTotalMem
0.01% 1.4584ms 404 3.6090us 110ns 163.21us cuDeviceGetAttribute
0.00% 142.28us 4 35.571us 32.475us 41.980us cuDeviceGetName
0.00% 51.695us 2 25.847us 7.0790us 44.616us cudaLaunchKernel
0.00% 11.198us 4 2.7990us 1.6260us 5.5950us cuDeviceGetPCIBusId
0.00% 1.7930us 3 597ns 149ns 1.2200us cuDeviceGetCount
0.00% 1.6590us 8 207ns 120ns 704ns cuDeviceGet
0.00% 733ns 4 183ns 161ns 215ns cuDeviceGetUuid
$

CUDA 11.2, same V100 machine

Again we see ~5.5 sec kernel duration, lining up with your report. Now let's compare the SASS. We use the cuobjdump utility to extract it, and in both cases it is evident that the compiler is unrolling the kernel loop in some fashion. The key operation is DFMA, Double-precision Fused Multiply-Add, which is issued once per loop iteration to compute a single product (and keep a running sum). Therefore, I will extract just a short sequence of DFMA operations in each case, in the midst of the unrolled loop, to get an idea of the per-FLOP "cost" and "overhead".

CUDA Fortran:

/*1670*/                   LDG.E.64.SYS R18, [R12+0x1b8] ;                 /* 0x0001b8000c127381 */
/* 0x000ea200001eeb00 */
/*1680*/ IADD3 R20, P0, R16, R29, RZ ; /* 0x0000001d10147210 */
/* 0x001fc60007f1e0ff */
/*1690*/ LDG.E.64.SYS R22, [R12+0x1b0] ; /* 0x0001b0000c167381 */
/* 0x000e6400001eeb00 */
/*16a0*/ IMAD.X R21, R17, 0x1, R27, P0 ; /* 0x0000000111157824 */
/* 0x000fe400000e061b */
/*16b0*/ LDG.E.64.SYS R24, [R16] ; /* 0x0000000010187381 */
/* 0x00006c00001eeb00 */
/*16c0*/ LDG.E.64.SYS R16, [R20] ; /* 0x0000000014107381 */
/* 0x001ea200001eeb00 */
/*16d0*/ DFMA R22, R24, R22, R14 ; /* 0x000000161816722b */
/* 0x002084000000000e */
/*16e0*/ IADD3 R14, P0, R20, R29, RZ ; /* 0x0000001d140e7210 */
/* 0x001fca0007f1e0ff */
/*16f0*/ IMAD.X R15, R21, 0x1, R27, P0 ; /* 0x00000001150f7824 */
/* 0x000fe200000e061b */
/*1700*/ DFMA R16, R16, R18, R22 ; /* 0x000000121010722b */

The above sequence repeats within the unrolled loop area. We note that there are 2 DFMA operations, 4 LDG operations (which makes sense - two loads per multiply), and the remaining 4 instructions presumably constitute indexing updates as the loop proceeds through its iterations.

CUDA C++:

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
/* 0x001f2800001eeb00 */
/*02e0*/ LDG.E.64.SYS R24, [R4+0x10] ; /* 0x0000100004187381 */
/* 0x000f2200001eeb00 */
/*02f0*/ DFMA R10, R20, R10, R22 ; /* 0x0000000a140a722b */
/* 0x0200860000000016 */
/*0300*/ LDG.E.64.SYS R22, [R6+0x60000] ; /* 0x0600000006167381 */
/* 0x001f6800001eeb00 */
/*0310*/ LDG.E.64.SYS R20, [R4+0x18] ; /* 0x0000180004147381 */
/* 0x000f6200001eeb00 */
/*0320*/ DFMA R14, R8, R14, R10 ; /* 0x0000000e080e722b */

Again, that sequence repeats. We see 2 DFMA and 4 LDG instructions, but no other instructions in the repeating sequence. We note that the LDG instructions have offsets that can be incorporated into the instruction at compile time, and these offsets eliminate the need for any extra instructions, because they merely "increment" previously computed addresses. (We also note that the "increments" appear to be column-wise offsets in one case and row-wise offsets in the other, just as we would expect for picking up the multiplication operands in the kernel loop.)


  2. Refactoring

Can we refactor the Fortran code to make it work like the C++ code? Yes:

$ cat t11a.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!

Contains

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

  integer :: tx, ty, k
  REAL(DP) :: accum
  REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

  tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

  if (tx <= SIZE_ .and. ty <= SIZE_) then
    accum = 0.0
    do k = 1, SIZE_
      accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
    end do
    d_matSym((ty-1)*SIZE_+tx) = accum
  end if
end subroutine matrixMultiply

end module mmk

PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk

!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
!
! invoke the kernel
!

!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 10.8169s 2 5.40847s 5.39118s 5.42576s mmk_matrixmultiply_
0.00% 9.2160us 6 1.5360us 1.3120us 2.1440us [CUDA memcpy HtoD]
API calls: 96.72% 10.8240s 9 1.20266s 12.102us 5.42594s cudaFree
3.22% 360.34ms 9 40.038ms 3.2760us 355.64ms cudaMalloc
0.04% 4.3488ms 4 1.0872ms 1.0598ms 1.1593ms cuDeviceTotalMem
0.02% 2.3633ms 404 5.8490us 111ns 869.22us cuDeviceGetAttribute
0.00% 144.50us 4 36.125us 33.254us 41.317us cuDeviceGetName
0.00% 106.43us 6 17.737us 4.7910us 51.453us cudaMemcpy
0.00% 101.46us 2 50.732us 44.479us 56.985us cudaLaunchKernel
0.00% 20.926us 1 20.926us 20.926us 20.926us cudaDeviceSynchronize
0.00% 7.2850us 4 1.8210us 491ns 4.6210us cuDeviceGetPCIBusId
0.00% 1.7650us 8 220ns 118ns 672ns cuDeviceGet
0.00% 926ns 4 231ns 218ns 264ns cuDeviceGetUuid
0.00% 614ns 3 204ns 130ns 310ns cuDeviceGetCount
$

And we see that the performance is approximately the same as the CUDA C++ version.

Notes:

A. I've omitted the SASS analysis for the final variant; that exercise is left to the reader. However, the unrolled loop portion is a repeating sequence of LDG-LDG-DFMA, just as witnessed in the CUDA C++ version.

B. You might have two questions: 1. Why is it this way? I've tried to answer that. 2. Should it be this way / does it need to be this way? I have not answered that; I believe the answer involves statements about Fortran that I don't wish to make, as well as questions for the compiler engineers. If you think the 2D realization should match the performance of the 1D realization, I suggest filing a bug.

C. It took a little while of head-scratching before I noticed you had done this in the C++ case but not the Fortran case:

int tx = blockIdx.y * blockDim.y + threadIdx.y;
    ^             ^            ^             ^

D. One of the reasons why I think this discrepancy is not trivially resolvable at compile time is that the compiler's 2D array handling in CUDA Fortran (whether needed or not) seems to involve the creation of a run-time metadata package which includes the array width, perhaps amongst other information. This metadata package is transmitted to the device separately from the array data payload - this much can be surmised from the profiler output above, or seen more clearly by running with --print-gpu-trace. In your CUDA C++ case, the array width (SIZE) is effectively known at compile time. In the Fortran case, it is not (or alternatively, if it is, the compiler does not seem to be propagating this knowledge through the kernel code generation). You can see that the "improved" addressing mode above relies on offsets that are known at compile time. Whether or not this "should" be resolvable with some more careful treatment of the Fortran, I cannot say. However, I would expect that some sort of contiguous declaration/decoration by itself is not sufficient to resolve this discrepancy for the compiler: that declaration alone would not obviate the need for on-the-fly run-time indexing calculations based on the array width.

Share large constant data among CUDA threads

In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:

  1. It is not limited to 64kB like __constant__ memory is.
  2. Unlike the constant cache, it does not expect or require "uniform access" in order to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp access the same location or same constant memory value (per read cycle/instruction).

The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
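
A minimal sketch of that decoration (the kernel and names here are made up for illustration; the key part is the const ... __restrict__ on the pointer to the large, read-only data):

// Hypothetical kernel: 'table' is the large constant data shared by all threads.
// const + __restrict__ lets the compiler emit LDG (read-only cache) loads for it.
__global__ void apply_table(const float * __restrict__ table,
                            const int   * __restrict__ idx,
                            float       * __restrict__ out,
                            int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];   // non-uniform access across the warp is fine here
}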

This read-only cache mechanism is only supported on cc 3.5 or higher GPUs, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 ti), Volta, and Turing generations.

If you have a GPU with compute capability less than 3.5, possibly the best suggestion for similar benefits (larger than __constant__, no need for uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide, there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag cover it as well.
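
For completeness, here is a sketch of the texture-object route (names made up for illustration; texture objects require cc 3.0 or higher, so older devices would need the legacy texture reference API instead):

#include <cuda_runtime.h>

// Read a large, read-only array through the texture cache.
__global__ void read_tex(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);   // cached, read-only fetch
}

// Create a texture object over an existing linear device allocation.
cudaTextureObject_t make_tex(const float *d_data, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float *>(d_data);
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}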

CUDA clock() leads to zero clock cycles

Quoting from one of the answers you linked to in your question:

You should also be aware that the compiler and assembler do perform
instruction re-ordering so you might want to check that the clock
calls don't wind up getting put next to each other in the SASS output
(use cuobjdump to check).

I believe this is the source of your problem. If I compile your kernel with the CUDA 8 release toolkit and then disassemble the resulting machine code with cuobjdump, I get the following:

    code for sm_52
Function : _Z5saxpyifPfS_Pi
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { CS2R R7, SR_CLOCKLO; /* 0x50c8000005070007 */
/*0018*/ S2R R0, SR_CTAID.X; } /* 0xf0c8000002570000 */
/* 0x083fc400e3e007f0 */
/*0028*/ { CS2R R8, SR_CLOCKLO; /* 0x50c8000005070008 */
/*0030*/ S2R R2, SR_TID.X; } /* 0xf0c8000002170002 */
/*0038*/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ; /* 0x4f107f8000270003 */
/* 0x081fc400fec207f6 */
/*0048*/ XMAD R2, R0.reuse, c[0x0] [0x8], R2; /* 0x4e00010000270002 */
/*0050*/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2; /* 0x5b30011800370000 */
/*0058*/ ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT; /* 0x4b6d038005070007 */
/* 0x001fd400fc2007ec */
/*0068*/ SHR R9, R0, 0x1f; /* 0x3829000001f70009 */
/*0070*/ @!P0 SHF.L.U64 R2, RZ, 0x2, R0; /* 0x36f800400028ff02 */
/*0078*/ @!P0 SHF.L.U64 R3, R0, 0x2, R9; /* 0x36f804c000280003 */
/* 0x001fc040fe4207f6 */
/*0088*/ @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148]; /* 0x4c10800005280204 */
/*0090*/ @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c]; /* 0x4c10080005380305 */
/*0098*/ { @!P0 IADD R2.CC, R2, c[0x0][0x150]; /* 0x4c10800005480202 */
/*00a8*/ @!P0 LDG.E R4, [R4]; } /* 0x0005c400fe400076 */
/* 0xeed4200000080404 */
/*00b0*/ @!P0 IADD.X R3, R3, c[0x0][0x154]; /* 0x4c10080005580303 */
/*00b8*/ @!P0 LDG.E R6, [R2]; /* 0xeed4200000080206 */
/* 0x001fd800fea007e1 */
/*00c8*/ LEA R10.CC, R0, c[0x0][0x158], 0x2; /* 0x4bd781000567000a */
/*00d0*/ IADD R8, -R7, R8; /* 0x5c12000000870708 */
/*00d8*/ LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2; /* 0x1a17048005770009 */
/* 0x001fc008fe4007f1 */
/*00e8*/ MOV R7, R9; /* 0x5c98078000970007 */
/*00f0*/ @!P0 FFMA R0, R4, c[0x0][0x144], R6; /* 0x4980030005180400 */
/*00f8*/ { MOV R6, R10; /* 0x5c98078000a70006 */
/*0108*/ @!P0 STG.E [R2], R0; } /* 0x001ffc005e2001f2 */
/* 0xeedc200000080200 */
/*0110*/ STG.E [R6], R8; /* 0xeedc200000070608 */
/*0118*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0128*/ BRA 0x120; /* 0xe2400fffff07000f */
/*0130*/ NOP; /* 0x50b0000000070f00 */
/*0138*/ NOP; /* 0x50b0000000070f00 */
.................................

You can see that the clock instructions have been reordered so that they are issued without any code in between them. That will result in a zero, or very close to zero, clock measurement for many, if not all, warps running this code.
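
A common mitigation (sketched below with a hypothetical kernel, and not a guaranteed fix) is to make the timed work produce a value that is stored to global memory, and then to verify in the SASS that the two clock reads actually bracket the instructions you care about:

// Sketch of a timing pattern: the timed work feeds a value that is written out,
// so the compiler cannot discard it, and the elapsed count is stored per thread.
// Even so, check the SASS (cuobjdump -sass) to confirm the two clock reads were
// not scheduled back to back.
__global__ void timed_saxpy(int n, float a, const float *x, float *y,
                            long long *cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        long long start = clock64();
        float r = a * x[i] + y[i];   // the work being timed
        y[i] = r;                    // result is consumed before 'stop' is read
        long long stop = clock64();
        cycles[i] = stop - start;
    }
}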

PHP form code sometimes being executed twice simultaneously

See a client-side (jQuery) solution here: Prevent double submission of forms in jQuery

And a PHP one here:
http://phpsense.com/2006/prevent-duplicate-form-submission/

Improving the Efficiency of Compact/Scatter in CUDA

I found that the algorithm mentioned in this poster (a similar algorithm is also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I made a few tweaks to the poster's algorithm: 1) eliminating the final warp-shuffle reduction in phase 1, since the elements can simply be summed as they are calculated; 2) giving the function the ability to work on more than just arrays sized as a multiple of 1024, plus adding grid-strided loops; and 3) allowing each thread to load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks I can make, but for now this is good.

//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt = 0; //per-warp total, accumulated by lane 0 only

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;

        bool pred;
        if(index >= true_length) pred = false; //>= guards the out-of-bounds read at index == true_length
        else pred = predicate(input[index]);
        mask = __ballot(pred); //__ballot_sync on CUDA 9+

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}

//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;

for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
    #pragma unroll
    for(int offset = 1; offset < 32; offset <<= 1){
        unsigned int n = __shfl_up(cnt, offset); //__shfl_up_sync on CUDA 9+
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i; __shfl_sync on CUDA 9+
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
        }
    }
}
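
For reference, phase 2 is just a device-wide inclusive scan of the per-warp counts; with CUB that is roughly the following (a sketch assuming counter, scan_Index, and num_warps correspond to the arrays used in the kernels above):

#include <cub/cub.cuh>

// Kernel phase 2 with CUB: inclusive scan of the per-warp counts (counter)
// into the running offsets (scan_Index) consumed by phase 3.
void run_phase2(unsigned int *counter, unsigned int *scan_Index, int num_warps)
{
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call only queries the required temporary storage size.
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, counter, scan_Index, num_warps);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, counter, scan_Index, num_warps);
    cudaFree(d_temp);
}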

EDIT: There is a newer article by a subset of the poster authors in which they examine a faster variation of compaction than what is written above. However, their new version is not order-preserving, so it is not useful for me, and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.


