__ldg causes slower execution time in certain situations

The reason the version using __ldg is slower is that the NVCC compiler is not able to perform loop-unrolling optimizations correctly in this particular scenario. The issue was submitted to NVIDIA with ID 1605303. The most recent response from the NVIDIA team is as follows:

Although we have not communicated this to you, we have done prior investigations into your issue. The solution to your issue is improving our loop-unrolling heuristic in the back-end compiler - the compiler embedded inside of ptxas. We evaluated the possibility of resolving this issue in CUDA 8.0, but an initial solution that fixed your problem caused unacceptable regressions. Because of other constraints, we were not able to develop a proper solution in time for CUDA 8.0.

We are actively working to resolve your issue in a future CUDA release (the one following CUDA 8.0). We will ensure that we keep you informed of our progress moving forward.
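
For reference, the pattern in question looks roughly like this (a generic sketch, not the actual kernel from the bug report; the names are made up for illustration):

__global__ void dot_rows(const double *a, const double *b, double *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double accum = 0.0;
        for (int k = 0; k < n; ++k) {
            // __ldg() pulls the operands through the read-only data cache;
            // this is the variant that tripped the ptxas loop-unrolling heuristic.
            accum += __ldg(&a[(size_t)row * n + k]) * __ldg(&b[k]);
            // Plain-load variant: accum += a[(size_t)row * n + k] * b[k];
        }
        out[row] = accum;
    }
}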

CUDA __device__ function as class member: Inlining and performance?

The GPU compiler will aggressively inline functions for performance reasons. When a function is inlined, there should be no particular impact on performance.

If a function cannot be inlined, then the usual performance overhead would occur, involving the creation of a stack frame and a call to the function - just as you would observe on a CPU call to a non-inlined function.

If you have concerns about a specific example, you can create a short test code and look at the generated assembly language (SASS) by using cuobjdump -sass myexe and determine whether or not the function was inlined.

There are no general restrictions on inlining of __device__ functions that are class members/methods.
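
As a quick illustration (a made-up test case, not from any particular question), something like the following can be compiled and then inspected with cuobjdump -sass; if no call instructions (CAL/CALL) appear around the arithmetic, the member function was inlined:

// Hypothetical test case: a __device__ member function called from a kernel.
// Build with:    nvcc -arch=sm_70 -o test test.cu
// Inspect with:  cuobjdump -sass test
struct Scaler {
    double factor;
    __device__ double apply(double x) const { return x * factor; }
};

__global__ void scale_all(const double *in, double *out, int n, Scaler s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s.apply(in[i]);  // expected to inline down to a single multiply
}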

CUDA Matrix Multiply on Fortran is slower than C

First of all, I suggest that performance questions include complete codes. I generally need to be able to run stuff, and you can save me some typing. Sure, you can leave stuff out. Sure, I can probably figure out what it is. But I'm less likely to help you that way, and I suspect I'm not alone in that view. My advice: Make it easy for others to help you. I've given examples of what would be useful below.

On to the question:

The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.

TL;DR: Your claim ("that should not be a problem") is evidently not supportable. The difference between a 1D allocation and a 2D allocation matters, not only from a storage perspective but also from an index-calculation perspective. If you're sensitive to the length of this answer, skip to note D at the bottom of this post.

Details:

When we have a loop like this:

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

If the compiler can discover that the indexing calculation(s) can simply be "incremented" as the loop proceeds, this will be fairly simple. OTOH, if the indexing must be resolved by redoing a 2D calculation (e.g. the row*width+column computation referenced previously) at each loop iteration, this will be more complicated. An important factor here is also whether the array width is usable/used by the compiler at compile time.
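
To illustrate the two shapes (a rough sketch with hypothetical helper functions, not the poster's code or actual compiler output):

// Width known at compile time: the indexing strength-reduces to fixed offsets
// that can be folded directly into the load instructions of the unrolled loop.
__device__ double dot_static(const double *row, const double *col)
{
    const int width = 16384;               // compile-time constant, like SIZE
    double accum = 0.0;
    for (int k = 0; k < width; ++k)
        accum += row[k] * col[(size_t)k * width];
    return accum;
}

// Width known only at run time (e.g. carried in array metadata): every
// iteration needs integer arithmetic (the extra IADD3/IMAD seen in the SASS)
// to rebuild the address before the load can issue.
__device__ double dot_dynamic(const double *mat, const double *matT,
                              int tx, int ty, int width)
{
    double accum = 0.0;
    for (int k = 0; k < width; ++k)
        accum += mat[(size_t)tx * width + k] * matT[(size_t)k * width + ty];
    return accum;
}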

The more complicated case appears to be what is happening here, and I suggest it accounts for the difference in timing. We'll consider two proof points for this.

  1. SASS analysis

If we compare the SASS between realizations, we see some important differences. In order to do this, we'll need full examples:

CUDA Fortran:

$ cat t11.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!

Contains

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

  integer :: tx, ty, k
  REAL(DP) :: accum
  REAL(DP), dimension(:,:) :: d_mat, d_matT, d_matSym

  tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

  if (tx <= SIZE_ .and. ty <= SIZE_) then
    accum = 0.0
    do k = 1, SIZE_
      accum = accum + d_mat(tx,k) * d_matT(k,ty)
    end do
    d_matSym(tx,ty) = accum
  end if
end subroutine matrixMultiply

end module mmk

PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk

!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_,SIZE_),d_matT(SIZE_,SIZE_),d_matSym(SIZE_,SIZE_))
!
! invoke the kernel
!

!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11.cuf -o t11
$ nvprof ./t11
==38508== NVPROF is profiling process 38508, command: ./t11
==38508== Profiling application: ./t11
==38508== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 12.7168s 2 6.35838s 6.35229s 6.36447s mmk_matrixmultiply_
0.00% 9.2480us 6 1.5410us 1.3120us 2.1760us [CUDA memcpy HtoD]
API calls: 97.08% 12.7260s 9 1.41400s 10.596us 6.36561s cudaFree
2.85% 374.24ms 9 41.583ms 3.5780us 369.55ms cudaMalloc
0.03% 4.5190ms 4 1.1297ms 1.0551ms 1.2608ms cuDeviceTotalMem
0.02% 2.5060ms 404 6.2020us 111ns 1.2057ms cuDeviceGetAttribute
0.01% 1.1501ms 4 287.52us 33.014us 1.0465ms cuDeviceGetName
0.00% 120.00us 6 20.000us 5.1530us 62.523us cudaMemcpy
0.00% 99.593us 2 49.796us 44.275us 55.318us cudaLaunchKernel
0.00% 43.414us 1 43.414us 43.414us 43.414us cudaDeviceSynchronize
0.00% 11.563us 4 2.8900us 1.4400us 5.5870us cuDeviceGetPCIBusId
0.00% 1.7030us 8 212ns 124ns 580ns cuDeviceGet
0.00% 901ns 3 300ns 163ns 534ns cuDeviceGetCount
0.00% 855ns 4 213ns 183ns 263ns cuDeviceGetUuid
$

NVIDIA HPC SDK, 2021.2, Tesla V100

We see that the execution duration is about 6.3 seconds, in line with your report.

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global indices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
    //Call
    dim3 grid_dim(SIZE/32, SIZE/32, 1);
    dim3 blk_dim(32, 32, 1);
    double *d_mat, *d_matT, *d_matSym;
    cudaMalloc(&d_mat, SIZE*SIZE*sizeof(double));
    cudaMalloc(&d_matT, SIZE*SIZE*sizeof(double));
    cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
    matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
    matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
    cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 11.0379s 2 5.51893s 5.50503s 5.53282s matrixMultiply(double*, double*, double*)
API calls: 97.80% 11.0379s 1 11.0379s 11.0379s 11.0379s cudaDeviceSynchronize
2.15% 242.74ms 3 80.914ms 1.9747ms 238.78ms cudaMalloc
0.04% 4.3105ms 4 1.0776ms 1.0728ms 1.0890ms cuDeviceTotalMem
0.01% 1.4584ms 404 3.6090us 110ns 163.21us cuDeviceGetAttribute
0.00% 142.28us 4 35.571us 32.475us 41.980us cuDeviceGetName
0.00% 51.695us 2 25.847us 7.0790us 44.616us cudaLaunchKernel
0.00% 11.198us 4 2.7990us 1.6260us 5.5950us cuDeviceGetPCIBusId
0.00% 1.7930us 3 597ns 149ns 1.2200us cuDeviceGetCount
0.00% 1.6590us 8 207ns 120ns 704ns cuDeviceGet
0.00% 733ns 4 183ns 161ns 215ns cuDeviceGetUuid
$

CUDA 11.2, same V100 machine

Again we see ~5.5 sec kernel duration, lining up with your report. Now let's compare the SASS. We use the cuobjdump utility to extract it, and in both cases it is evident that the compiler is unrolling the kernel loop in some fashion. The key operation is DFMA, Double-precision Fused Multiply-Add, which is issued once per loop iteration to compute a single product (and keep a running sum). Therefore, I will extract just a short sequence of DFMA operations in each case, in the midst of the unrolled loop, to get an idea of the per-FLOP "cost" and "overhead".

CUDA Fortran:

/*1670*/                   LDG.E.64.SYS R18, [R12+0x1b8] ;                 /* 0x0001b8000c127381 */
/* 0x000ea200001eeb00 */
/*1680*/ IADD3 R20, P0, R16, R29, RZ ; /* 0x0000001d10147210 */
/* 0x001fc60007f1e0ff */
/*1690*/ LDG.E.64.SYS R22, [R12+0x1b0] ; /* 0x0001b0000c167381 */
/* 0x000e6400001eeb00 */
/*16a0*/ IMAD.X R21, R17, 0x1, R27, P0 ; /* 0x0000000111157824 */
/* 0x000fe400000e061b */
/*16b0*/ LDG.E.64.SYS R24, [R16] ; /* 0x0000000010187381 */
/* 0x00006c00001eeb00 */
/*16c0*/ LDG.E.64.SYS R16, [R20] ; /* 0x0000000014107381 */
/* 0x001ea200001eeb00 */
/*16d0*/ DFMA R22, R24, R22, R14 ; /* 0x000000161816722b */
/* 0x002084000000000e */
/*16e0*/ IADD3 R14, P0, R20, R29, RZ ; /* 0x0000001d140e7210 */
/* 0x001fca0007f1e0ff */
/*16f0*/ IMAD.X R15, R21, 0x1, R27, P0 ; /* 0x00000001150f7824 */
/* 0x000fe200000e061b */
/*1700*/ DFMA R16, R16, R18, R22 ; /* 0x000000121010722b */

The above sequence repeats within the unrolled loop area. We note that there are 2 DFMA operations, 4 LDG operations (which makes sense - two loads per multiply), and the remaining 4 instructions presumably constitute indexing updates as the loop proceeds through its iterations.

CUDA C++:

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
/* 0x001f2800001eeb00 */
/*02e0*/ LDG.E.64.SYS R24, [R4+0x10] ; /* 0x0000100004187381 */
/* 0x000f2200001eeb00 */
/*02f0*/ DFMA R10, R20, R10, R22 ; /* 0x0000000a140a722b */
/* 0x0200860000000016 */
/*0300*/ LDG.E.64.SYS R22, [R6+0x60000] ; /* 0x0600000006167381 */
/* 0x001f6800001eeb00 */
/*0310*/ LDG.E.64.SYS R20, [R4+0x18] ; /* 0x0000180004147381 */
/* 0x000f6200001eeb00 */
/*0320*/ DFMA R14, R8, R14, R10 ; /* 0x0000000e080e722b */

Again, that sequence repeats. We see 2 DFMA and 4 LDG instructions, but no other instructions in the repeating sequence. We note that the LDG instructions have offsets that can be incorporated into the instruction at compile time, and these offsets eliminate the need for any extra instructions, because they merely "increment" previously computed addresses. (We also note that the "increments" appear to be column-wise offsets in one case and row-wise offsets in the other, just as we would expect for picking up the multiplication operands in the kernel loop.)


  2. Refactoring

Can we refactor the Fortran code to make it work like the C++ code? Yes:

$ cat t11a.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!

Contains

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

  integer :: tx, ty, k
  REAL(DP) :: accum
  REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

  tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

  if (tx <= SIZE_ .and. ty <= SIZE_) then
    accum = 0.0
    do k = 1, SIZE_
      accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
    end do
    d_matSym((ty-1)*SIZE_+tx) = accum
  end if
end subroutine matrixMultiply

end module mmk

PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk

!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
!
! invoke the kernel
!

!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 10.8169s 2 5.40847s 5.39118s 5.42576s mmk_matrixmultiply_
0.00% 9.2160us 6 1.5360us 1.3120us 2.1440us [CUDA memcpy HtoD]
API calls: 96.72% 10.8240s 9 1.20266s 12.102us 5.42594s cudaFree
3.22% 360.34ms 9 40.038ms 3.2760us 355.64ms cudaMalloc
0.04% 4.3488ms 4 1.0872ms 1.0598ms 1.1593ms cuDeviceTotalMem
0.02% 2.3633ms 404 5.8490us 111ns 869.22us cuDeviceGetAttribute
0.00% 144.50us 4 36.125us 33.254us 41.317us cuDeviceGetName
0.00% 106.43us 6 17.737us 4.7910us 51.453us cudaMemcpy
0.00% 101.46us 2 50.732us 44.479us 56.985us cudaLaunchKernel
0.00% 20.926us 1 20.926us 20.926us 20.926us cudaDeviceSynchronize
0.00% 7.2850us 4 1.8210us 491ns 4.6210us cuDeviceGetPCIBusId
0.00% 1.7650us 8 220ns 118ns 672ns cuDeviceGet
0.00% 926ns 4 231ns 218ns 264ns cuDeviceGetUuid
0.00% 614ns 3 204ns 130ns 310ns cuDeviceGetCount
$

And we see that the performance is approximately the same as the CUDA C++ version.

Notes:

A. I've omitted the SASS analysis for the final variant; that exercise is left to the reader. However, the unrolled loop portion is a repeating sequence of LDG-LDG-DFMA, just as witnessed in the CUDA C++ version.

B. You might have two questions: 1. Why is it this way? I've tried to answer that. 2. Should it be this way / does it need to be this way? I have not answered that; I believe the answer involves statements about Fortran that I don't wish to make, as well as questions for the compiler engineers. If you think the 2D realization should match the performance of the 1D realization, I suggest filing a bug.

C. It took a little while of head-scratching before I noticed you had done this in the C++ case but not the Fortran case:

int tx = blockIdx.y * blockDim.y + threadIdx.y;
    ^             ^            ^             ^

D. One of the reasons why I think this discrepancy is not trivially resolvable at compile time is that the compiler's 2D array handling in CUDA Fortran (whether needed or not) seems to involve the creation of a run-time metadata package which includes the array width, perhaps amongst other information. This metadata package is transmitted to the device separately from the array data payload - this much can be surmised from the profiler output above, or seen more clearly by running with --print-gpu-trace. In your CUDA C++ case, the array width (SIZE) is effectively known at compile time. In the Fortran case, it is not (or alternatively, if it is, the compiler does not seem to be propagating this knowledge through the kernel code generation). You can see that the "improved" addressing mode above relies on offsets that are known at compile time. Whether or not this "should" be resolvable with some more careful treatment of the Fortran, I cannot say. However, I would expect that some sort of contiguous declaration/decoration by itself is not sufficient to resolve this discrepancy for the compiler: that declaration alone would not obviate the need for on-the-fly run-time indexing calculations based on the array width.

Share large constant data among CUDA threads

In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:

  1. It is not limited to 64kB like __constant__ memory is.
  2. Unlike the constant cache, it does not expect or require "uniform access" in order to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp access the same location or same constant memory value (per read cycle/instruction).

The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
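
A minimal sketch of that decoration (the kernel and names here are made up for illustration; the key part is the const ... __restrict__ on the pointer to the large, read-only data):

// Hypothetical kernel: 'table' is the large constant data shared by all threads.
// const + __restrict__ lets the compiler emit LDG (read-only cache) loads for it.
__global__ void apply_table(const float * __restrict__ table,
                            const int   * __restrict__ idx,
                            float       * __restrict__ out,
                            int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];   // non-uniform access across the warp is fine here
}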

This read-only cache mechanism is only supported on cc 3.5 or higher GPUs, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 ti), Volta, and Turing generations.

If you have a GPU with compute capability less than 3.5, possibly the best suggestion for similar benefits (larger than __constant__, no need for uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide, there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag cover it as well.
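
For completeness, here is a sketch of the texture-object route (names made up for illustration; texture objects require cc 3.0 or higher, so older devices would need the legacy texture reference API instead):

#include <cuda_runtime.h>

// Read a large, read-only array through the texture cache.
__global__ void read_tex(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);   // cached, read-only fetch
}

// Create a texture object over an existing linear device allocation.
cudaTextureObject_t make_tex(const float *d_data, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float *>(d_data);
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}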

CUDA clock() leads to zero clock cycles

Quoting from one of the answers you linked to in your question:

You should also be aware that the compiler and assembler do perform
instruction re-ordering so you might want to check that the clock
calls don't wind up getting put next to each other in the SASS output
(use cuobjdump to check).

I believe this is the source of your problem. If I compile your kernel with the CUDA 8 release toolkit and then disassemble the resulting machine code with cuobjdump, I get the following:

    code for sm_52
Function : _Z5saxpyifPfS_Pi
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { CS2R R7, SR_CLOCKLO; /* 0x50c8000005070007 */
/*0018*/ S2R R0, SR_CTAID.X; } /* 0xf0c8000002570000 */
/* 0x083fc400e3e007f0 */
/*0028*/ { CS2R R8, SR_CLOCKLO; /* 0x50c8000005070008 */
/*0030*/ S2R R2, SR_TID.X; } /* 0xf0c8000002170002 */
/*0038*/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ; /* 0x4f107f8000270003 */
/* 0x081fc400fec207f6 */
/*0048*/ XMAD R2, R0.reuse, c[0x0] [0x8], R2; /* 0x4e00010000270002 */
/*0050*/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2; /* 0x5b30011800370000 */
/*0058*/ ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT; /* 0x4b6d038005070007 */
/* 0x001fd400fc2007ec */
/*0068*/ SHR R9, R0, 0x1f; /* 0x3829000001f70009 */
/*0070*/ @!P0 SHF.L.U64 R2, RZ, 0x2, R0; /* 0x36f800400028ff02 */
/*0078*/ @!P0 SHF.L.U64 R3, R0, 0x2, R9; /* 0x36f804c000280003 */
/* 0x001fc040fe4207f6 */
/*0088*/ @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148]; /* 0x4c10800005280204 */
/*0090*/ @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c]; /* 0x4c10080005380305 */
/*0098*/ { @!P0 IADD R2.CC, R2, c[0x0][0x150]; /* 0x4c10800005480202 */
/*00a8*/ @!P0 LDG.E R4, [R4]; } /* 0x0005c400fe400076 */
/* 0xeed4200000080404 */
/*00b0*/ @!P0 IADD.X R3, R3, c[0x0][0x154]; /* 0x4c10080005580303 */
/*00b8*/ @!P0 LDG.E R6, [R2]; /* 0xeed4200000080206 */
/* 0x001fd800fea007e1 */
/*00c8*/ LEA R10.CC, R0, c[0x0][0x158], 0x2; /* 0x4bd781000567000a */
/*00d0*/ IADD R8, -R7, R8; /* 0x5c12000000870708 */
/*00d8*/ LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2; /* 0x1a17048005770009 */
/* 0x001fc008fe4007f1 */
/*00e8*/ MOV R7, R9; /* 0x5c98078000970007 */
/*00f0*/ @!P0 FFMA R0, R4, c[0x0][0x144], R6; /* 0x4980030005180400 */
/*00f8*/ { MOV R6, R10; /* 0x5c98078000a70006 */
/*0108*/ @!P0 STG.E [R2], R0; } /* 0x001ffc005e2001f2 */
/* 0xeedc200000080200 */
/*0110*/ STG.E [R6], R8; /* 0xeedc200000070608 */
/*0118*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0128*/ BRA 0x120; /* 0xe2400fffff07000f */
/*0130*/ NOP; /* 0x50b0000000070f00 */
/*0138*/ NOP; /* 0x50b0000000070f00 */
.................................

You can see that the clock instructions have been reordered so that they are issued without any code in between them. That will result in a zero, or very close to zero, clock measurement for many, if not all, warps running this code.
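
A common mitigation (sketched below with a hypothetical kernel, and not a guaranteed fix) is to make the timed work produce a value that is stored to global memory, and then to verify in the SASS that the two clock reads actually bracket the instructions you care about:

// Sketch of a timing pattern: the timed work feeds a value that is written out,
// so the compiler cannot discard it, and the elapsed count is stored per thread.
// Even so, check the SASS (cuobjdump -sass) to confirm the two clock reads were
// not scheduled back to back.
__global__ void timed_saxpy(int n, float a, const float *x, float *y,
                            long long *cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        long long start = clock64();
        float r = a * x[i] + y[i];   // the work being timed
        y[i] = r;                    // result is consumed before 'stop' is read
        long long stop = clock64();
        cycles[i] = stop - start;
    }
}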

PHP form code sometimes being executed twice simultaneously

See a client-side (jQuery) solution here: Prevent double submission of forms in jQuery

And a PHP one here:
http://phpsense.com/2006/prevent-duplicate-form-submission/

Improving the Efficiency of Compact/Scatter in CUDA

I found that the algorithm mentioned in this poster (a similar algorithm is also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I made a few tweaks to the poster's algorithm: 1) eliminating the final warp-shuffle reduction in phase 1, since the elements can simply be summed as they are calculated; 2) giving the function the ability to work on more than just arrays sized as a multiple of 1024, plus adding grid-strided loops; and 3) allowing each thread to load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks I can make, but for now this is good.

//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int mask;
    unsigned int cnt = 0; //per-warp total, accumulated by lane 0 only

    for(int j = 0; j < 32; j++){
        int index = (warpID<<10)+(j<<5)+lnID;

        bool pred;
        if(index >= true_length) pred = false; //>= guards the out-of-bounds read at index == true_length
        else pred = predicate(input[index]);
        mask = __ballot(pred); //__ballot_sync on CUDA 9+

        if(lnID == 0) {
            flag[(warpID<<5)+j] = mask;
            cnt += __popc(mask);
        }
    }

    if(lnID == 0) counter[warpID] = cnt; //store sum
}

//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;

for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
    int lnID = threadIdx.x % warp_size;
    int warpID = id >> 5;

    unsigned int predmask;
    unsigned int cnt;

    predmask = flag[(warpID<<5)+lnID];
    cnt = __popc(predmask);

    //parallel prefix sum
    #pragma unroll
    for(int offset = 1; offset < 32; offset <<= 1){
        unsigned int n = __shfl_up(cnt, offset); //__shfl_up_sync on CUDA 9+
        if(lnID >= offset) cnt += n;
    }

    unsigned int global_index = 0;
    if(warpID > 0) global_index = scan_Index[warpID - 1];

    for(int i = 0; i < 32; i++){
        unsigned int mask = __shfl(predmask, i); //broadcast from thread i; __shfl_sync on CUDA 9+
        unsigned int sub_group_index = 0;
        if(i > 0) sub_group_index = __shfl(cnt, i-1);
        if(mask & (1 << lnID)){
            compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
        }
    }
}
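
For reference, phase 2 is just a device-wide inclusive scan of the per-warp counts; with CUB that is roughly the following (a sketch assuming counter, scan_Index, and num_warps correspond to the arrays used in the kernels above):

#include <cub/cub.cuh>

// Kernel phase 2 with CUB: inclusive scan of the per-warp counts (counter)
// into the running offsets (scan_Index) consumed by phase 3.
void run_phase2(unsigned int *counter, unsigned int *scan_Index, int num_warps)
{
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call only queries the required temporary storage size.
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, counter, scan_Index, num_warps);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, counter, scan_Index, num_warps);
    cudaFree(d_temp);
}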

EDIT: There is a newer article by a subset of the poster authors in which they examine a faster variation of compaction than what is written above. However, their new version is not order-preserving, so it is not useful for me, and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.


