Should I Unify Two Similar Kernels with an 'If' Statement, Risking Performance Loss

Should I unify two similar kernels with an 'if' statement, risking performance loss?

You have a third alternative, which is to use C++ templating and make the variable which is used in the if/switch statement a template parameter. Instantiate each version of the kernel you need, and then you have multiple kernels doing different things with no branch divergence or conditional evaluation to worry about, because the compiler will optimize away the dead code and the branching with it.

Perhaps something like this:

template<int action>
__global__ void kernel()
{
    switch(action) {
    case 1:
        // First code
        break;

    case 2:
        // Second code
        break;
    }
}

template __global__ void kernel<1>();
template __global__ void kernel<2>();
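If the value of action is only known at run time, the host code can still select the matching instantiation with an ordinary branch; only the dispatch is dynamic, while each kernel itself stays branch-free. A minimal sketch (the launch_kernel helper and the launch configuration are just placeholders):

void launch_kernel(int action)
{
    // runtime dispatch to the compile-time specialized kernels
    if (action == 1)
        kernel<1><<<128, 256>>>();
    else if (action == 2)
        kernel<2><<<128, 256>>>();
}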

Is it efficient to use boolean algebra and a passed argument to implement two versions of a similar kernel in CUDA?

It's not necessarily a bad idea. Based on what you have shown and your stipulations, there shouldn't be any significant warp divergence that I can see.

However, you might also consider templating as described here (in fact, I believe your question is very nearly a duplicate of that one -- perhaps this one should be marked as such). For such a simple example (only two options), that will allow you to create kernels that are optimized at compile time and therefore have no branching as a result of the usage of doOpOne.

Smart design for large kernel with different inputs that only changes one line of code
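For illustration, a minimal sketch of that approach, with a boolean flag such as doOpOne as the template parameter, might look like this (the kernel name, arguments, and operation bodies are placeholders, since the original kernel isn't shown):

template <bool doOpOne>
__global__ void myKernel(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;

    // this branch is resolved at compile time, so each instantiation
    // contains only one of the two code paths
    if (doOpOne) {
        data[tid] *= 2.0f;   // placeholder for "operation one"
    } else {
        data[tid] += 1.0f;   // placeholder for "operation two"
    }
}

template __global__ void myKernel<true>(float *, int);
template __global__ void myKernel<false>(float *, int);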


CUDA kernels don't seem to be able to be overloaded either.

It should be possible to overload kernels. Here is one possible approach, using overloading (and no templating):

$ cat t1648.cu
// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

__device__ float my_common(float *d, int width, unsigned int x, unsigned int y){

    // 200 lines of common code...
    return d[y * width + x];
}




////////////////////////////////////////////////////////////////////////////////
// Kernels
////////////////////////////////////////////////////////////////////////////////
//! Write to a cuArray using surface writes
//! @param gIData input data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void WriteKernel(float *gIData, int width, int height,
                            cudaSurfaceObject_t outputSurface)
{
    // calculate surface coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to cuarray (via surface reference)
    surf2Dwrite(my_common(gIData, width, x, y),
                outputSurface, x*4, y, cudaBoundaryModeTrap);
}

__global__ void WriteKernel(float *gIData, int width, int height,
                            float *out)
{
    // calculate coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // read from global memory and write to global memory
    out[y*width+x] = my_common(gIData, width, x, y);
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    printf("starting...\n");

    unsigned width = 256;
    unsigned height = 256;
    unsigned int size = width * height * sizeof(float);

    // Allocate device memory for result
    float *dData = NULL;
    checkCudaErrors(cudaMalloc((void **) &dData, size));

    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    float *out;
    cudaMalloc(&out, size);
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;

    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, outputSurface);
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, out);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFree(dData));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -I/usr/local/cuda/samples/common/inc t1648.cu -o t1648
$

The above example was hacked together rapidly from the simpleSurfaceWrite CUDA sample code. It is not intended to be functional or run "correctly". It is designed to show how overloading can be used from a code structure standpoint to address the stated objective.

Does C++ have a way to do Cuda style kernel templates, where parameters produce separate compilations?

Something like this?

#include <iostream>

template <int x>
void function() {
    if constexpr (x == 1) {
        std::cout << "hello\n";
    } else {
        std::cout << "world\n";
    }
}

int main() {
    function<3>();
}

How not to repeat myself without macros when writing similar CUDA kernels?

Updated: I was told in the comments that classes and inheritance don't mix well with CUDA. Therefore only the first part of the answer applies to CUDA, while the rest answers the more general C++ part of your question.

For CUDA, you will have to use pure functions, "C-style":

struct KernelVars {
    int a;
    int b;
    int c;
};

__device__ void init(KernelVars& vars) {
    INIT(); //whatever the actual code is
}

__device__ void end(KernelVars& vars) {
    END(); //whatever the actual code is
}

__global__ void KernelA(...) {
    KernelVars vars;
    init(vars);
    vars.b = vars.a + vars.c;
    end(vars);
}

This is the answer for general C++, where you would use OOP techniques like constructors and destructors (they are perfectly suited for those init/end pairs), or the template method pattern which can be used with other languages as well:

Using ctor/dtor and templates, "C++ Style":

class KernelBase {
protected:
    int a, b, c;

public:
    KernelBase() {
        INIT(); //replace by the contents of that macro
    }
    ~KernelBase() {
        END(); //replace by the contents of that macro
    }
    virtual void run() = 0;
};

struct KernelAdd : KernelBase {
    void run() { b = a + c; }
};

struct KernelSub : KernelBase {
    void run() { b = a - c; }
};

template<class K>
void kernel(...)
{
    K k;
    k.run();
}

void kernelA( ... ) { kernel<KernelAdd>(); }

Using template method pattern, general "OOP style"

class KernelBase {
    virtual void do_run() = 0;
protected:
    int a, b, c;
public:
    void run() { //the template method
        INIT();

        do_run();

        END();
    }
};

struct KernelAdd : KernelBase {
    void do_run() { b = a + c; }
};

struct KernelSub : KernelBase {
    void do_run() { b = a - c; }
};

void kernelA(...)
{
    KernelAdd k;
    k.run();
}

C++ design for CUDA codes

CUDA supports type templating, and it is without doubt the most efficient way to implement kernel code where you need to handle multiple types in the same code.

As a trivial example, consider a simple BLAS AXPY type kernel:

template<typename Real>
__global__ void axpy(const Real *x, Real *y, const int n, const Real a)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for(; tid < n; tid += stride) {
        Real yval = y[tid];
        yval += a * x[tid];
        y[tid] = yval;
    }
}

This templated kernel can be instantiated for both double and single precision without loss of generality:

template __global__ void axpy<float>(const float *, float *, const int, const float);
template __global__ void axpy<double>(const double *, double *, const int, const double);

The Thrust template library, which ships with all recent versions of the CUDA toolkit, makes extensive use of this facility to implement type-agnostic algorithms.
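As a rough illustration of the same idea at the library level, an equivalent AXPY operation can be expressed with Thrust using a templated functor and thrust::transform. This sketch assumes the data already lives on the device in thrust::device_vector containers, and it is not how Thrust implements its own algorithms internally:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

template <typename Real>
struct axpy_functor {
    const Real a;
    axpy_functor(Real a_) : a(a_) {}
    __host__ __device__ Real operator()(const Real &x, const Real &y) const {
        return a * x + y;   // y[i] = a * x[i] + y[i]
    }
};

template <typename Real>
void axpy_thrust(const thrust::device_vector<Real> &x,
                 thrust::device_vector<Real> &y, Real a)
{
    // one call works for float, double, or any type supporting * and +
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), axpy_functor<Real>(a));
}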

Fastest (or most elegant) way of passing constant arguments to a CUDA kernel

Just pass them by value. The compiler will automagically put them in the optimal place to facilitate cached broadcast to all threads in each block - either shared memory in compute capability 1.x devices, or constant memory/constant cache in compute capability >= 2.0 devices.

For example, if you had a long list of arguments to pass to the kernel, a struct passed by value is a clean way to go:

struct arglist {
    float magicfloat_1;
    float magicfloat_2;
    //......
    float magicfloat_19;
    int magicint1;
    //......
};

__global__ void kernel(...., const arglist args)
{
    // you get the idea
}

[standard disclaimer: written in browser, not real code, caveat emptor]
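On the host side, the struct is simply filled in and passed at the launch site like any other by-value argument. A hypothetical, simplified launch might look like this (the real kernel's argument list was elided above, so the signature and launch configuration here are placeholders):

void launch_example(float *d_out)
{
    arglist args;
    args.magicfloat_1 = 1.0f;
    args.magicint1 = 42;
    // ... fill in the remaining fields ...

    // the struct travels by value in the kernel's parameter buffer
    kernel<<<128, 256>>>(d_out, args);
}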

If it turned out one of your magicint actually only took one of a small number of values which you know beforehand, then templating is an extremely powerful tool:

template<int magiconstant1>
__global__ void kernel(....)
{
    for(int i=0; i < magiconstant1; ++i) {
        // .....
    }
}

template __global__ void kernel<3>(....);
template __global__ void kernel<4>(....);
template __global__ void kernel<5>(....);

The compiler is smart enough to recognise that magiconstant1 makes the loop trip count known at compile time and will automatically unroll the loop for you. Templating is a very powerful technique for building fast, flexible codebases, and you would be well advised to get accustomed to it if you haven't already done so.

Template parameter as function specifier and compiler optimization

Short answer: no.

Templates are instantiated and generated purely at compile time, so you can't use the values in argv, since they are not known at compile time.

It makes me wonder why you did not just give it a try and throw that code at a compiler - it would have told you that template arguments must be compile-time constants.

Update:
Since you told us in the comments that it's not primarily about performance, but about readability, I'd recommend using switch/case:

template <char c> void kernel() {
    //...
    switch(c) { /* ... */ }
}

switch (argv[1][0]) {
case 'a':
    kernel<'a'>();
    break;
case 'b':
    kernel<'b'>();
    break;
//...
}

Since the value you have to make the decision on (i.e. argv[1][0]) is only known at runtime, you have to use a runtime decision mechanism. Of those, switch/case is among the fastest, especially if there are not too many different cases (but more than two) and especially if there are no gaps between the cases (i.e. 'a', 'b', 'c' instead of 1, 55, 2048). The compiler can then produce very fast jump tables.


