NVCC CUDA Cross Compiling Cannot Find "-lcudart"

NVCC CUDA cross compiling cannot find -lcudart

It turns out that the CUDA installer I was using from NVIDIA does not allow cross compiling for my CARMA board; the correct installer has to be downloaded from the board manufacturer, SECO.

skipping incompatible libcudart.so when searching for -lcudart

This warning often happens when trying to link 64-bit code against a 32-bit library; see this question: Skipping Incompatible Libraries at compile.

You need to distinguish two library files:

  • $CUDA_HOME/lib/libcudart.so, the 32-bit version of the cudart library.
  • $CUDA_HOME/lib64/libcudart.so, the 64-bit version of the cudart library.

(in your case, $CUDA_HOME is /usr/local/cuda-5.0)
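If you are not sure which is which, the file utility reports each library's architecture; for example, with the paths above:

file /usr/local/cuda-5.0/lib/libcudart.so      # should report a 32-bit ELF shared object
file /usr/local/cuda-5.0/lib64/libcudart.so    # should report a 64-bit ELF shared object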

Basically, the linker finds the 32-bit library first (-L options are searched in order) and returns that warning even if it ends up finding the proper library.

You probably need to add $CUDA_HOME/lib64 to your LD_LIBRARY_PATH environment variable before $CUDA_HOME/lib so that ld can find the proper library for your 64-bit architecture before the 32-bit version.
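For example, a minimal sketch of that ordering, assuming $CUDA_HOME is /usr/local/cuda-5.0 as above:

# put lib64 ahead of lib so the 64-bit libcudart.so is found first
export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib64:/usr/local/cuda-5.0/lib:$LD_LIBRARY_PATH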

Error nvcc cannot find a supported cl version when compiling CUDA 5 with VS2012

It seems that if your code uses dynamic parallelism, you need to use MSVC 2010.
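If you have several Visual Studio versions installed, one way to point nvcc at a specific host compiler is the -ccbin flag. A sketch, assuming the default VS2010 install path (adjust to your machine):

:: hypothetical example; the path below is the default MSVC 2010 location
nvcc -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -o app.exe app.cu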

NVCC Cuda 5.0 on Ubuntu 12.04 /usr/lib/libudt.so file format not recognized

I think what @kevinDTimm and @talonmies mentioned is correct: ARM support does not exist until CUDA 5.5, so the CUDA 5.0 download from the official website and an ARM g++ toolchain cannot work together.

In my case, the project requires CUDA 5.0 and ARM together because we are developing on a CARMA board. I have to use CUDA 5.0 for the CARMA board, and ARM for compiling the code before transferring the files to the CARMA board.

Therefore, for anyone else working on a project using a CARMA board, here is a tip: do not download CUDA 5.0 from the official website; download cuda-linux-armv7-gnueabihf-cross-compilation-rel-5.0.47-15134578.run from this website instead: url

This supports CUDA 5.0 and ARM together. However, as mentioned, ARM here is only for cross compilation purposes: the code compiled in the host Ubuntu VM can only be executed on the CARMA board.
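For reference, a .run installer of this kind is typically executed along these lines (the filename is the one given above; the exact prompts may differ):

chmod +x cuda-linux-armv7-gnueabihf-cross-compilation-rel-5.0.47-15134578.run
sudo ./cuda-linux-armv7-gnueabihf-cross-compilation-rel-5.0.47-15134578.run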

After installing CUDA on Tegra TX1 and sourcing the bashrc, it will not find NVCC

First of all, you should check whether CUDA was really installed!

To do so, go to the path:

/usr/local 

There must be a cuda folder, or a folder named cuda-8.0 or whatever version you installed. Remember the name and the path.
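For example, a quick way to list the installed versions:

ls -d /usr/local/cuda*    # e.g. /usr/local/cuda /usr/local/cuda-8.0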
Now check your bashrc using gedit/vi/nano/vim, whatever you prefer:

vim ~/.bashrc

Go to the bottom of the file. There should be some exports regarding the PATH variable and the LD_LIBRARY_PATH. Check whether the CUDA paths were written to these variables and whether they are overwritten again afterwards.

You must export the path to the bin folder of your CUDA installation and the path to the lib64 folder.

To do so, something like this must stand at the bottom of the bashrc:

export PATH=/usr/local/cuda-8.0/bin:....
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:....

Some other paths may follow after the colon. Make sure that the path given to the CUDA installation is the correct one and that it is not overwritten again.

After you have made the correct changes, do not forget to source the bashrc again.
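For example, assuming the cuda-8.0 paths above:

source ~/.bashrc
which nvcc        # should print /usr/local/cuda-8.0/bin/nvcc
nvcc --version    # should report the installed CUDA compiler version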

Why does nvcc remove my if branch during compilation?

This would appear to be a bug in nvcc 10.1 (the only version I have tested). It appears that the compiler's attempts at automatic inline expansion of the rgba2rgb and bgra2rgb functions are breaking somehow, so that the result of compiling this:

__device__ float3 pixel2rgb(uchar4 p, bool flag)
{
    if (flag)
    {
        return bgra2rgb(p);
    }
    else
    {
        return rgba2rgb(p);
    }
}

is effectively this:

__device__ float3 pixel2rgb(uchar4 p, bool flag)
{
    return rgba2rgb(p);
}

It isn't related to textures per se, because I can reproduce the problem with this code reading directly from global memory:

#include <stdint.h>
#include <cuda.h>
#include <cstdio>

__device__ float3 rgba2rgb(uchar4 p)
{
    return make_float3(p.x/255.0f, p.y/255.0f, p.z/255.0f);
}

__device__ float3 bgra2rgb(uchar4 p)
{
    return make_float3(p.z/255.0f, p.y/255.0f, p.x/255.0f);
}

__device__ float3 pixel2rgb(uchar4 p, bool flag)
{
    if (flag)
    {
        return bgra2rgb(p);
    }
    else
    {
        return rgba2rgb(p);
    }
}

__global__ void func2(
    uchar4* pixels,
    size_t width, size_t height,
    bool flag
)
{
    size_t x_p = blockIdx.x * blockDim.x + threadIdx.x;
    size_t y_p = blockIdx.y * blockDim.y + threadIdx.y;

    if ((x_p < width) && (y_p < height)) {

        size_t idx = x_p * width + y_p;
        uchar4 pixel = pixels[idx];
        float3 rgb = pixel2rgb(pixel, flag);

        printf("flag=%d idx=%ld rgb=(%f,%f,%f)\n", flag, idx, rgb.x, rgb.y, rgb.z);
    }
}

int main()
{
    int width = 2, height = 2;
    uchar4* data;
    cudaMallocManaged(&data, width * height * sizeof(uchar4));

    data[0] = make_uchar4(1, 2, 3, 4);
    data[1] = make_uchar4(2, 3, 4, 5);
    data[2] = make_uchar4(3, 4, 5, 6);
    data[3] = make_uchar4(4, 5, 6, 7);

    dim3 bdim(2,2);
    func2<<<1, bdim>>>(data, width, height, true);
    cudaDeviceSynchronize();

    func2<<<1, bdim>>>(data, width, height, false);
    cudaDeviceSynchronize();

    cudaDeviceReset();

    return 0;
}

$ nvcc -arch=sm_52 -o wangwang wangwang.cu
$ ./wangwang
flag=1 idx=0 rgb=(0.003922,0.007843,0.011765)
flag=1 idx=2 rgb=(0.011765,0.015686,0.019608)
flag=1 idx=1 rgb=(0.007843,0.011765,0.015686)
flag=1 idx=3 rgb=(0.015686,0.019608,0.023529)
flag=0 idx=0 rgb=(0.003922,0.007843,0.011765)
flag=0 idx=2 rgb=(0.011765,0.015686,0.019608)
flag=0 idx=1 rgb=(0.007843,0.011765,0.015686)
flag=0 idx=3 rgb=(0.015686,0.019608,0.023529)

I presume that the make_uchar4 version you mention works because the compiler will pre-compute the results due to the constant inputs and eliminate the conversion function code altogether.

Playing around, I was able to fix this by changing the code like this:

__device__ __inline__ float3 rgba2rgb(uchar4 p)
{
    return make_float3(p.x/255.0f, p.y/255.0f, p.z/255.0f);
}

__device__ __inline__ float3 bgra2rgb(uchar4 p)
{
    return make_float3(p.z/255.0f, p.y/255.0f, p.x/255.0f);
}

When I do this, the compiler injects some swizzling logic into the inline PTX expansion it generates:

ld.global.v4.u8 {%rs2, %rs3, %rs4, %rs5}, [%rd10];
and.b16         %rs8, %rs1, 255;        <---- %rs1 is the input bool
setp.eq.s16     %p4, %rs8, 0;
selp.b16        %rs9, %rs2, %rs4, %p4;
and.b16         %rs10, %rs9, 255;
selp.b16        %rs11, %rs4, %rs2, %p4;
and.b16         %rs12, %rs11, 255;

and things work correctly (your mileage may vary):

$ nvcc  -arch=sm_52 -o wangwang wangwang.cu 
$ ./wangwang
flag=1 idx=0 rgb=(0.011765,0.007843,0.003922)
flag=1 idx=2 rgb=(0.019608,0.015686,0.011765)
flag=1 idx=1 rgb=(0.015686,0.011765,0.007843)
flag=1 idx=3 rgb=(0.023529,0.019608,0.015686)
flag=0 idx=0 rgb=(0.003922,0.007843,0.011765)
flag=0 idx=2 rgb=(0.011765,0.015686,0.019608)
flag=0 idx=1 rgb=(0.007843,0.011765,0.015686)
flag=0 idx=3 rgb=(0.015686,0.019608,0.023529)

I would report this as a bug to NVIDIA.

Pass preprocessing variable to NVCC for compiling CUDA?

Whatever you specify after -D gets defined before the input files are processed. However, it does not remove definitions that occur in the file itself. So if you pass -DDEBUG_OUTPUT but the file also contains #define DEBUG_OUTPUT, the latter is a redefinition of the former. To handle that case, you can write in the file:

//if not specified earlier (e.g. by -D parameter)
#ifndef DEBUG_OUTPUT
//set it now to some default value
#define DEBUG_OUTPUT 0
#endif

Note that this actually has nothing to do with nvcc; the same behaviour appears in C/C++.
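For example, with the guard above in place, the value can be chosen at compile time (app.cu is a hypothetical file name):

nvcc -DDEBUG_OUTPUT=1 -o app app.cu    # DEBUG_OUTPUT is 1
nvcc -o app app.cu                     # DEBUG_OUTPUT falls back to the default 0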

Compiling c++ and cuda code with MinGW in QTCreator

I'm a bit confused: are you using MinGW or Visual Studio? The title states that you are using MinGW, but the project file seems to use a mix of both. You can't mix those two. If you compiled CUDA with Visual Studio 2010 (or downloaded the binary directly from NVIDIA), you HAVE to use VS2010 to compile your project; otherwise it won't work.

I have never used CUDA myself, but the system requirements seem to mention only Visual Studio 2008, 2010 and 2012. If you want to use it with Qt, that's possible; you just have to grab a Qt build compiled with VS (there are 32-bit and 64-bit builds for both on the download page). You can get Visual Studio Express for free as long as you don't create any commercial application with it.

To use QtCreator with the MSVC backend compiler, go to Tools > Options > Build and Run > Kits and add a new Kit with the MSVC compiler, cdb as the debugger, and the Qt version you just downloaded (it must have been compiled with the same Visual Studio version, otherwise it won't work). Then open your project, go to the Projects tab (on the left) and select the Kit you just created. You should probably also clean your .pro file before everything works smoothly.

On a side note, there are a few things that seem out of place in your linker line:

g++ -Wl,-s -Wl,-subsystem,windows -mthreads -o release\Cuda.exe release/cuda/vectorAddition_cuda.o release/obj/main.o  -lglu32 -lopengl32 -lgdi32 -luser32 -lmingw32 -lqtmain -LC:\Cuda\CudaToolkit\lib\Win32 -LC:\Cuda\CudaSamples\common\lib\Win32 -LC:\Cuda\CudaSamples\..\shared\lib\Win32 -LC:\CUDA\VS10\VC\lib -LQMAKE_LIBS -L+= -L-lmsvcrt -L-llibcmt -L-llibcpmt -lcuda -lcudart -LF:\Programs\Qt5.1.1\5.1.1\mingw48_32\lib -lQt5Gui -lQt5Core 

First, this -L+= might be caused by the escaping backslash at the end of QMAKE_LIBDIR.

Then the syntax -L-lmsvcrt seems wrong. It might be because you are using QMAKE_LIBS; I personally never had to use it, and according to the documentation you shouldn't either, as it is an internal variable. The same goes for QMAKE_LIBDIR, by the way. I would just use the LIBS variable for any external dependency.
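A minimal sketch of what that could look like in the .pro file, reusing the CUDA paths from the linker line above:

# use LIBS (not QMAKE_LIBS/QMAKE_LIBDIR) for external dependencies
LIBS += -LC:/Cuda/CudaToolkit/lib/Win32 -lcuda -lcudart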

Error 2 when building CUDA/C++ test program with QtCreator

What you are trying to do is unsupported.

As per the documentation, the only supported toolchain on Windows platforms is Visual Studio. You may well be able to use Qt Creator and qmake to build projects, but you must use the Microsoft toolchain to compile and link code. MinGW and gcc are unsupported. Note also that 32-bit toolchains are unsupported in recent versions of CUDA: you can cross compile to a 32-bit target using the toolchain, but native 32-bit toolchain support on Windows was dropped several years ago.


