How to use gpu::Stream in OpenCV?
By default, all gpu module functions are synchronous, i.e. the current CPU thread is blocked until the operation finishes. gpu::Stream is a wrapper for cudaStream_t and allows asynchronous, non-blocking calls. See the "CUDA C Programming Guide" for detailed information about CUDA asynchronous concurrent execution.
Most gpu module functions take an additional gpu::Stream parameter. If you pass a non-default stream, the function call is asynchronous and is added to that stream's command queue.
gpu::Stream also provides methods for asynchronous memory transfers between CPU<->GPU and GPU<->GPU. However, CPU<->GPU asynchronous memory transfers work only with page-locked host memory. The gpu::CudaMem class encapsulates such memory.
Currently, you may run into problems if the same operation is enqueued twice with different data to different streams. Some functions use constant or texture GPU memory, and the next call may update that memory before the previous one has finished. Calling different operations asynchronously is safe, because each operation has its own constant buffer. Memory copy/upload/download/set operations on buffers you hold are also safe.
Here is a small sample:
// allocate page-locked memory
CudaMem host_src_pl(768, 1024, CV_8UC1, CudaMem::ALLOC_PAGE_LOCKED);
CudaMem host_dst_pl;
// get Mat header for CudaMem (no data copy)
Mat host_src = host_src_pl;
// fill mat on CPU
someCPUFunc(host_src);
GpuMat gpu_src, gpu_dst;
// create Stream object
Stream stream;
// next calls are non-blocking
// first upload data from host
stream.enqueueUpload(host_src_pl, gpu_src);
// perform blur
blur(gpu_src, gpu_dst, Size(5,5), Point(-1,-1), stream);
// download result back to host
stream.enqueueDownload(gpu_dst, host_dst_pl);
// call another CPU function in parallel with GPU
anotherCPUFunc();
// wait for the GPU to finish
stream.waitForCompletion();
// now you can use GPU results
Mat host_dst = host_dst_pl;
Can I use gpu::Stream for CascadeClassifier_GPU on OpenCV and how?
CascadeClassifier_GPU uses a mixed GPU/CPU implementation and performs extra synchronizations internally, which is why it doesn't support asynchronous mode with a gpu::Stream parameter. To run it asynchronously with the rest of your code, you need to use a separate CPU thread for it.
Must a CUDA stream be waited on for completion even if the output data is sent to OpenGL instead of the CPU?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular, it says:
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you don't need to make the CPU wait for CUDA completion, but you must call cudaGraphicsUnmapResources, which puts the appropriate barrier in the asynchronous instruction stream. Note that, unlike in your CPU transfer code, this call goes after CUDA copies the data into the OpenGL buffer.
OpenCV3: where has cv::cuda::Stream::enqueueUpload() gone?
It can now be done via the void GpuMat::upload(InputArray arr, Stream& stream) method:
cv::cuda::GpuMat d_mat;
cv::cuda::HostMem h_mat;
cv::cuda::Stream stream;
d_mat.upload(h_mat, stream);