Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Overlapping parallel_for with CUDA

sklesser
Beginner
1,848 Views
Hello, I'm implementing a heterogeneous matrix multiply on the CPU + GPU, using TBB (parallel_for) and MKL on the CPU and CUDA on the GPU. My code works well when the multiply is done completely on the CPU or completely on the GPU. However, I'm having trouble getting the system to do work on both devices at once: CUDA and TBB/MKL refuse to overlap.
It looks like parallel_for does not return until it is complete, so I'm organizing my code like so:
copyDataToGpu()
launchMatrixMultiplyOnGpu() // <- non-blocking kernel call returns instantly
launchMatrixMultiplyOnCpu() // <- uses TBB parallel_for
synchronize() // <- waits for GPU code to finish
copyDataFromGpu()
However, it seems that the parallel_for call is blocking CUDA from making progress, since the total time is exactly the sum of the times CUDA and TBB take individually. Is parallel_for known to block CUDA from working? I can overlap CUDA with simple CPU work like for loops just fine, but I would really like to use TBB. Alternatively, is there a non-blocking form of parallel_for that I can launch at the beginning of the computation (ideal, but it does not seem likely)? Thank you for any help!
12 Replies
Anton_M_Intel
Employee
Please refer to task::enqueue, or simply parallel_invoke, to express the overlap (but in the second case you'll need to initialize TBB with at least two threads; otherwise it will execute the tasks sequentially on a single-core machine).
RafSchietekat
Valued Contributor III
But why did it not work as presented? Does the real communication with the GPU only occur at synchronize() time, perhaps?
sklesser
Beginner
I replaced the TBB parallel_for with Windows threads and that does indeed overlap fine with the GPU. I'm still curious why TBB blocks though.
RafSchietekat
Valued Contributor III
Could you tell us if there's something in the synchronize() that's required to send and/or kick off the GPU work as well as the actual synchronisation (waiting for the results)? Or maybe only part of the work is done?
jimdempseyatthecove
Honored Contributor III
Try the following pseudocode:

int SplitPointRow = nRows / 2; // for starters
parallel_invoke(
    [&](){ doCPUpart(A, B, C, nRows, nCols, 0, SplitPointRow); },
    [&](){ doGPUpart(A, B, C, nRows, nCols, SplitPointRow, nRows); }
);
...
void doGPUpart(double** A, double** B, double** C, int nRows, int nCols, int RowBegin, int RowEnd)
{
    copyDataToGPU(...);
    launchGPUMatrixMultiply(...);
    synchronize(...); // the calling thread stalls here until the GPU is done
    copyDataFromGPU(...);
}
void doCPUpart(...)
{
    // do subset of matrix
}

As others have pointed out, you will likely need to oversubscribe your threads by at least one thread, because the thread issuing synchronize() is going to stall. If you are on a single-core system, the parallel_invoke version will result in serialization. In that case you would want something like:

int SplitPointRow = nRows / 2; // for starters
copyDataToGPU(...)
launchGPUMatrixMultiply(...)
doCPUpart(A,B,C, nRows, nCols, 0, SplitPointRow);
synchronize(...)
copyDataFromGPU(...)

Of course, you can code both paths: one for single-core systems and one for multi-core.
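
The two paths can be sketched in standard C++ roughly as follows. This is only an illustration under stated assumptions: gpuPart is a hypothetical stand-in for the copy/launch/synchronize/copy-back sequence and simply runs on the host here, multiplyRows is a naive row-block multiply, and std::async is swapped in for parallel_invoke so the example builds without TBB or CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n x n

// Naive computation of rows [rowBegin, rowEnd) of C = A * B.
void multiplyRows(const Matrix& a, const Matrix& b, Matrix& c,
                  std::size_t n, std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t i = rowBegin; i < rowEnd; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical stand-in for the GPU path (assumption: host code, no CUDA).
void gpuPart(const Matrix& a, const Matrix& b, Matrix& c,
             std::size_t n, std::size_t rowBegin, std::size_t rowEnd) {
    multiplyRows(a, b, c, n, rowBegin, rowEnd);
}

Matrix multiplySplit(const Matrix& a, const Matrix& b, std::size_t n,
                     std::size_t splitRow, bool multiCore) {
    assert(splitRow <= n);
    Matrix c(n * n, 0.0);
    if (multiCore) {
        // Multi-core path: the GPU-driving code gets its own thread, so its
        // blocking wait does not hold up the CPU share. The two halves write
        // disjoint rows of c, so no locking is needed.
        auto gpu = std::async(std::launch::async,
                              [&] { gpuPart(a, b, c, n, splitRow, n); });
        multiplyRows(a, b, c, n, 0, splitRow);
        gpu.get();
    } else {
        // Single-core path: with real CUDA the launch would return at once
        // and the wait would come after the CPU share; in this host-only
        // stand-in the two halves simply run back to back.
        gpuPart(a, b, c, n, splitRow, n);
        multiplyRows(a, b, c, n, 0, splitRow);
    }
    return c;
}

// hardware_concurrency() may return 0 when the count is unknown;
// that case is treated as single-core here.
bool hasMultipleCores() { return std::thread::hardware_concurrency() > 1; }
```

A caller would pick the path once, for example multiplySplit(a, b, n, n / 2, hasMultipleCores()), and tune the split point from measurements.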

Also, check your CUDA implementation for multi-threaded programming issues.

Jim Dempsey
Alexey-Kukanov
Employee

TBB is agnostic of CUDA (as well as any other third-party library except for language support RTLs), so it does not consciously do anything that would prevent your asynchronous CUDA computation from running. Honestly, I have no idea why the setup you described does not work as expected. Out of curiosity, what happens if you replace the parallel_for call with a long do-nothing loop?

The idea of using a separate thread to make the CUDA call makes a lot of sense to me. As others noted, since that thread will presumably block, it is better to "oversubscribe" the system. However, I think the most natural way to do that is not tbb::task::enqueue() or tbb::parallel_invoke() but std::thread (TBB provides it in case your compiler does not yet support this C++11 feature). In this case you don't have to oversubscribe TBB, because its workers are not impacted.
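
A minimal sketch of that arrangement, assuming nothing beyond standard C++: a dedicated std::thread drives the (blocking) GPU sequence while the calling thread does the CPU share. driveGpu and cpuShare are hypothetical placeholders that just fill an array on the host; in the real program the former would wrap the CUDA calls and the latter the parallel_for.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical placeholder for copy-to-GPU, kernel launch, synchronize and
// copy-back. Here it only fills out[begin, end) on the host.
void driveGpu(std::vector<int>& out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = static_cast<int>(i) * 2;
}

// Hypothetical placeholder for the CPU share (the parallel_for in the thread).
void cpuShare(std::vector<int>& out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = static_cast<int>(i) * 2;
}

std::vector<int> runOverlapped(std::size_t n, std::size_t split) {
    assert(split <= n);
    std::vector<int> out(n, 0);
    // The extra thread is the one allowed to stall in the GPU wait; it is not
    // part of any worker pool, so the CPU-side scheduler keeps all its workers.
    std::thread gpuThread(driveGpu, std::ref(out), split, n);
    cpuShare(out, 0, split);  // runs on the calling thread meanwhile
    gpuThread.join();         // both halves are complete after this point
    return out;
}
```

The two halves write disjoint index ranges, which is what makes the unsynchronized sharing of out safe in this sketch.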

RafSchietekat
Valued Contributor III
Side question (really, not a rhetorical one): why don't you use OpenCL for the CPU as well as the GPU?
sklesser
Beginner
The synchronize() call isn't strictly required; it's actually called implicitly whenever a memcpy is done between host and GPU memory. Nothing besides the kernel call is required to kick off work on the GPU. I even tested launching some kernels on the GPU without any synchronization calls or memcpy afterwards, and from my performance monitor it looks like they are still launched.
sklesser
Beginner
Great! Thanks for the advice / code. I'm currently working on a new project and will try this for my GPU + CPU work splitting. I'll let you know how it works as soon as it's done (hopefully a week or two).
sklesser
Beginner
I replaced the parallel_for with a long for loop that just incremented a counter, and it had effectively no impact on the run time until I forced the for loop to take longer than the CUDA work. I had the loop increment a counter and then printed the counter at the end to make sure nothing was being optimized out.
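
The experiment can be approximated without a GPU. In the sketch below, std::async plus a sleep is an assumed stand-in for an asynchronous kernel, and spinCount and measureOverlap are hypothetical names for the counter loop and the timing harness. If the two pieces overlap, the wall time should track the longer of the two rather than their sum.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>

struct OverlapResult {
    std::uint64_t counter;  // value produced by the host loop
    double elapsedMs;       // wall time for device work plus host loop
};

// Host-side busy loop. The volatile counter keeps the compiler from
// optimizing the loop away.
std::uint64_t spinCount(std::uint64_t iterations) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) counter = counter + 1;
    return counter;
}

OverlapResult measureOverlap(std::chrono::milliseconds deviceTime,
                             std::uint64_t iterations) {
    assert(deviceTime.count() >= 0);
    const auto start = std::chrono::steady_clock::now();
    // Assumed stand-in for a non-blocking kernel launch: the "device" work
    // is a sleep running on another thread.
    auto device = std::async(std::launch::async, [deviceTime] {
        std::this_thread::sleep_for(deviceTime);
    });
    const std::uint64_t counter = spinCount(iterations);  // host work
    device.get();                                         // plays the role of synchronize()
    const auto stop = std::chrono::steady_clock::now();
    return {counter,
            std::chrono::duration<double, std::milli>(stop - start).count()};
}
```

Running it with a fixed deviceTime and a growing iteration count should show elapsedMs staying near deviceTime until the loop itself becomes the longer piece, which matches the behaviour reported above for the plain for loop.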
sklesser
Beginner
Right now I just like CUDA more, particularly having complete control over the shared memory in a block, and I've found its profiling, community, and documentation to be much richer and more established than OpenCL's. I suspect that in the long run OpenCL will become more common and better supported, but for now I can squeeze the most performance out of CUDA, so I'm sticking with that.
kankamuso
Beginner
Hi,
I have run into a similar problem: I get exactly the same times when adding new computational resources to a TBB pipeline (in my case a CUDA GPU). Did you solve this issue somehow?
Regards,
Jose.