int SplitPointRow = nRows / 2; // for starters
tbb::parallel_invoke(
  [&](){ doCPUpart(A, B, C, nRows, nCols, 0, SplitPointRow); },
  [&](){ doGPUpart(A, B, C, nRows, nCols, SplitPointRow, nRows); }
);
............
void doGPUpart(double** A, double** B, double** C, int nRows, int nCols, int RowBegin, int RowEnd)
{
  copyDataToGPU(...);
  launchGPUMatrixMultiply(...);
  synchronize(...);
  copyDataFromGPU(...);
}
void doCPUpart(...)
{
  // do its subset of the matrix rows
}
As others have pointed out, you will likely need to oversubscribe your threads by at least one thread. This is because the thread issuing the synchronize() is going to stall. If you are on a single-core system this will result in serialization, so on a single-core system you would instead want something like
int SplitPointRow = nRows / 2; // for starters
copyDataToGPU(...);
launchGPUMatrixMultiply(...);
doCPUpart(A, B, C, nRows, nCols, 0, SplitPointRow);
synchronize(...);
copyDataFromGPU(...);
Of course, you would then code two paths: one for single-core and one for multi-core systems.
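The two-path idea above can be sketched in plain C++ (no TBB or CUDA here; `gpuPartStub` and `cpuPart` are stand-in names I made up, with the GPU side simulated by ordinary row writes). The multi-core path runs the blocking GPU call on its own thread while the main thread does the CPU rows; the single-core path simply serializes, as the post describes.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the CUDA sequence in the post
// (copyDataToGPU / launchGPUMatrixMultiply / synchronize / copyDataFromGPU).
static void gpuPartStub(std::vector<double>& C, int rowBegin, int rowEnd, int nCols) {
    for (int r = rowBegin; r < rowEnd; ++r)
        for (int c = 0; c < nCols; ++c)
            C[r * nCols + c] = 1.0;  // pretend the GPU computed these rows
}

// CPU's share of the result rows.
static void cpuPart(std::vector<double>& C, int rowBegin, int rowEnd, int nCols) {
    for (int r = rowBegin; r < rowEnd; ++r)
        for (int c = 0; c < nCols; ++c)
            C[r * nCols + c] = 2.0;
}

void multiplySplit(std::vector<double>& C, int nRows, int nCols) {
    int splitPointRow = nRows / 2;  // "for starters", as in the post
    if (std::thread::hardware_concurrency() > 1) {
        // Multi-core path: overlap the blocking GPU call with CPU work.
        std::thread gpu(gpuPartStub, std::ref(C), splitPointRow, nRows, nCols);
        cpuPart(C, 0, splitPointRow, nCols);
        gpu.join();
    } else {
        // Single-core path: no benefit from a second thread, so serialize.
        gpuPartStub(C, splitPointRow, nRows, nCols);
        cpuPart(C, 0, splitPointRow, nCols);
    }
}
```

The split point would then be tuned so the CPU and GPU halves finish at roughly the same time.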
*** Also, check your CUDA implementation for multi-threaded programming issues.
Jim Dempsey
TBB is agnostic of CUDA (as well as any other third-party library, except for language support RTLs), so it does not consciously do anything that would prevent your asynchronous CUDA computation from running. Honestly, I have no idea why the setup you described does not work as expected. Out of curiosity, what happens if you replace the parallel_for call with a long do-nothing loop?
The idea of invoking a separate thread to make the CUDA call makes a lot of sense to me. As others noted, since that call will presumably block, you should "oversubscribe" the system. The most natural way to do that, however, is not tbb::task::enqueue() or tbb::parallel_invoke(), I think, but std::thread (TBB provides an implementation in case your compiler does not yet support this C++11 feature). That way you do not have to oversubscribe TBB, because its worker threads are not impacted.
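A minimal sketch of that std::thread suggestion, with the blocking "CUDA" call faked by a sleep (function names here are illustrative, not from any real API): the dedicated thread absorbs the stall, so the calling thread, which in the real setup would be running TBB work on the CPU rows, keeps computing.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical blocking GPU call: in real code this would be the
// copy / launch / synchronize sequence; here it just sleeps.
static void blockingGpuCall(std::atomic<bool>& done) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    done = true;
}

// Offload the blocking call to its own std::thread so the caller
// (e.g. a TBB parallel_for over the CPU rows) is not stalled.
bool offloadGpuWork() {
    std::atomic<bool> gpuDone{false};
    std::thread gpuThread(blockingGpuCall, std::ref(gpuDone));

    // CPU-side work proceeds while the GPU thread is blocked.
    long cpuSum = 0;
    for (int i = 0; i < 1000; ++i) cpuSum += i;

    gpuThread.join();  // wait for the GPU side to finish
    return gpuDone && cpuSum == 499500;
}
```

Because the blocking thread is outside the TBB pool, TBB's worker count stays matched to the hardware and no oversubscription of the pool itself is needed.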