Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

CUDA and oversubscription


I am trying to build a pipeline in which one of the stages decides whether to run a piece of code on the CPU or on the GPU, like this:

void* MyProcessFilter::operator() (void *item) {
    if (++gpu_working == 1) {
        // GPU was free: run the GPU version of the work here
        gpu_working = 0; // free GPU
    } else {
        // GPU busy, use CPU version of the work here
    }
    return item;
}
The thing is, I have tuned the GPU code to take as long as the CPU code, so I would expect my program to behave as if there were N+1 CPUs. I mean, while the GPU is working, one CPU should sit idle, and I would expect a new task to occupy that CPU. Nevertheless, times are exactly the same, so it seems that no other task is assigned to the thread waiting for the GPU to end its work, which is consistent with what I would expect. On the other hand, I thought that if I create one more thread than I have CPUs (some degree of oversubscription), I would see the impact of the GPU and obtain better times. Nevertheless, this is not true, and times remain the same :-(. Any clues about what could be happening? How does TBB manage oversubscription? I understand TBB does nothing special about oversubscription, since the per-task processing times I print increase as I increase the number of threads (more threads working, but at a slower pace).

Thanks in advance,


3 Replies
Hi Jose,

Could you please provide a sample of your code with the complete pipeline? You can make the thread private.
Does MyProcessFilter get data from other filters in the pipeline? Maybe this stage is not the bottleneck of your application.
The cpu_worker_pipeline — does it invoke a nested TBB pipeline, or is it just named this way?
See if your GPU API supports launching a GPU app/kernel while permitting the launching thread to continue (and some time later polling to see if the GPU is done, without stalling the main thread).

pseudo code

void gpu_worker_pipeline(void* item)
{
    // ... (copy data into GPU, launch kernel asynchronously)
    while (GPU_busy())
        sleep(0); // use sleep, not switch_to_thread(), not _mm_pause()
    // ... (retrieve data from GPU)
} // end gpu_worker_pipeline

(Note: change sleep(0) to whatever forces an actual yield to another thread on your platform.)

If your current code performs a run_GPU_kernel_and_wait(), the CUDA-supplied routine may be spinning in a compute loop doing the polling. Something like:

while(GPU_busy()) continue;
while(GPU_busy()) _mm_pause();
while(GPU_busy()) SwitchToThread();

All three of those techniques may interfere with other threads running on your system. The Sleep(0) on most (all) systems actually yields. SwitchToThread() on Windows (or possibly yield() on Linux) tends to switch to threads that had been preempted while in the run state, but will not switch to a thread that had been waiting on an event (e.g. file I/O) whose event has just occurred. I cannot say whether Linux/Mac suffers the same symptom. It is easy enough to stick in the Sleep(0), see what happens, then try SwitchToThread() or yield().

Jim Dempsey
Thanks both for responding, and sorry for the delay (I did not receive the mail :-S).

That stage is the bottleneck in my pipeline, as I forced it to be. There are no nested or other "strange" structures, just a single pipeline invoking some CUDA code. I am now trying the "cudaDeviceScheduleYield" flag, but it seems to negatively impact overall performance.

It is a pity that you cannot exploit an accelerator's benefits like an extra CPU :-(. Nevertheless, as everything is CPU-driven, I am afraid nothing else can be done. I have tried sleeping threads, changing priorities, etc., but nothing seems to work. The only workaround is to artificially decrease the load in one of the pipeline stages so the CUDA one has more time to process... but this would not be the real scenario, I am afraid, and can only be used for testing...

Thanks a lot,
