Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB wait_for_all

patf2000
Beginner
Hello,

We are trying to use parallel_for for a loop that gets called many, many times. We implemented it and are now pegging all 4 cores of a quad-core PC (Intel S5000VSA motherboard, of course), but the code runs 10 times slower!

Using VTune, I see the following:

Processes

Process Timer%
pid_0x0 83.14%
OurExecutable.exe 15.35%

Threads (Inside OurExecutable.exe)

Thread Process Timer%
Thread131 OurExecutable.exe 52.93%
Thread125 OurExecutable.exe 46.52%

Modules (in either of the above threads; below is one of them)

Module Process Timer%
tbb.dll OurExecutable.exe 65.31%
OurDLL.dll OurExecutable.exe 17.65%

Inside tbb.dll

Name Timer%
wait_for_all 49.60%
spawn 15.92%
steal_task 15.63%
allocate 4.59%
allocate 3.46%

It seems that most of the time is spent inside tbb.dll.
Any thoughts? Maybe we are trying to parallelize a loop that is already very tight, even though it is called many, many times.

We were hoping we could optimize using parallel_for, but maybe we are not using it right or we are not implementing it correctly.

Any help would be greatly appreciated.

Thanks
Pat
2 Replies
RafSchietekat
Valued Contributor III
What grain size are you using? If it is too small, does increasing it (grow by factors of 10) improve performance? Do you use TBB near the inner loop (larger overhead) or near the outer loop (better)?
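
To make the grain size question concrete, here is a rough model of the effect, not TBB code. It assumes the range is split the way tbb::blocked_range decides divisibility (a range is split in half while its size is larger than the grain size) and that splitting continues all the way down, as it would with simple_partitioner. The function name count_leaf_chunks is made up for this illustration. The default auto_partitioner usually stops splitting earlier, so treat the result as an upper bound on the number of chunks.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model only (hypothetical helper, not part of TBB).
// Counts how many leaf chunks result when a range of n iterations is
// recursively halved for as long as it is larger than the grain size.
std::size_t count_leaf_chunks(std::size_t n, std::size_t grain) {
    assert(grain >= 1);                // a grain size of 0 is not meaningful
    if (n <= grain) return 1;          // no longer divisible: runs as one chunk
    const std::size_t left = n / 2;    // split roughly in the middle
    return count_leaf_chunks(left, grain) + count_leaf_chunks(n - left, grain);
}
```

Under this model, a 1000-iteration loop becomes 1000 chunks with grain size 1, 128 chunks with grain size 10, and 16 chunks with grain size 100. Each chunk carries some fixed scheduling cost (allocation, spawning, possibly stealing), so when the loop body is tiny, a larger grain size means that cost is paid far less often.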

A TBB developer might be able to make better use of the numbers you provide (but also of your answers to the questions above).
Alexey-Kukanov
Employee

Based only on the numbers you presented, the best guess I could make is that your inner loop is just too small to be efficiently parallelized. Most of the time is spent in spawning, stealing, and waiting - which is a sign that TBB worker threads cannot find any useful work. Also, the ratio of time spent in TBB vs. in your DLL suggests that the work is too fine-grained, so overhead dominates.

I suggest trying to parallelize the outer loop instead, if possible.
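
As a sketch of what moving the parallelism outward looks like, the example below uses plain std::thread rather than TBB so that it compiles with only the standard library. The workload (scaling the rows of a matrix) and the names scale_row and scale_matrix_outer_parallel are invented for illustration, since the original loop was not posted. With TBB, the same shape would be a single parallel_for over the row range with the inner loop left serial.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical tight inner loop: too little work to be worth splitting.
void scale_row(std::vector<double>& row, double factor) {
    for (double& x : row) x *= factor;
}

// Splits the OUTER loop into one contiguous block of rows per thread,
// so the cost of starting and joining workers is paid once per call
// instead of once per row.
void scale_matrix_outer_parallel(std::vector<std::vector<double>>& rows,
                                 double factor, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = rows.size();
    const std::size_t block = (n + num_threads - 1) / num_threads;  // ceil(n / threads)
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += block) {
        const std::size_t end = std::min(n, begin + block);
        workers.emplace_back([&rows, factor, begin, end] {
            // Each worker runs the inner loop serially on its own rows.
            for (std::size_t i = begin; i < end; ++i) scale_row(rows[i], factor);
        });
    }
    for (std::thread& t : workers) t.join();
}
```

The point of the design is that each worker gets a large block of independent work, so scheduling overhead is small relative to the useful computation.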
