We are trying to use parallel_for for a loop that gets called many, many times. We implemented it and we are now pegging all 4 cores of a quad-core PC (Intel S5000VSA motherboard, of course), but it's 10 times slower!
Using VTune, I see the following.
Threads (inside OurExecutable.exe):
Thread      Process             Timer%
Thread131   OurExecutable.exe   52.93%
Thread125   OurExecutable.exe   46.52%
Modules (in either of the above threads; below is one of them):
Module       Process             Timer%
tbb.dll      OurExecutable.exe   65.31%
OurDLL.dll   OurExecutable.exe   17.65%
It seems like most of the time is spent inside tbb.dll?
Any thoughts? Maybe we are trying to parallelize a loop that is already very tight, even though it is called a great many times.
We were hoping we could optimize using parallel_for, but maybe we are not using it right or we are not implementing it correctly.
Any help would be greatly appreciated.
A TBB developer might be able to make better use of the numbers you provide (but also of your answers to the questions above).
Based only on the numbers you presented, the best guess I could make is that your inner loop is just too small to be efficiently parallelized. Most of the time is spent in spawning, stealing, and waiting - which is a sign that TBB worker threads cannot find any useful work. Also, the ratio of time spent in TBB vs. in your DLL suggests that the work is too fine-grained, so overhead dominates.
I would suggest trying to parallelize the outer loop instead, if possible, so that each task carries enough work to amortize the scheduling overhead.
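To illustrate what parallelizing the outer loop means, here is a minimal sketch. It deliberately uses plain std::thread rather than TBB so that it compiles without the library. With TBB, the same idea would be a single tbb::parallel_for over a tbb::blocked_range of outer indices. The names inner_loop, process_serial and process_outer_parallel, and the row-of-doubles data layout, are hypothetical stand-ins, since the original code was not posted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the tight inner loop: cheap work per element.
void inner_loop(std::vector<double>& row) {
    for (double& x : row) {
        x = x * 2.0 + 1.0;
    }
}

// Serial reference: the outer loop calls the tight inner loop many times.
void process_serial(std::vector<std::vector<double>>& rows) {
    for (auto& row : rows) {
        inner_loop(row);
    }
}

// Coarse-grained version: split the OUTER loop into one contiguous chunk
// per thread. Each thread runs the inner loop serially on its own rows,
// so the parallel overhead is paid once per chunk, not once per inner call.
void process_outer_parallel(std::vector<std::vector<double>>& rows,
                            unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = rows.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / threads)

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&rows, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                inner_loop(rows[i]);  // rows are disjoint, so no locking is needed
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```

This only works if the outer iterations are independent of each other. If they are not, the alternative is to keep parallel_for on the inner loop but make each task larger, for example by raising the grain size of the range.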