#10 "If this is a simple array copy/addition/scale function then likely 4 threads per socket would swamp the memory bandwidth."
And Dmitriy's point in #4?
But maybe I wasn't really exaggerating before in #8 about not having nearly enough information (see the questions in #6). Giving hypothetical replies may not be the best use of anyone's time until the original poster, who seems to have a preconceived idea about where the problem lies, provides access to what we need to know to say anything meaningful.
[cpp]#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
using namespace tbb;
#include
The assertion "the calculation of each element in the vector/matrix is the same and constant" is generally not enough to guarantee equal load. Cache misses, page faults, and interrupts can create unequal work for cores. Many years back when I was prototyping TBB on a 32-way Altix system, I observed that even for something as simple as a matrix multiply, differences in cache misses made dynamic load balancing pay off in some cases.
The intended usage model of TBB is not to think about the number of cores, but to amortize the scheduling overhead of each chunk of work reasonably. For example, find chunk sizes such that about 5% of the time is spent on parallel scheduling overhead. The scheduling overhead per chunk is roughly independent of the number of processors, so a chunk size chosen this way should remain good across different processor counts.
For the detailed workings of the partitioners, see header file "tbb/include/partitioner.h". Method auto_partitioner::partition_type::should_execute_range has the basic subdivide-on-steal logic. The logic in affinity_partitioner::partitioner_type is similar, albeit obscured somewhat by the affinity logic.
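For readers who do not want to dig through the header, here is a heavily simplified, self-contained sketch of the subdivide-on-steal idea (this is NOT the actual TBB source; the real logic also bounds how many extra splits a steal triggers): a range obtained by stealing keeps splitting while it is divisible, whereas a range taken from the local deque executes as one piece.

```cpp
#include <vector>

struct Range {
    long begin, end;
    bool divisible(long grain) const { return end - begin > grain; }
};

// Sketch of subdivide-on-steal: a stolen range is split further before
// executing; one half stays local, the other half is treated as if it
// were stolen again. Appends the chunks that actually get executed.
void run(Range r, bool stolen, long grain, std::vector<Range>& out) {
    if (stolen && r.divisible(grain)) {
        long mid = (r.begin + r.end) / 2;
        run({r.begin, mid}, /*stolen=*/false, grain, out); // keep half locally
        run({mid, r.end}, /*stolen=*/true, grain, out);    // other half splits on
    } else {
        out.push_back(r); // execute the chunk as one piece
    }
}
```

The effect is that splitting happens only where load imbalance actually shows up (a steal), so a well-balanced run pays almost no subdivision overhead.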