Maximum number of processor cores supported by TBB
Does anybody know the maximum number of processor cores supported by TBB? We ran a trial on a 64-core machine with 1.5 TB of shared memory, and it was found that after utilizing 20-25 cores the program did not become any more efficient (quicker) when adding more processors. Does TBB have an upper limit?
This can probably be attributed to the inherent scalability of your test. Purely CPU-bound loads should not saturate at this number of cores, but there are many reasons why programs are not perfectly scalable, sometimes as simple as using up memory bandwidth. It depends on your program: have you seen better results with another toolkit?
The machine was running a customized SUSE distribution, and we had only 2 days to execute our program. What it basically does is calculate some features of an input image set; the set comprises 250 images. What we did was use parallel_for and call the feature-calculating function from it. What could be the limiting factor? I think 250 images are enough to utilize the cores.
Is each region of an image processed all at once, or could this be a case of cache thrashing (sic?)? Just a wild idea (you do the math): if you cannot reorganise the work around image regions but still have sufficient intra-image parallelism (serial invocations of parallelisable work), you might set aside the advice to parallelise at the highest level (images) and instead process groups of images serially, with groups small enough that they fit in the cache together (possibly one image per group). This should be simple enough to just try. Does this make sense at all, and would it be applicable in this case?
"I am thinking that auto_partitioner is not creating enough grain size to distribute task." Maybe you meant "could it be that ..."? There's no indication for such an assumption, as you could have verified by substituting simple_partitioner, by logging chunk sizes (a chunk being a subrange passed to a Body invocation), or by watching processor use. You would need over 256 images before getting any chunks of more than 1 image from auto_partitioner with 64 processors, and even then it would take quite unevenly distributed processing time across images, plus maybe some bad luck, before that by itself would degrade scalability to such an extent (extremely unevenly distributed processing time would even be a problem with simple_partitioner, and with auto_partitioner less unevenly distributed processing time would only be problematic if occurring in clusters picked up by larger chunks).
"Another questions - How can I set the number of processors to use (utilize) in TBB?" See task_scheduler_init.
I played with task_scheduler_init. But regardless of the value I pass to task_scheduler_init, the blocked_range& r received in operator() remains the same. The size of the chunk should change with the value specified to task_scheduler_init, right?
How do I know that task_scheduler_init has set the exact number of threads I need?
"The size of chunk should change with the value specified by task_scheduler_init right?" With auto_partitioner you should start seeing chunks of more than 1 image with somewhat fewer than 64 threads (for 250 input images). Maybe 60 might do it; it depends on which side gets the biggest part when the subrange to be split is not of even size.
"How do I know that task_scheduler_init has set the exact number of threads I need?" Use your favourite system monitor.
The chunk size was 12 no matter how I changed task_scheduler_init. I am using TBB version 3. Does it have a bug?
But I don't see what you would do with that knowledge.
Well, if I knew exactly what would happen, it would be easier for me to develop my application. Our application is in the benchmarking process, so I need to check performance vs. the number of utilized cores, and I need to get timing details with different numbers of cores.
I read some papers submitted to IEEE by the Intel TBB team. They mention a P x V multiplication, where P is the number of cores and V is a constant set to 4. I am not sure if it is still valid for TBB v3.
"chunk size was 12 despite how I change the task_scheduler_init. I am using TBB version 3. Does it have a bug?" Now you make me doubt (see below)...
"Well, if I knew exactly what would happen, it would be easier for me to develop my application." Since you would have extremely limited parallel overhead related to chunk size at this level, simple_partitioner and grainsize will reliably allow you to directly control chunk size: the input range will be split recursively while subranges are larger than grainsize, so "grainsize" is a bit misleading in my opinion, as it will be the upper limit for chunk size, with the lower limit about grainsize/2 for large-enough input range size.
"I read some papers submitted to IEEE by the Intel TBB team. They mention a P x V multiplication, where P is the number of cores and V is a constant set to 4. I am not sure if it is still valid for TBB v3." Hmm, 96 divided by 4x4=16 is still only 6, so I don't see how you would get to chunk size 12... maybe somebody from the TBB team could shed some light on what always was and what might have changed? Somehow I was thinking of an initial factor 16, but I'm not sure anymore, and since the new values don't make sense either it's probably easier to just ask (sorry).
Apparently the factor inside the divisor for the initial division is indeed 4, not 16. I remember knowing that :-), so how did 16 slip in there? Please also see my attempts to correct #6, #8 and #10. Maybe this is also a good time to review what has happened to auto_partitioner recently, and whether the problems reported with that revision have been