asanka424
Beginner
53 Views

Maximum number of processor cores supported by tbb

Hi all,

Does anybody know the maximum number of processor cores supported by TBB? We ran a trial on a 64-core, 1.5 TB shared-memory computer, and we found that after utilizing 20-25 cores, adding more processors did not make the processes any more efficient (quicker). Does TBB have an upper limit?

Thanks.
Asanka
14 Replies
RafSchietekat
Black Belt

This can probably be attributed to the inherent scalability of your test. Purely CPU-bound loads should not saturate at this number of cores, but there are many reasons why programs are not perfectly scalable, sometimes as simple as using up memory bandwidth. It depends on your program: have you seen better results with another toolkit?

Hello,
The latest packages (TBB 4.0 Update 3) contain fixed support for processor groups on Windows.
To test scalability you can use the tachyon example with the balls3.dat workload, or the primes example with a large number of iterations.
As Raf mentioned, you need to have enough work to utilise the cores.
A practical limit on the maximum number of cores has not been found yet :) 64 cores are on the list for sure.
--Vladimir
asanka424
Beginner

Hi Vladimir,

The machine was running a customized SUSE distribution, and we had only 2 days to execute our program. What it basically does is calculate some features of an input image set; the set comprises 250 images. We used parallel_for and called the feature-calculating function from it. What could be the limiting factor? I think 250 images are enough to utilize the cores.

Asanka
RafSchietekat
Black Belt

Is each region of an image processed all at once, or could this be a case of cache thrashing (sic?)? Just a wild idea (you do the math): if you cannot reorganise the work around image regions but still have sufficient intra-image parallelism (serial invocations of parallelisable work), you might set aside the advice to parallelise at the highest level (images) and instead process groups of images serially, with groups small enough that they fit in the cache together (possibly one image per group). This should be simple enough to just try. Does this make sense at all, and would it be applicable in this case?
asanka424
Beginner

Processing of each image is independent, so the parallel_for looks like:

parallel_for(range, functor, auto_partitioner()). Each image is processed separately. I am thinking that auto_partitioner is not creating enough grain size to distribute the tasks.

Another question: how can I set the number of processors to use (utilize) in TBB?

Thanks,
Asanka
RafSchietekat
Black Belt

"I am thinking that auto_partitioner is not creating enough grain size to distribute the tasks."
Maybe you meant "could it be that ..."? There's no indication for such an assumption, as you could have verified by substituting simple_partitioner, logging chunk sizes (a chunk being a subrange passed to a Body invocation), or watching processor use. You would need more than 256 images before getting any chunks of more than 1 image from auto_partitioner with 64 processors, and even then it would take quite unevenly distributed processing time across images, plus maybe some bad luck, before that by itself would degrade scalability to such an extent. (Extremely unevenly distributed processing time would be a problem even with simple_partitioner; with auto_partitioner, less unevenly distributed processing time would only be problematic if it occurred in clusters picked up by larger chunks.)

"Another question: how can I set the number of processors to use (utilize) in TBB?"
See task_scheduler_init.

(2012-04-24 after #12: Correction attempt.)
asanka424
Beginner


Thanks Raf,

I played with task_scheduler_init, but regardless of the value I set, the blocked_range& r received in operator() remains the same. The size of the chunk should change with the value specified by task_scheduler_init, right?

How do I know that task_scheduler_init has set the exact number of threads I need?

Thanks.
Asanka
RafSchietekat
Black Belt

"The size of the chunk should change with the value specified by task_scheduler_init, right?"
With auto_partitioner you should start seeing chunks of more than 1 image with somewhat fewer than 64 threads (for 250 input images). Maybe 60 might do it; it depends on which side gets the bigger part when the subrange to be split is not of even size.

"How do I know that task_scheduler_init has set the exact number of threads I need?"
Use your favourite system monitor.

(2012-04-24 after #12: Correction attempt.)
asanka424
Beginner

So if I have
task_scheduler_init init(4);

and my input range is 96, then what should be the minimum size of the chunk? Would it be 12 or 6? Let's say we use auto_partitioner.

RafSchietekat
Black Belt

"what should be the minimum size of the chunk"
You would see chunk sizes of 2 and 1 (nothing in-between), I think, but I don't see what you would do with that knowledge.

(2012-04-24 after #12: Correction attempt.)
asanka424
Beginner

"You would see chunk sizes of 2 and 1,"

The chunk size was 12 no matter how I changed task_scheduler_init. I am using TBB version 3. Does it have a bug?

"but I don't see what you would do with that knowledge."

Well, if I know exactly what will happen, it is easier for me to develop my application. Our application is in a benchmarking process, so I need to check performance vs. the number of utilized cores, and I need to get timing details with different numbers of cores.

I read some papers submitted to IEEE by the Intel TBB team. They mention a P x V multiplication, where P is the number of cores and V is a constant set to 4. I am not sure if it is still valid for TBB v3.


Thanks for providing this info.
RafSchietekat
Black Belt

"The chunk size was 12 no matter how I changed task_scheduler_init. I am using TBB version 3. Does it have a bug?"
Now you make me doubt (see below)...

"Well, if I know exactly what will happen, it is easier for me to develop my application."
Since you would have extremely limited parallel overhead related to chunk size at this level, simple_partitioner with a grainsize will reliably allow you to directly control chunk size: the input range is split recursively while subranges are larger than grainsize. So "grainsize" is a bit misleading in my opinion: it is the upper limit for chunk size, with the lower limit about grainsize/2 for a large enough input range.

"I read some papers submitted to IEEE by the Intel TBB team. They mention a P x V multiplication, where P is the number of cores and V is a constant set to 4. I am not sure if it is still valid for TBB v3."
Hmm, 96 divided by 4x4=16 is still only 6, so I don't see how you would get to a chunk size of 12... maybe somebody from the TBB team could shed some light on what always was and what might have changed? Somehow I was thinking of an initial factor of 16, but I'm not sure anymore, and since the new values don't make sense either, it's probably easier to just ask (sorry).
RafSchietekat
Black Belt

Apparently the factor inside the divisor for the initial division is indeed 4, not 16. I remember knowing that :-), so how did 16 slip in there? Please also see my attempts to correct #6, #8 and #10. Maybe this is also a good time to review what has happened to auto_partitioner recently, and whether the problems reported with that revision have been addressed?
asanka424
Beginner

I raised this question with Intel technical support and am waiting for their response. Let's wait and see.