Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Maximum number of processor cores supported by tbb

asanka424
Beginner
944 Views
Hi all,

Does anybody know what the maximum number of processor cores supported by TBB is? We ran a trial on a 64-core machine with 1.5 TB of shared memory and found that beyond about 20-25 cores the program did not get any faster as more processors were added. Does TBB have an upper limit?

Thanks.
Asanka
0 Kudos
14 Replies
RafSchietekat
Valued Contributor III
944 Views
This can probably be attributed to the inherent scalability of your test. Purely CPU-bound loads should not saturate at this number of cores, but there are many reasons why programs are not perfectly scalable, sometimes as simple as using up memory bandwidth. It depends on your program: have you seen better results with another toolkit?
0 Kudos
Vladimir_P_1234567890
944 Views
hello,
The latest packages (TBB 4.0 Update 3) contain fixed support for processor groups on Windows.
To test scalability you can use the tachyon example with the balls3.dat workload, or the primes example with a large number of iterations.
As Raf mentioned, you need to have enough work to utilize the cores.
A practical limit on the number of cores has not been found yet :) 64 cores are on the list for sure.
--Vladimir
0 Kudos
asanka424
Beginner
944 Views
Hi Vladimir,

The machine was running a customized SUSE distribution, and we had only 2 days to execute our program. What it basically does is calculate some features of an input image set. The image set comprises 250 images. We used parallel_for and called the feature-calculating function from it. What could be the limiting factor? I think 250 images are enough to utilize the cores.

Asanka
0 Kudos
RafSchietekat
Valued Contributor III
944 Views
Is each region of an image processed all at once, or could this be a case of cache thrashing? Just a wild idea (you do the math): if you cannot reorganise the work around image regions but still have sufficient intra-image parallelism (serial invocations of parallelisable work), you might set aside the advice to parallelise at the highest level (images) and instead process groups of images serially, with groups small enough that they fit in the cache together (possibly one image per group). This should be simple enough to just try. Does this make sense at all, and would it be applicable in this case?
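To make that concrete, something along these lines (a hypothetical sketch with one image per group, assuming the per-row work inside an image can be parallelised; Image, process_row and the row count are placeholders, not anything from your code):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

struct Image { size_t rows; /* pixel data */ };               // placeholder
void process_row(const Image&, size_t /*row*/) { /* ... */ }  // placeholder

struct RowBody {
    const Image* img;
    explicit RowBody(const Image& i) : img(&i) {}
    void operator()(const tbb::blocked_range<size_t>& r) const {
        // Parallel work within a single image: one row per iteration.
        for (size_t row = r.begin(); row != r.end(); ++row)
            process_row(*img, row);
    }
};

void process_all_grouped(const std::vector<Image>& images) {
    // Serial outer loop: one image (one "group") at a time, so its working
    // set stays in cache; the parallelism comes from inside each image.
    for (size_t i = 0; i != images.size(); ++i) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, images[i].rows),
                          RowBody(images[i]));
    }
}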
0 Kudos
asanka424
Beginner
944 Views
Processing of each image is independent. So the parallel_for looks like

parallel_for(range, functor, auto_partitioner). Each image is processed separately. I am thinking that auto_partitioner is not creating a suitable grain size to distribute the tasks.
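In outline, the code is roughly like this (a simplified sketch, not our real code; Image and compute_features stand in for the actual types):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

struct Image { /* pixel data */ };                 // placeholder
void compute_features(const Image&) { /* ... */ }  // placeholder

struct FeatureBody {
    const std::vector<Image>* images;
    explicit FeatureBody(const std::vector<Image>& imgs) : images(&imgs) {}
    void operator()(const tbb::blocked_range<size_t>& r) const {
        // Each invocation receives a chunk of image indices.
        for (size_t i = r.begin(); i != r.end(); ++i)
            compute_features((*images)[i]);
    }
};

void process_all(const std::vector<Image>& images) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, images.size()),
                      FeatureBody(images), tbb::auto_partitioner());
}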

Another question: how can I set the number of processors to use (utilize) in TBB?

Thanks,
Asanka
0 Kudos
RafSchietekat
Valued Contributor III
944 Views
"I am thinking that auto_partitioner is not creating enough grain size to distribute task."
Maybe you meant "could it be that ..."? There's no indication for such an assumption, as you could have verified by substituting simple_partitioner, by logging chunk sizes (a chunk being a subrange passed to a Body invocation), or by watching processor use. You would need over 256 images before getting any chunks of more than 1 image from auto_partitioner with 64 processors, and even then it would take quite unevenly distributed processing time across images, plus maybe some bad luck, before that by itself degraded scalability to such an extent (extremely unevenly distributed processing time would be a problem even with simple_partitioner, and with auto_partitioner less unevenly distributed processing time would only be problematic if it occurred in clusters picked up by larger chunks).

"Another questions - How can I set the number of processors to use (utilize) in TBB?"
See task_scheduler_init.
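For example, a minimal sketch (the constructor argument caps the number of worker threads TBB will use; the default is task_scheduler_init::automatic):

#include "tbb/task_scheduler_init.h"

int main() {
    // Use at most 8 threads for as long as `init` is alive.
    tbb::task_scheduler_init init(8);
    // ... run the parallel_for benchmark here ...
    return 0;
}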

(2012-04-24 after #12: Correction attempt.)
0 Kudos
asanka424
Beginner
944 Views

Thanks Raf,

I played with task_scheduler_init. But regardless of the value I pass to task_scheduler_init, the blocked_range& r received by operator() remains the same. The chunk size should change with the value specified in task_scheduler_init, right?

How do I know that task_scheduler_init has set the exact number of threads I need?

Thanks.
Asanka
0 Kudos
RafSchietekat
Valued Contributor III
944 Views
"The size of chunk should change with the value specified by task_scheduler_init right?"
With auto_partitioner you should start seeing chunks of more than 1 image with somewhat fewer than 64 threads (for 250 input images). Maybe 60 might do it; it depends on which side gets the bigger part when a subrange to be split is not of even size.

"How do I know that task_scheduler_init has set the exact number of threads I need?"
Use your favourite system monitor.

(2012-04-24 after #12: Correction attempt.)
0 Kudos
asanka424
Beginner
944 Views
So if I have
task_scheduler_init init(4);

and my input range is 96, then what should be the minimum size of the chunk? Would it be 12 or 6? Let's say we use auto_partitioner.

0 Kudos
RafSchietekat
Valued Contributor III
944 Views
"what should be the minimum size of the chunk"
You would see chunk sizes of 6 and 1 (nothing in-between), I think, but I don't see what you would do with that knowledge.
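(If you want to check for yourself, a quick way is to log r.size() from the Body; a sketch along these lines, with the 96-iteration range and 4 threads from your example:)

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <cstdio>

struct LoggingBody {
    void operator()(const tbb::blocked_range<size_t>& r) const {
        // Each invocation of the Body corresponds to one chunk.
        std::printf("chunk [%lu, %lu) size %lu\n",
                    (unsigned long)r.begin(), (unsigned long)r.end(),
                    (unsigned long)r.size());
    }
};

int main() {
    tbb::task_scheduler_init init(4);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 96),
                      LoggingBody(), tbb::auto_partitioner());
    return 0;
}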

(2012-04-24 after #12: Correction attempt.)
0 Kudos
asanka424
Beginner
944 Views
"You would see chunk sizes of 2 and 1"

The chunk size was 12 no matter how I changed task_scheduler_init. I am using TBB version 3. Does it have a bug?

"but I don't see what you would do with that knowledge."

Well, if I know exactly what will happen, then it is easier for me to develop my application. Our application is in the benchmarking process, so I need to check performance versus the number of utilized cores, and I need timing details with different numbers of cores.

I read some papers submitted to IEEE by the Intel TBB team. They mention a P x V product, where P is the number of cores and V is a constant set to 4. I am not sure whether that is still valid for TBB v3.


Thanks for providing this info.
0 Kudos
RafSchietekat
Valued Contributor III
944 Views
"chunk size was 12 despite how I change the task_scheduler_init. I am using TBB version 3. Does it have a bug?"
Now you make me doubt (see below)...

"Well.. If I know exactly what would happen then it is easier me to develop my application."
Since you would have extremely limited parallel overhead related to chunk size at this level, simple_partitioner and grainsize will reliably allow you to control chunk size directly: the input range will be split recursively while subranges are larger than grainsize, so "grainsize" is a bit misleading in my opinion, as it will be the upper limit for chunk size, with the lower limit about grainsize/2 for a large-enough input range.
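For instance, a sketch (assuming the same kind of per-image Body as before): with grainsize 12, chunks will be at most 12 and roughly at least 6 iterations, independent of the thread count.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct ImageBody {   // placeholder Body: one image per index, as before
    void operator()(const tbb::blocked_range<size_t>& r) const { /* ... */ }
};

void run(size_t n_images) {
    // Splitting continues only while a subrange is larger than the grainsize,
    // so the final chunk sizes fall between about grainsize/2 and grainsize.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n_images, /*grainsize=*/12),
                      ImageBody(), tbb::simple_partitioner());
}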

"I read some papers sumbitted to ieee by intel tbb team. It says about P x V multiplication where P is number of cores and V is a constant set to 4. I am not sure if it still valid for TBB v 3."
Hmm, 96 divided by 4x4=16 is still only 6, so I don't see how you would get to a chunk size of 12... maybe somebody from the TBB team could shed some light on what it always was and what might have changed? Somehow I was thinking of an initial factor of 16, but I'm not sure anymore, and since the new values don't make sense either, it's probably easier to just ask (sorry).
0 Kudos
RafSchietekat
Valued Contributor III
944 Views
Apparently the factor inside the divisor for the initial division is indeed 4, not 16. I remember knowing that :-), so how did 16 slip in there? Please also see my attempts to correct #6, #8 and #10. Maybe this is also a good time to review what has happened to auto_partitioner recently, and whether the problems reported with that revision have been addressed?
0 Kudos
asanka424
Beginner
944 Views
I raised this question with Intel technical support and am waiting for their response. Let's wait and see.
0 Kudos