Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Why are CPU cores not evenly used by Intel TBB?

mickru
Beginner
Hi,
sorry if this looks like a cross-forum post... I think I started a thread in the wrong forum.
Anyway, we are using the commercial version of Intel TBB, version 2.2, released last August. We use it for parallel processing of captured network packets. Our application is a C++ Linux service running on SLES10-SP2 on a quad-core Xeon CPU. Our understanding is that Intel TBB ensures all CPU cores contribute evenly to processing our data, which is handled in a parallel_for loop.
A few days ago we were invited by our HW supplier to test run our service on a new IBM Blade with two Nehalem CPUs. In theory we would have had 16 cores available to our service, but the output from top showed that only two cores were being used, the same as on the quad-core Xeons we currently use. Even so, the performance gain on Nehalem was huge; we assume it would be even higher if TBB had used all 16 cores instead of just 2.
So my question is: how can we configure Intel TBB so that it really uses all available cores on the system? We thought that was exactly what Intel TBB is for: dynamically utilizing all available cores.

Any idea how we can force TBB to really use all available cores? We monitor media streams on a 1 Gb network connection at line rate, which means around 21,000 concurrent media sessions on the network. I assume this is enough data to share among 16 cores...

Any idea?

Thanks.
areid2
New Contributor I

Is there any chance that your application has hardcoded the number of threads? For example, see section 10.2 in:


http://software.intel.com/sites/products/documentation/hpc/tbb/reference.pdf
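For illustration, a minimal sketch of what to look for (hypothetical code, not taken from your application); passing a count to the constructor is what would pin the pool to a fixed size:

#include "tbb/task_scheduler_init.h"

int main() {
    // Hardcoded pool size: caps TBB at 2 worker threads regardless of
    // how many cores the machine has -- this is the pitfall to look for.
    // tbb::task_scheduler_init init(2);

    // Default: TBB sizes its pool to the number of available hardware threads.
    tbb::task_scheduler_init init;
    // ... run the parallel work ...
    return 0;
}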
Terry_W_Intel
Employee

In addition to checking if the number of threads was accidentally hardcoded to 2, also check what partitioner you are using with your parallel_for and what grainsize, if any, you are specifying. A good starting point is to go with auto_partitioner (the default) and not specify a grainsize, and then tune to your needs from there.
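For example (a hedged sketch; ProcessPackets is just a placeholder for your loop body), the recommended starting point versus an explicit grainsize with simple_partitioner would look roughly like this:

#include <cstddef>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct ProcessPackets {
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            // ... process packet/session i ...
        }
    }
};

void run(size_t n) {
    // Starting point: no grainsize, auto_partitioner (the default in TBB 2.2).
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), ProcessPackets());

    // By contrast, a large grainsize with simple_partitioner can yield too few
    // chunks to keep all cores busy:
    // tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 10000),
    //                   ProcessPackets(), tbb::simple_partitioner());
}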

Cheers,
Terry
mickru
Beginner

Hi,
thanks for the suggestions, but that doesn't solve the issue. I use
tbb::task_scheduler_init tbbInit;
as shown, in the main method before doing anything else. As you can see, I do not try to alter the number of threads used by the scheduler; I let TBB figure out what is best. Since we run a multithreaded application, every worker thread using TBB invokes the same code as well:
tbb::task_scheduler_init tbbInit;
so the same applies to any worker we create: no altering of the scheduler. In one of those worker threads we then invoke the parallel_for() code to evaluate the network data in parallel. I know that many threads are executing within the parallel_for (10...60 threads), so this does not look like a thread limitation. What we have found is that of the 16 powerful Nehalem cores, only two were used by the code. That is the issue, not a limit on the number of threads. Since I currently do not have a Nehalem system in the lab, I have to test with our old quad-core Xeon. I used the taskset command to get the CPU affinity mask of the service and it was F, which means the service is allowed to use all 4 cores; still, only 2 of them are used.
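Roughly, the structure is the following (simplified sketch, placeholder names):

#include <cstddef>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Placeholder body standing in for the evaluation of the captured network data.
struct EvaluateSessions {
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            // ... evaluate media session i ...
        }
    }
};

// Every worker thread we create does the same default init, no thread count given.
void workerThreadBody(size_t numSessions) {
    tbb::task_scheduler_init tbbInit;
    // one of these workers runs the parallel evaluation:
    tbb::parallel_for(tbb::blocked_range<size_t>(0, numSessions), EvaluateSessions());
}

int main() {
    tbb::task_scheduler_init tbbInit;   // default init in main, before anything else
    // ... create our own worker threads, each running workerThreadBody(...) ...
    return 0;
}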
Does this shed more light on it? I hope you can offer more things to investigate, because I had hoped that TBB could be configured to make better use of all CPU cores.

Thanks.
Terry_W_Intel
Employee
Hi,
What is the range you are passing to parallel_for? Are you specifying a grainsize? What partitioner are you using?
Does every iteration do about the same amount of work, or is there a load imbalance?

The default behavior when you do not specify the number of threads is for TBB to use as many threads as there are hardware threads on the machine, so it should utilize all cores, provided it has enough work to distribute among the threads. If the problem isn't a poor decomposition in the parallel_for, then it could be a severe load imbalance. You could try timing each subrange that the parallel_for computes. Do you know how many chunks the parallel_for is breaking the full range into? What is the 10-60 number you mention -- did you mean chunks (each of which is computed by a TBB 'task') rather than threads? On your 16-core machine (assuming no HT) there would be at most 16 threads created by TBB, and the parallel_for should break your loop into multiple tasks to be executed by those 16 threads. You should be able to count and time the tasks pretty easily (inside your operator() or lambda).
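As a sketch of that counting/timing idea (the names here are made up; only the TBB calls shown are real API), wrapping the body like this would print how many chunks the range is split into and how long each one takes:

#include <cstddef>
#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/atomic.h"
#include "tbb/tick_count.h"

static tbb::atomic<int> chunkCount;   // counts the chunks (tasks) executed

struct InstrumentedBody {
    void operator()(const tbb::blocked_range<size_t>& r) const {
        tbb::tick_count t0 = tbb::tick_count::now();
        for (size_t i = r.begin(); i != r.end(); ++i) {
            // ... existing per-iteration work ...
        }
        tbb::tick_count t1 = tbb::tick_count::now();
        std::printf("chunk %d: %lu iterations, %.3f ms\n",
                    ++chunkCount, (unsigned long)r.size(),
                    (t1 - t0).seconds() * 1e3);
    }
};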

If there are plenty of tasks to keep 16 threads busy, and each task takes about the same time, then we have a mystery to figure out. :) But hopefully, it will lead to some insight about why you aren't getting good core utilization.

Hope this helps!
Terry
areid2
New Contributor I
I think Terry is on the right track with the load imbalance idea. I have a dual Nehalem system with 8 cores total (16 logical CPUs with HT) and TBB does use 16 worker threads on this system. You can check with task_scheduler_init::default_num_threads(). My system is Vista, but I'd be surprised if this is different on Linux.
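A minimal check you can run on the target machine:

#include <iostream>
#include "tbb/task_scheduler_init.h"

int main() {
    // Number of threads TBB would use by default, usually the number of logical CPUs.
    std::cout << tbb::task_scheduler_init::default_num_threads() << std::endl;
    return 0;
}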

I'm not certain how many threads your application ends up running at any given time. You mention that you create several of your own worker threads and within each one you create a task_scheduler_init object. It is my understanding that no matter how many task_scheduler_init objects you create in your own threads, TBB will create default_num_threads()-1 worker threads total for the process. The minus one is because the thread calling parallel_for also participates as a worker. So if only one of your worker threads is calling parallel_for at any given time, there should be exactly 16 threads working on that parallel_for. If 10 of your threads are calling parallel_for at the same time, then there will be 25 threads executing the various parallel_for calls. Maybe in this case the TBB scheduler blocks some of the threads so only 16 are active at a time, I'm not sure. I could imagine that the latter case is more prone to load imbalance.

The other potential reason for load imbalance could be IO operations. Are the threads performing IO operations inside the parallel_for loops? If this is the case, the threads may be blocking while the hardware performs the IO. This is one case where you might find better performance by increasing the number of TBB threads over the default.
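If IO blocking does turn out to be the cause, the tweak would be along these lines (hypothetical sketch; the 2x factor is only an example to measure against the default, not a recommendation):

#include "tbb/task_scheduler_init.h"

int main() {
    // Request more threads than hardware threads so some can run while others block on IO.
    int n = tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(2 * n);
    // ... run the parallel_for loops ...
    return 0;
}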

Sounds like this problem will be hard to debug without access to a Nehalem system.