Intel® oneAPI Threading Building Blocks

Task arena : Unexpected thread / worker distribution

FlorentD
Beginner
2,195 Views

Hi everyone,

 

I think there may be a bug in how tbb::task_arena distributes threads. Look at this small piece of code:

 

 

#include <atomic>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>
#include <tbb/tbb.h>

int main()
{
    std::cout << "tbb::info::default_concurrency() = " << tbb::info::default_concurrency() << std::endl;

    // Arena limited to 6 threads (the calling thread plus up to 5 workers)
    tbb::task_arena arena(6);
    std::atomic<size_t> magic = {0};

    arena.execute([&]{
        tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000000000),
            [&](const tbb::blocked_range<size_t>& r) {
            // Dummy work; b and running_total are local to each invocation
            std::vector<int> b;
            int running_total = 23;
            for (size_t i = r.begin(); i < r.end(); i++)
            {
                running_total = 37 * std::sin(running_total) + std::cos(i) + i;
            }
            b.push_back(running_total);
            // magic is the only shared variable, and it is atomic
            magic += running_total;
        });
    });

    std::cout << "Finish " << magic << std::endl;
    return 0;
}

 

 

On my computer, I have 6 cores and 12 threads (Intel i7-9750H).

 

This code limits the number of threads / workers to 6, so I expected CPU usage of about 50%, because I have 12 available threads.

However, the CPU usage shows 78% and I don't understand why.

tbb.png

 

I am using Windows 10 (64-bit).

 

Is it a bug?

 

Thanks,

 

SeshaP_Intel
Moderator
2,158 Views

Hi,


Thank you for posting in Intel Communities.


Could you please confirm which environment you are running the code from, i.e., Visual Studio or the Intel Command Prompt?


If you are using Visual Studio to run the code, could you please try the Intel Command Prompt (x64) and let us know the results?


Thanks,

Pendyala Sesha Srinivas


FlorentD
Beginner
2,144 Views

Hi,

 

I am running the code from CodeBlocks (with the GCC compiler), but the same problem occurs with Visual Studio.

 

What do you mean by "Intel Command Prompt(X64)"? I don't have this command prompt installed on my computer, but I have the Windows Command Prompt (cmd.exe), and the issue is still there, as you can see in this screenshot.

 

tbb_bug.png

I think there is a bug in the TBB library?

SeshaP_Intel
Moderator
2,111 Views

Hi,


We are working on your issue internally. We will get back to you soon.


Thanks and Regards,

Pendyala Sesha Srinivas


FlorentD
Beginner
2,100 Views

Thanks, please keep me posted as soon as possible.

 

This issue is blocking my project because it drastically degrades performance when using 2 task arenas limited to 6 threads each (on a CPU with 12 available threads).

Mark_L_Intel
Moderator
2,080 Views

Hello,


  • Your sample has a data race: running_total is a shared variable but is being updated in parallel_for. It looks like you'd like to do a reduction, and we have tbb::parallel_reduce for that. This is a good example for starters:

https://chryswoods.com/parallel_c++/parallel_reduce.html


  • Performance and CPU utilization are affected by a variety of things besides the concurrency levels set in arenas, e.g., the partitioner, the grainsize, and whether each thread is given enough work per iteration.


  • The Pro TBB book, available for free here, is a good reference, e.g., Chapter 16:


https://library.oapen.org/handle/20.500.12657/22838


  • It is not unusual to see different CPU utilization on machines with different CPUs and different CPU frequencies.
  • It sounds like you want to understand the performance of TBB parallel operations. I'd recommend VTune for that as well, especially since with tiny examples like these the results should be easy to collect.




FlorentD
Beginner
2,068 Views

Hi,

 

There is no data race in this sample. The variable running_total is not shared; it is private inside the parallel_for body. The only shared variable is magic, and it is atomic, so there is no data race. Is that correct?

 

I know that partitioners and grainsize affect performance, but I just want to limit my arena to half of my threads (6 on my CPU, because it has 12 available threads). So I should see about 50% CPU usage.

 

How can I do that ?

 

Thanks for the book reference and VTune

Michael_V_Intel
Employee
2,052 Views

I agree that there is no data race.

 

One thing you might try, though, is global_control. A task_arena limits the number of threads that can simultaneously execute tasks submitted to that specific arena. A global_control object, however, limits the total number of threads available to the TBB library. TBB has likely created 11 worker threads, and your explicit arena is using 5 of them (5 workers plus your main thread). It is surprising to see such high CPU utilization, but it would be interesting to see whether using global_control fixes the issue for you.

FlorentD
Beginner
2,035 Views

Hi,

 

I get the same issue with global_control and max_allowed_parallelism set to 6:

 

tbb::global_control control(tbb::global_control::parameter::max_allowed_parallelism, 6);
std::cout << "tbb::global_control::parameter::max_allowed_parallelism = " << control.active_value(tbb::global_control::parameter::max_allowed_parallelism) << std::endl;

 

 

The CPU usage is still high at about 78%.


Sorry, but I think there is a bug in the TBB library with my CPU (Intel i7-9750H with 6 cores).

 

I have another computer with 8 available threads (Intel i7-7700K with 4 cores), and there a global_control or a limited task_arena seems to work:

  • with global_control / task_arena limited to 1 -> CPU usage about 13%
  • with global_control / task_arena limited to 2 -> CPU usage about 26%
  • with global_control / task_arena limited to 4 -> CPU usage about 52%
  • with global_control / task_arena limited to 6 -> CPU usage about 78%
  • with global_control / task_arena limited to 8 -> CPU usage about 98%

So I guess the issue may be my CPU, or all CPUs with 6 cores, or all CPUs whose number of cores is NOT a power of 2?

 

What is wrong? Can you investigate on a CPU with 6 cores? I hope one of the TBB developers has this kind of CPU.
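
As a point of comparison for the percentages listed above, here is a minimal sketch of the arithmetic behind the expected figure. It assumes each active thread keeps exactly one logical processor fully busy and that nothing else is running on the machine; the function name is hypothetical.

```cpp
#include <algorithm>

// Hypothetical helper (not from the thread): system-wide CPU usage, in
// percent, expected when `active_threads` threads each keep one logical
// processor fully busy on a machine with `logical_processors` of them.
double expected_utilization_percent(unsigned active_threads,
                                    unsigned logical_processors) {
    if (logical_processors == 0) {
        return 0.0;
    }
    // No more logical processors can be busy than the machine has.
    const unsigned busy = std::min(active_threads, logical_processors);
    return 100.0 * busy / logical_processors;
}
```

Under these assumptions, 6 active threads give 50% on a machine with 12 logical processors and 75% on a machine with 8.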

 

Thanks,

Mark_L_Intel
Moderator
1,981 Views

You were correct about the local variables, as Mike already pointed out. Just in case, here is a reference with more details:

https://community.intel.com/t5/Intel-oneAPI-Threading-Building/Convert-parallel-for-from-OpenMP-to-TBB/td-p/1104020


I've experimented with your code on a Windows 10 Pro machine based on an i9-9900K processor. The machine has 8 cores and 16 threads (HT on). I got this data for average CPU utilization during the heavy compute in parallel_for:


# of threads    CPU util., %
 1               10
 2               18
 3               26
 4               34
 5               42
 6               50
 7               59
 8               67
 9               75
10               83
11               91
12              100


The data above seems to confirm your findings. This is with the even number of threads/cores. However, I have not observed the same behavior on Linux so far (btw, on Linux, I used a different machine with 77 cores/144 threads; I used htop with the average CPU utilization option). It might be a Windows-related issue rather than something tied to the number of threads on a specific machine. These are all preliminary experiments; I just wanted to let you know that we will discuss this issue internally and investigate further.


Mark_L_Intel
Moderator
1,904 Views

Hello,


In my previous post I used Task Manager -> Performance to get the average CPU utilization. After switching to Resource Monitor, which makes it possible to monitor only the process that runs your reproducer executable, we can no longer reproduce your issue. Evidently, in the previous experiments, some other processes were responsible for the additional CPU utilization.


As of today, with 8 threads running on a 16-thread system, we see exactly 50% CPU utilization.


The same behavior is observed on systems with 6 threads (we configured one of our systems with msconfig to have 6 cores at boot time). BTW, something happened to my ability to edit posts, so in the previous post I meant "power of two number of threads" instead of "even number of threads", but I can't change that anymore.
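
The distinction between system-wide and per-process CPU usage described above can also be checked from inside a program. The sketch below uses plain std::thread workers rather than TBB and divides the CPU time of the process by the elapsed wall-clock time. It assumes a platform where std::clock reports process CPU time summed over all threads, as on Linux; the Microsoft C runtime's std::clock measures wall-clock time instead, so on Windows a different API would be needed. The helper name is hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <cmath>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical helper (not from the thread): keeps `workers` threads busy
// for about `millis` milliseconds and returns CPU time / wall-clock time
// for this process only, i.e. the average number of logical processors
// the process kept busy. Assumes std::clock() reports process CPU time.
double average_busy_processors(unsigned workers, int millis) {
    std::atomic<double> sink{0.0};  // keeps the dummy work observable
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    const auto deadline = wall_start + std::chrono::milliseconds(millis);

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&sink, deadline] {
            double x = 23.0;
            while (std::chrono::steady_clock::now() < deadline) {
                x = 37.0 * std::sin(x) + 1.0;  // dummy work
            }
            sink.store(x);
        });
    }
    for (auto& t : pool) {
        t.join();
    }

    const double cpu_seconds =
        static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    const double wall_seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall_start).count();
    return wall_seconds > 0.0 ? cpu_seconds / wall_seconds : 0.0;
}
```

Dividing the returned value by the number of logical processors gives the share of the machine used by this process alone, which is the kind of figure a per-process view reports, whereas a system-wide graph also includes every other running process.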


FlorentD
Beginner
1,891 Views

Hi,

 

Can you share a screenshot of Resource Monitor on Windows 10 where you see 50% CPU utilization on a system with 12 threads, 6 of them active?

I have only used Task Manager to check CPU usage, and there I do not see 50%.

 

Thanks,

Mark_L_Intel
Moderator
1,882 Views

If you click the button at the bottom of Task Manager, called Open Resource Monitor, it opens another window with the Resource Monitor tool. As you can see in the screenshot, Task Manager showed CPU utilization above 50% while the executable was running (~67%), but Resource Monitor shows 50%. I guess the difference is due to other processes running at the time of execution.

cpu_util.PNG

 

 

FlorentD
Beginner
1,867 Views

Hi,

 

Indeed, it seems to work with the Resource Monitor tool on Windows. However, the "Threads" column does not display the number of threads I expected the process to use. In my case, it shows 9 out of 12, and in your screenshot it displays 11 out of 16.

 

If you confirm that there is no issue with the task arena thread distribution, that is fine with me. It is just weird that the CPU utilization in Task Manager does not show the right percentage.

 

Thanks for your help

Mark_L_Intel
Moderator
1,838 Views

The engineering team also confirms that they don't see an issue, so we will no longer be tracking this forum post from the Intel side.


FlorentD
Beginner
1,829 Views

Thanks for your help
