Intel® oneAPI Threading Building Blocks

Performance issue with task_arena and task_group

diedler_f_
Beginner
421 Views

Hi,

I have a performance issue with the following code:

tbb::task_arena a(2); // arena limited to at most 2 threads
tbb::task_group dummyGroup;

dummyGroup.run([&] {
	while (veryLongTaskNotFinished) dummyTask();
});

a.execute([&] {
	// very long task that takes about 45 s to finish
	veryLongTask();
});
dummyGroup.wait();

where dummyTask is:

void dummyTask()
{
    std::vector<int> b;
    int running_total = 23;
    for (unsigned int i=0; i < 100000000; i++)
    {
        running_total = 37 * running_total + i;
    }
    b.push_back(running_total);
}

If I execute the very long task alone (without dummyTask), it takes 45 s to finish, as expected.

If I execute dummyTask concurrently with the very long task, the very long task now takes 70 s to finish!

However, my computer has 8 logical cores (4 physical cores with hyper-threading). I limited the very long task to 2 threads, dummyTask uses only 1 thread, and there is the main thread, so my example uses 4 threads in total.

I do not understand why dummyTask slows down my main task, adding an overhead of 70 - 45 = 25 s.

Thanks

 

4 Replies
Alexei_K_Intel
Employee

Hi,

It is unclear what exactly causes the slowdown. Could you share the internals of veryLongTask and show how veryLongTaskNotFinished is set? Is there any other parallelism in your application? How do you measure the execution time?
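
To illustrate the kind of detail that would help, here is a minimal sketch of one way to set the flag and measure the time. It deliberately uses only std::thread rather than TBB, so it does not reproduce the arena behaviour. standInLongTask and timeLongTaskWithBackground are hypothetical names, and the busy-wait is only a stand-in for the real veryLongTask. The flag is assumed to be a std::atomic<bool> that is cleared right after the long task returns, and only the long task itself is timed, with a monotonic clock.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for veryLongTask(): busy-waits for about `ms` milliseconds.
inline void standInLongTask(int ms) {
    assert(ms >= 0);
    const auto end = std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < end) {
        // spin; the real task would do useful work here
    }
}

// Times the stand-in long task (wall clock, in seconds), optionally while a
// background thread loops until the atomic flag is cleared.
inline double timeLongTaskWithBackground(int ms, bool withBackground) {
    std::atomic<bool> veryLongTaskNotFinished{true};
    std::atomic<unsigned long long> iterations{0};

    std::thread background;
    if (withBackground) {
        background = std::thread([&] {
            // Plays the role of the dummyTask() loop in the original post.
            while (veryLongTaskNotFinished.load(std::memory_order_acquire))
                iterations.fetch_add(1, std::memory_order_relaxed);
        });
    }

    const auto start = std::chrono::steady_clock::now();
    standInLongTask(ms);
    const auto stop = std::chrono::steady_clock::now();

    // Clear the flag only after the long task has returned, then join.
    veryLongTaskNotFinished.store(false, std::memory_order_release);
    if (background.joinable()) background.join();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling timeLongTaskWithBackground once with withBackground set to false and once with true gives two numbers measured the same way, which is the comparison the original post describes.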

Regards,
Alex

jimdempseyatthecove
Honored Contributor III

It may be that the two threads chosen are hyper-thread siblings on the same core.
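
One way to check which logical CPUs are siblings, as a sketch that assumes a Linux system: the kernel exposes the sibling list of each logical CPU in sysfs as a comma-separated list of numbers and ranges such as "0,4" or "0-1". The helper names parseCpuList and siblingsOfCpu are hypothetical, and the code only reports the topology. It does not pin any thread.

```cpp
#include <cassert>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parses a Linux CPU list such as "0,4" or "0-3" into individual CPU numbers.
// Malformed input makes std::stoi throw; this sketch does not handle that.
inline std::vector<int> parseCpuList(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Drop trailing whitespace such as the newline at the end of the file.
        while (!item.empty() && std::isspace(static_cast<unsigned char>(item.back())))
            item.pop_back();
        if (item.empty()) continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last = (dash == std::string::npos) ? first
                                                     : std::stoi(item.substr(dash + 1));
        for (int c = first; c <= last; ++c) cpus.push_back(c);
    }
    return cpus;
}

// Returns the logical CPUs that share a physical core with `cpu`,
// or an empty vector if the sysfs file cannot be read (e.g. not on Linux).
inline std::vector<int> siblingsOfCpu(int cpu) {
    assert(cpu >= 0);
    std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/topology/thread_siblings_list");
    std::string line;
    if (!in || !std::getline(in, line)) return {};
    return parseCpuList(line);
}
```

If siblingsOfCpu(0) returns, say, CPUs 0 and 4, then two busy threads scheduled on those two logical CPUs compete for one physical core.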

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Note, as coded above, you are:

a) running one task in dummyGroup, on a thread taken from the general (default) thread pool, and that task never terminates, plus
b) running one task in the arena a (which may or may not share a core with the dummyGroup task).

I think you need a better sketch, or a working reproducer that simulates the symptoms. Your sketch above does not appear to be constructed properly (compiling and running without error is not confirmation of correctness).

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

For what it's worth, with the QuickThread library and templates you could do something like this:

atomic<bool> veryLongTaskNotFinished = true;
qt::parallel_invoke(
    OneEach_L1$, // different cores (on Core2 Duo and KNL 2 cores share L2)
    [&] { while (veryLongTaskNotFinished) dummyTask(); },
    [&] { veryLongTask(); veryLongTaskNotFinished = false; });

or

atomic<bool> veryLongTaskNotFinished = true;
qt::parallel_invoke(
    OneEach_L2$, // different L2 caches
    [&] { while (veryLongTaskNotFinished) dummyTask(); },
    [&] { veryLongTask(); veryLongTaskNotFinished = false; });

You can contact me privately if you are interested in QuickThread (it will be released under one of the GNU licenses, free to use). I am currently revising the toolkit for use with C++11, C++14, and later. I have it working on my KNL Linux system but haven't tried a Windows build yet.

Jim Dempsey
