Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Wait for the whole tree of tasks - unstable?

Ovcharenko_E_
Beginner
353 Views

Hello, colleagues! I'm testing TBB tasks and do not get stable results though there is nothing random in the code. [cpp] ... typedef enumerable_thread_specific< int > PerThread; ... class Logger { ... atomic counter; atomic taskCounter; PerThread locInt; ... } ... Logger::Logger () { counter = 200; taskCounter = 0; } ... task* DispTask::execute () { Logger* lg = Logger::Instance(); // lg->Message ("DispTask::execute\t\tcounter=", lg->counter); lg->counter--; int numOfChildren = lg->counter/2; // So here (somehow) we got the number of tasks to be executed (numOfChildren) if (numOfChildren > 0) { task_list tlist; for (int i=0; iMessage ("PropTask::execute"); PerThread::reference tls = lg->locInt.local(); tls++; lg->taskCounter++; // Here must be complex computations DispTask& t = *new(task::allocate_child()) DispTask(); set_ref_count(2); spawn_and_wait_for_all (t); return NULL; } ... int main (int argc, char** argv) { Logger* lg = Logger::Instance(); DispTask& a = *new (task::allocate_root()) DispTask(); task::spawn_root_and_wait (a); // Here I would like to have all the tasks finished! Kind of a barrier. int total_counter=0; for (PerThread::iterator it = lg->locInt.begin(); it != lg->locInt.end(); it++) { total_counter += *it; } lg->Message ("Sum=", total_counter); lg->Message ("Atomic counter=", lg->taskCounter); lg->Flush(); } ... [/cpp] There are 2 types of tasks: PropTask and DispTask. PropTask always leads to DispTask so in the end of PropTask::execute() I need to somehow call (spawn , return) DispTask::execute. DispTask may lead to some non-zero number of resulting objects which means that some number of PropTask need to be executed. Other situation is that DispTask results needs no further processing - terminal point in the task branch. I count the number of PropTasks executed per thread using Thread-Local-Storage. Its sum should be constant (in the test code) because there is nothing random, just dummy decrement. But for all the different variants of the "task-branching" I sometimes get different sums. 1) allocate_child + spawn_and_wait_for_all 2) using empty continuation task: allocate_continuation + cont.allocate_child() + return &t 3) using non-empty continuation task and empty child... I also use a global atomic counter for additional check. It gives the same result. It happens so that among approx. 29 correct results (sum=9900) I get once (sum=9899 or less). Where am I wrong? Also Intel Inspector detect some "Memory leaks" and "Invalid memory access". However many people say that this could be false positive. Thank you! Any ideas?

0 Kudos
10 Replies
Ovcharenko_E_
Beginner
353 Views

Sorry for the bad code! Try to fix it here:

[cpp]
...
typedef enumerable_thread_specific< int > PerThread;
...
class Logger {
atomic<int> counter;
atomic<int> taskCounter;
PerThread locInt;
}
...
Logger::Logger ()
{
counter = 200;
taskCounter = 0;
}
...
task* DispTask::execute () {
Logger* lg = Logger::Instance();
// lg->Message ("DispTask::execute\t\tcounter=", lg->counter);
lg->counter--;
int numOfChildren = lg->counter/2;
// So here (somehow) we got the number of tasks to be executed (numOfChildren)

if (numOfChildren > 0) {
task_list tlist;
for (int i=0; i<numOfChildren-1; i++) {
tlist.push_back ( *new(task::allocate_child()) PropTask() );
}
set_ref_count(numOfChildren+1);
PropTask& lastTask = *new(task::allocate_child()) PropTask();
spawn(tlist);
spawn_and_wait_for_all(lastTask);
}
return NULL;
}
...
task* PropTask::execute () {
Logger* lg = Logger::Instance();
// lg->Message ("PropTask::execute");
PerThread::reference tls = lg->locInt.local();
tls++;
lg->taskCounter++;

// Here must be complex computations

DispTask& t = *new(task::allocate_child()) DispTask();
set_ref_count(2);
spawn_and_wait_for_all (t);
return NULL;
}
...
int main (int argc, char** argv) {
Logger* lg = Logger::Instance();

DispTask& a = *new (task::allocate_root()) DispTask();
task::spawn_root_and_wait (a);

// Here I would like to have all the tasks finished! Kind of a barrier.

int total_counter=0;
for (PerThread::iterator it = lg->locInt.begin(); it != lg->locInt.end(); it++) {
total_counter += *it;
}
lg->Message ("Sum=", total_counter);
lg->Message ("Atomic counter=", lg->taskCounter);

lg->Flush();
}
[/cpp]

0 Kudos
Ovcharenko_E_
Beginner
353 Views

Ok. Spent so much time alone but found my mistake. I hope.

The problem was not in task scheduling but in the number of tasks to be scheduled. Be careful with atomics!

Lines 19-20 should become

[cpp]

int numOfChildren = lg->counter.fetch_and_add(-1)/2;

[/cpp]

0 Kudos
jimdempseyatthecove
Honored Contributor III
353 Views

operator --(int) is provided.
int numOfChildren = lg->counter-- / 2;

Should be valid too.

By the way. I suggest you leave the old code in, conditionalize it out with a strong comment, then add the new statement with a strong comment. This way, the next person to handle your code will not make similar error.

Alternatively you could stongly comment the new statement such that it is clear why it is coded this way.

Jim Dempsey

0 Kudos
Ovcharenko_E_
Beginner
353 Views

Shure.

Two or more threads may run DispTask at the same time.

Yes, lg->counter-- is an atomic decrement (say, op1) and

int numOfChildren = lg->counter/2; is in fact an atomic fetch (say, op2)

So these are two separate atomic operatons. Classic "ABA" situation may happen. (Or however it is called)

Thread 1 puts some number (say, 14) after op1 so 14/2 child tasks should be executed. But then another thread 2 interacts and decrements the variable to 13 by its op1. So when thread 1 fetches the value in op2 it gets not 14 but 13 and executes 13/2 child tasks instead of 14/2.

So, topic name has nothing to do with the problem. Should better call it "typical mistake using atomic variable".

However, still, I get error notifications from Inspector.

0 Kudos
jimdempseyatthecove
Honored Contributor III
353 Views

In the case where lg->counter is of type atomic<int>
The post-fix operator -- is equivilent to fetch_and_add(-1)
Returns prior value of lg-counter in atomic xchgad (lock;xadd [loc],-1).

Your comments do apply to the original two statement form of your original lines 19:20

An alternate correct code would be:

[cpp]
int saveAtomicCounter = lg->counter--; 
int numOfChildren = saveAtomicCounter/2;
[/cpp]

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
353 Views

*** I must caution that you must assess your code to see if there is an issue in the code of counter changing Logger::Instance() by one thread while a different thread is inside DispTask::execute(). If there is an issue then you may need to protect the appropriate objects with lock(s). The use of atomic for counter would seem to indicate that you may require some protection inside DispTask::execute() in the event that critical objects are required not to change during the duration of execute(). 

Jim Dempsey

0 Kudos
Ovcharenko_E_
Beginner
353 Views

Back to the original problem.

As I filled my backbone prototype with computations I sometimes get crashes like this:

[plain]#3762 0x00007fdfd557317b in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fdfcf9c3e00, parent=...,
child=<value optimized out>) at ../../src/tbb/custom_scheduler.h:448
#3763 0x00007fdfcfc6987f in CollDispTask::execute (this=0x7fdfbfd8ea40)
at src/CollDispTask.cxx:115[/plain]

Most probably there are some limitations to the size of the stack of functions. Could it be that I exceed it because of deep deep tree of tasks? So, to clarify: PropTask calls spawn_and_wait_for_all(DispTask), DispTask calls spawn_and_wait_for_all(PropTask), ....
PropTask->DispTask->PropTask->DispTask->PropTask->DispTask->PropTask->DispTask->.... Up to some moment on condition, of course.

thanks

0 Kudos
RafSchietekat
Valued Contributor III
353 Views

That would be possible.

0 Kudos
Ovcharenko_E_
Beginner
353 Views

Thank you, Raf.

Is there any way to check it? Any hints in the output? Is such a number a realistic one? #3763

Does this mean that in general there exists a limit for the size of the tree of tasks? Does not TBB control this somehow?

And of course, next question - how should I try to avoid huge trees but still wait for all tasks?

0 Kudos
RafSchietekat
Valued Contributor III
353 Views

Try to increase stack size and see what the new level is. Divide current stack size by the level and see if it is a realistic amount of stuff for a frame.

TBB cannot prevent things that would also let a sequential program crash. It will however refrain from stealing beyond a certain stack level (a measure that had to be introduced when going from depth-aware scheduling to deque-based scheduling).

One method to prevent excessive stack buildup is to replace blocking style (wait_for_all()) with continuation passing. See the documentation about that.

0 Kudos
Reply