OMP task final

Sergey_L_1 · ‎02-02-2015

Hello, I have just started to learn omp tasks and have a question about final clause. I compiled the following code based on the example from https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-243B37C4-0633-45C1-8207-66569BFDE799.htm with Intel C++ 15.0.0.108 and ran it on simple binary tree of level 3.

#include <iostream>
#include <omp.h>

const int MAX_LEVEL = 3;
using namespace std;

struct node {
	struct node *left;
	struct node *right; 
	int nodeNumber;
};

void process(node * pNode)
{
#pragma omp critical
	{
	cout<<omp_get_thread_num()<<"  "<<pNode->nodeNumber<<endl;
	}
}

void traverse( struct node *p, int lev, int finLevel ) {
	if (p->left)
#pragma omp task  final(1) // p is firstprivate by default
		traverse(p->left,lev+1,finLevel);
	if (p->right)
#pragma omp task final(1)// p is firstprivate by default
		traverse(p->right,lev+1, finLevel);
	process(p);
}

void allocateNodes(node **p, int level, int *nodeNumber) {
	*p = new node;
	(*p)->nodeNumber = *nodeNumber;
	if (level < MAX_LEVEL)
	{
		(*nodeNumber)++;
		allocateNodes(&(*p)->left, level+1, nodeNumber);
		(*nodeNumber)++;
		allocateNodes(&(*p)->right, level+1, nodeNumber);
	}
	else
	{
		(*p)->left = NULL;
		(*p)->right = NULL;
	}
}

int main()
{
	node *p;
	int nodeNumber = 0;
	allocateNodes(&p,0,&nodeNumber);
#pragma omp parallel
#pragma omp single
	traverse(p,0,-1);
}

And get strange output:

0 0
1 1
2 8
0 9
1 5
2 12
0 11
1 7
2 14
0 10
1 6
2 13
0 2
1 3
2 4

But If I compile the same code with g++ I get another output:

2 3
0 0
2 4
0 10
2 2
0 11
2 6
0 9
2 7
2 5
0 13
2 1
0 14
0 12
0 8

jimdempseyatthecove · ‎02-03-2015

I do not think the Intel C++ output is strange. It would seem like final(1) does not preclude a waiting thread from taking the task. IOW the enqueuing thread, though permitted to take the non-deferred task, doesn't get the chance to take it before some other thread takes it.

This may be an implementation detail.

If you think about it, if final(1) resulted in non-enqueuing of the task, and the issuing thread directly runs the task, then only 1 thread would run the program. There is no indication as to the number of threads declared for the OpenMP thread pool, nor the platform particulars.

Jim Dempsey

Sergey_L_1 · ‎02-04-2015

Hi Jim, thank you for your answer, but I think it is not so.

According to openmp specification:

1) It would seem like final(1) does not preclude a waiting thread from taking the task.

included task is undeferred and executed immediately by the encountering thread, i.e by the thread that started processing node 1(or node 8) in this example.

2)If you think about it, if final(1) resulted in non-enqueuing of the task, and the issuing thread directly runs the task, then only 1 thread would run the program.

final task - a task that forces all of its child tasks to become final and included tasks, that is only tasks starting from level 2 and below become included. That is why left and right subtrees are processed by 2 different threads.

I use Intel C++ Composer Version 15.0.0.108 Build 20140726 on windows and icc version 14.0.2 on Linux.

pbkenned1 · ‎02-05-2015

If you expected a correct post order traversal, then I think you need a taskwait before visiting a node, ie:

#pragma omp taskwait
process(p);

However, even adding that, I only get a correct traversal with OMP stubs, or by running with one thread for icc:

[U540192]$ ./U540192-cq-serial-stubs.cpp-icc.x

0 3
0 4
0 2
0 6
0 7
0 5
0 1
0 10
0 11
0 9
0 13
0 14
0 12
0 8
0 0
[U540192]$

[U540192]$ export OMP_NUM_THREADS=1
[U540192]$ ./U540192-cq-parallel.cpp-icc.x

0 3
0 4
0 2
0 6
0 7
0 5
0 1
0 10
0 11
0 9
0 13
0 14
0 12
0 8
0 0
[U540192]$

With gcc-4.8.2, I don't get a correct traversal, even on one thread:

[U540192]$ ./U540192-cq-parallel.cpp-gcc.x

0 0
0 3
0 4
0 2
0 6
0 7
0 5
0 1
0 10
0 11
0 9
0 13
0 14
0 12
0 8
[U540192]$

With > 1 thread, icc doesn't produce a correct traversal. And, it makes no difference whether or not I have final(1). I'll ask the developers.

Patrick

Sergey_L_1 · ‎02-05-2015

Hi Patrick. Postorder tree traversal with icc works fine, all child nodes processed before it parents.

0 14
1 7
2 11
3 13
1 6
2 10
0 12
3 4
1 5
2 9
3 3
0 8
3 2
1 1
0 0

Actually in my main application I want to process nodes in some subtree by the same thread that processed root of this subtree, starting from some layer. For example nodes 3,4,2,6,7,5,1 processed by thread 0; nodes 10, 11, 9 by thread 1; nodes 13,14,12 by thread 2; other nodes by any thread. And all this nodes processed in postorder.

0
/ \
/ \
/ \
/ \
/ \
1 8
/ \ / \
/ \ / \
2 5 9 12
/\ /\ /\ /\
3 4 6 7 10 11 13 14

Andrey_C_Intel1 · ‎02-06-2015

Sergey,

Thank you for reporting this bug. Currently in the Intel compiler the final clause is not implemented for the task construct (it is ignored). That is why you see incorrect output. The output of the GCC compiler looks fine.

We will fix this problem in future releases of the Intel compiler.

Thanks,
Andrey

pbkenned1 · ‎02-09-2015

Andrey,

Thanks for confirming final() is not yet implemented. Just to close the loop, the internal tracking number for this issue is DPD200366130.

Patrick

jimdempseyatthecove · ‎02-09-2015

Why not this:

#include <iostream>
#include <omp.h>

const int MAX_LEVEL = 3;
using namespace std;

struct node {
 struct node *left;
 struct node *right; 
 int nodeNumber;
};

void process(node * pNode)
{
#pragma omp critical
 {
 cout<<omp_get_thread_num()<<"  "<<pNode->nodeNumber<<endl;
 }
}


void traverse( struct node *p, int lev, int finLevel ) {
 if (p->left)
  traverse(p->left,lev+1,finLevel);
 if (p->right)
  traverse(p->right,lev+1, finLevel);
 process(p);
}

void traverse_task( struct node *p, int lev, int finLevel ) {
 if (p->left)
#pragma omp task  final(1) // p is firstprivate by default
  traverse(p->left,lev+1,finLevel);
 if (p->right)
#pragma omp task final(1)// p is firstprivate by default
  traverse(p->right,lev+1, finLevel);
 process(p);
}

void allocateNodes(node **p, int level, int *nodeNumber) {
 *p = new node;
 (*p)->nodeNumber = *nodeNumber;
 if (level < MAX_LEVEL)
 {
  (*nodeNumber)++;
  allocateNodes(&(*p)->left, level+1, nodeNumber);
  (*nodeNumber)++;
  allocateNodes(&(*p)->right, level+1, nodeNumber);
 }
 else
 {
  (*p)->left = NULL;
  (*p)->right = NULL;
 }
}

int main()
{
 node *p;
 int nodeNumber = 0;
 allocateNodes(&p,0,&nodeNumber);
#pragma omp parallel
#pragma omp single
 traverse_task(p,0,-1);
}

or:

#include <iostream>
#include <omp.h>

const int MAX_LEVEL = 3;
using namespace std;

struct node {
 struct node *left;
 struct node *right; 
 int nodeNumber;
};

void process(node * pNode)
{
#pragma omp critical
 {
 cout<<omp_get_thread_num()<<"  "<<pNode->nodeNumber<<endl;
 }
}

void traverse( struct node *p, int lev, int finLevel ) {
 if (p->left)
#pragma omp task  final(1) if(lev==0) // p is firstprivate by default
  traverse(p->left,lev+1,finLevel);
 if (p->right)
#pragma omp task final(1) if(lev==0) // p is firstprivate by default
  traverse(p->right,lev+1, finLevel);
 process(p);
}

void allocateNodes(node **p, int level, int *nodeNumber) {
 *p = new node;
 (*p)->nodeNumber = *nodeNumber;
 if (level < MAX_LEVEL)
 {
  (*nodeNumber)++;
  allocateNodes(&(*p)->left, level+1, nodeNumber);
  (*nodeNumber)++;
  allocateNodes(&(*p)->right, level+1, nodeNumber);
 }
 else
 {
  (*p)->left = NULL;
  (*p)->right = NULL;
 }
}

int main()
{
 node *p;
 int nodeNumber = 0;
 allocateNodes(&p,0,&nodeNumber);
#pragma omp parallel
#pragma omp single
 traverse(p,0,-1);
}

Jim Dempsey

pbkenned1 · ‎02-09-2015

Hello Jim,

Thanks for the additional examples, but I'm not sure what you're suggesting. I still see incorrect post-order traversals for both of them, when more an one thread is involved. The first example calling traverse_task() doesn't look any different to me than Sergey's original example. The second example with the if(level==0) condition seems like it might work, but fails for > 1 thread. For example, with two threads:

C:\ISN_Forums\U540192>U540192-JD2.exe
0 10
1 3
0 11
1 4
0 9
1 2
0 13
1 6
0 14
1 7
0 12
1 5
0 8
1 1
0 0

C:\ISN_Forums\U540192>

Based on the tree diagram Sergey provided, I see the correct traversal only with one thread:

C:\ISN_Forums\U540192>set OMP_NUM_THREADS=1

C:\ISN_Forums\U540192>U540192-JD2.exe
0 3
0 4
0 2
0 6
0 7
0 5
0 1
0 10
0 11
0 9
0 13
0 14
0 12
0 8
0 0

C:\ISN_Forums\U540192>

Patrick

pbkenned1 · ‎02-10-2015

Jim's second example does show that with two threads, each thread correctly process its own subtree in post-order. Thread 0 does the right subtree, and thread 1 the left. I confused myself by thinking that if final(1) was working, the output should be ordered as in the single thread case.

Patrick