I am using icc 188.8.131.52, Build 20141023 for the Intel Xeon Phi. Using the OpenMP task construct in my application, I construct a binary tree and perform a substantial amount of computation for each node. More specifically, after a node is processed, it creates two children. The children are created as new tasks, and the same computations are performed on them. However, the tree is not very deep, and only a relatively small number of tasks are running at any point in time (up to around 15).
To take advantage of the large number of cores on the Xeon Phi, I tried to parallelize the loops that make up the main computational workload within each task, using "#pragma omp parallel" and "#pragma omp for". Although I am getting correct results, the execution time did not improve at all. I do not think the problem is that my program creates too many threads, since I try to restrict the total number of running threads. For example, I timed the processing of the root node (so only 1 task is running at that point and all cores are available for the parallelization of the loops) with and without parallelization of the loops, and I get exactly the same processing time. I also verified that 228 threads are requested by the program in this case.
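For reference, the structure described above can be sketched roughly as follows. This is only an illustration, not the actual application code: the names `process_node`, `visit` and `build_tree`, the sum-of-squares workload and the fixed maximum depth are all hypothetical stand-ins.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the per-node workload: sums the squares 0..n-1.
   When called from inside a task, the "parallel for" opens a nested
   parallel region. */
double process_node(int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (double)i * (double)i;
    }
    return sum;
}

/* Process one node, then create its two children as new tasks until
   max_depth is reached. Returns the number of nodes in the subtree. */
long visit(int depth, int max_depth, int n)
{
    long left = 0, right = 0;
    (void)process_node(n);
    if (depth < max_depth) {
        #pragma omp task shared(left)
        left = visit(depth + 1, max_depth, n);
        #pragma omp task shared(right)
        right = visit(depth + 1, max_depth, n);
        /* Wait for both children before reading left and right. */
        #pragma omp taskwait
    }
    return 1 + left + right;
}

/* One thread creates the root; the rest of the team executes tasks. */
long build_tree(int max_depth, int n)
{
    long nodes = 0;
    #pragma omp parallel
    #pragma omp single
    nodes = visit(0, max_depth, n);
    return nodes;
}
```

With this layout, the loop in `process_node` only gets extra threads if the runtime allows a second active level of parallelism. Otherwise the inner region runs with a team of one thread, which would match the unchanged timings.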
It is my understanding that the OpenMP specification leaves the implementation of nested parallelism to the discretion of the implementer. Might it be the case that I don't see any performance improvement due to the fact that the Intel compiler does not support nested parallelism in this fashion (tasks and parallel loops within each task)? How could I check what is going on?
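One way to check what is going on is to ask the runtime how large the inner team actually is. The sketch below is a minimal example under assumptions: `inner_team_size` and `report_nesting` are hypothetical helper names, and the serial fallbacks exist only so the code still builds when OpenMP is not enabled at compile time.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still builds without OpenMP enabled. */
static int omp_get_max_active_levels(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Returns the team size seen by a parallel region opened from inside a
   task. A result of 1 means the inner region was serialized. */
int inner_team_size(void)
{
    int size = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(size)
        {
            #pragma omp parallel
            {
                /* Exactly one thread of the inner team records its size. */
                #pragma omp single
                size = omp_get_num_threads();
            }
        }
        #pragma omp taskwait
    }
    return size;
}

/* Prints the nesting limit and the observed inner team size. */
void report_nesting(void)
{
    printf("max active levels: %d\n", omp_get_max_active_levels());
    printf("inner team size:   %d\n", inner_team_size());
}
```

If the reported inner team size is 1, nested parallelism is most likely disabled, which is the default in many OpenMP runtimes. It is usually enabled by setting the `OMP_NESTED` environment variable to `true` or by calling `omp_set_nested(1)` before the outer parallel region. `OMP_MAX_ACTIVE_LEVELS` can additionally cap the nesting depth.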
Thank you in advance,
Ioannis E. Venetis
Sometimes the mind gets stuck in the difficult parts and misses the obvious answers.
I am getting a segmentation fault now, but this is obviously some error in my parallel implementation.
Ioannis E. Venetis