Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

omp taskloop Fortran 90

Aly__Omar
Beginner

Dear Fellows,

 

I am working on a code parallelization task that involves nested loops.

I improved the performance by parallelizing an inner loop.

The challenge now is that the threads are created and destroyed on every iteration of the outer loop, which adds overhead.

 

I searched for a way to create the thread pool only once, and I landed on the taskloop construct in OpenMP with Fortran 90.

 

The issue now is that I applied the construct as described in the OpenMP documentation, but execution never enters the loop:

 

!$omp taskloop
    ! ... execution never reaches here
!$omp end taskloop

I then added an enclosing single construct:

!$omp single

!$omp taskloop

!$omp end taskloop

!$omp end single

but that did not help either; execution still never enters the code between the taskloop start and end statements.

 

 

Any suggestion on what to do next would be of great help.

 

Sincerely

11 Replies
Dmitry_P_Intel1
Employee

Hello Omar,

The Intel OpenMP runtime creates the pool of threads (the thread team) at the first parallel region it encounters and reuses it across the whole application.

How do you measure the overhead, and how do you know that it comes from thread creation/destruction?

I would advise using the VTune Amplifier Threading analysis to look at the thread behavior and classify the overhead.
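To illustrate the reuse, here is a minimal timing sketch (not from the original application; the region count is arbitrary) comparing the cost of the first parallel region, where the pool is created, with the average cost of later ones:

program first_region_cost
    use omp_lib
    implicit none
    double precision :: t0, t1, t2
    integer :: k

    ! The very first parallel region pays for thread pool creation.
    t0 = omp_get_wtime()
    !$omp parallel
    !$omp end parallel
    t1 = omp_get_wtime()

    ! Subsequent regions reuse the same team of threads.
    do k = 1, 1000
        !$omp parallel
        !$omp end parallel
    end do
    t2 = omp_get_wtime()

    print *, 'first region (pool creation):    ', t1 - t0
    print *, 'later regions (pool reused), avg:', (t2 - t1) / 1000
end program first_region_cost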

Thanks & Regards, Dmitry

 

 

Aly__Omar
Beginner

Hello Dmitry,

 

Yes, I use VTune Amplifier, and that's how I figured out there is an overhead from thread creation.

 

So if I am doing something like the example below:

do i = 0, 10000
    call subroutinex()
end do


! inside subroutinex:

!$omp parallel private(...)
!$omp do schedule(dynamic, 5)
do i = 0, 500
    ! multiple subroutine calls
end do
!$omp end do
!$omp end parallel

Doesn't that destroy the thread pool each time the end of the parallel region is reached and re-create it every time an !$omp parallel is encountered?

 

Sincerely,

Omar K. Aly

Dmitry_P_Intel1
Employee

Oh, I see now.

To be more precise, the "Creation" overhead metric is about parallel work creation, not thread creation/destruction. The pool of threads is reused by the OpenMP runtime at each outer loop iteration, but the OpenMP RTL spends some time cleaning up the memory structures of an omp parallel region when it is destroyed and creating new ones for the next region.

Probably we can try something like this:

!$omp parallel private(i)
do i = 0, 10000
    call subroutinex()
end do
!$omp end parallel

! inside subroutinex:

!$omp do private(..) schedule(dynamic, 5)
do i = 0, 500
    ! multiple subroutine calls
end do
!$omp end do

In this case the parallel region is created only once, and the more lightweight "omp do" construct distributes the work among the threads at each outer loop iteration. Of course, any work in the outer loop that must not be shared between threads needs to go inside an "omp single" construct.
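Schematically, the whole pattern could look like the following self-contained sketch (only subroutinex comes from this thread; the accumulator array, bounds, and loop body are illustrative placeholders):

program orphaned_do_sketch
    implicit none
    integer :: i
    real :: acc(0:500)

    acc = 0.0

    ! The team is created once, for the whole outer loop.
    !$omp parallel private(i)
    do i = 0, 10000
        call subroutinex(acc)
    end do
    !$omp end parallel

    print *, sum(acc)

contains

    subroutine subroutinex(acc)
        real, intent(inout) :: acc(0:)
        integer :: j

        ! Per-iteration work that must run on one thread only goes
        ! into "single"; its implicit barrier keeps the team in step.
        !$omp single
        ! ... serial per-iteration setup ...
        !$omp end single

        ! Orphaned worksharing construct: legal here because the call
        ! comes from inside the enclosing parallel region.
        !$omp do schedule(dynamic, 5)
        do j = 0, 500
            acc(j) = acc(j) + 1.0   ! stand-in for the real subroutine calls
        end do
        !$omp end do
    end subroutine subroutinex

end program orphaned_do_sketch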

Could you please try this out and see if it helps?

Thanks & Regards, Dmitry

 

Aly__Omar
Beginner

So I don't need to use taskloop anymore?

 

When I tried it, execution didn't go into the do loop when I used single. When I replaced single with master, it went inside the do loop, but because one of the private variables is an allocatable array that is allocated inside the loop and deallocated at its end, I get an exception about allocating an already allocated array.
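As an aside, one standard Fortran idiom that avoids the "allocating an already allocated array" exception when an allocatable is reused across iterations is a guarded allocation; a minimal sketch, with hypothetical names:

program guarded_alloc
    implicit none
    real, allocatable :: work(:)
    integer :: i, m

    do i = 1, 100
        m = 500                       ! per-iteration size (placeholder)
        ! Guard the allocation so a second pass cannot allocate twice.
        if (allocated(work)) deallocate(work)
        allocate(work(m))
        work = real(i)
        ! ... use work(:) ...
    end do
    if (allocated(work)) deallocate(work)
end program guarded_alloc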

 

Sincerely,

Omar K. Aly

Dmitry_P_Intel1
Employee

Could you please give more information on "execution didn't go into the do loop when I used single"? So you tried the proposed semantics with omp parallel on the outer loop, and the execution flow did not enter a do loop that you needed to execute sequentially in the outer loop body?

Also, what was the biggest overhead that VTune showed for "Creation" (as % of elapsed time)?

I tried your example but could not see a significant "Creation" overhead on a machine with 88 logical cores and 1M outer loop iterations.

Thanks & Regards, Dmitry

Aly__Omar
Beginner

Yes, I tried the proposed solution, and it didn't go inside the internal do loop directly after the !$omp do construct.

That happened when I was using !$omp single; when I used !$omp master instead, it entered the loop and showed the error I described.

 

My machine has only 8 logical cores and 4 physical cores.

 

What do you mean by creation time?

 

Is it shown in the Hotspots analysis or in HPC Performance Characterization?

 

Thanks in Advance,

Omar K. Aly

Dmitry_P_Intel1
Employee

Ok, two more questions then:

You wrote: "Yes, I use VTune Amplifier, and that's how I figured out there is an overhead from thread creation." What particular VTune metric showed you that the overhead was from thread creation?

I referred to the "Creation" metric that is available in the Bottom-up grid under "Potential Gain".

Could you please schematically (using my code) show how you introduced the "omp single" construct?

Thanks & Regards, Dmitry

 

 

Aly__Omar
Beginner

Ok

I finally figured out why it wasn't going inside the taskloop: the inner do loop was using i = 0, ubound(some variables),

 

which probably made the taskloop unable to determine the iteration count before it starts. Fixing this made it work, and there is an overall gain of 13 to 17 seconds in performance.

 

The gain is better when using parallel before the outer loop and taskloop for the inner loop.

 

But surprisingly, when checking Creation it was always 0, which makes me ask: does this metric really reflect the time taken to create a thread pool?

 

That brings us to your question (what metric showed me the creation time issue): my answer would be "kmp_fork_barrier", which is 11 seconds lower when I apply the taskloop code.

But the question now is: what else does taskloop improve besides avoiding the repeated creation and destruction of the thread pool?

 

And why does the creation time always show 0 on my side, in both Hotspots and HPC analysis, and in both cases whether the parallel region starts inside or outside the outer loop?

 

Here is the code with omp single:

!$omp parallel
!$omp single

do i = 0, n
    call subroutinex()   ! inner subroutine
end do

!$omp end single
!$omp end parallel


subroutine subroutinex()

    ! definitions go here

    !$omp single
    ! some code
    !$omp end single

    !$omp taskloop private(.....) grainsize(...)
    do j = 0, n
    end do
    !$omp end taskloop

    !$omp single
    ! some more code
    !$omp end single

end subroutine subroutinex

 

 

Sincerely,

Omar K. Aly

Dmitry_P_Intel1
Employee

Hello Omar,

A couple of comments: 

"omp do" construct invocation from "omp single" is not correct and leads to a deadlock, "omp taskloop" can be called from "omp single".

So you should ether use the semantics that I proposed with "omp do" or your last option, not mixing them.

On this:

"the question now, what else does the taskloop enhanced beside avoiding multiple creation and destruction of thread pool"

Again - the thread pool is not created/destroyed at any omp parallel construct, it is reused. Avoiding usage of multiple "omp parallel" moving it before outer loop and do work sharing either through light-weight "omp do" or "omp taskloop" allows to eliminate overhead connected with parallel region data structures initialization/freeing etc.
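For reference, the taskloop variant on its own could look like this minimal sketch (bounds, grainsize, and the accumulator are illustrative placeholders); note that serial per-iteration work can run directly inside the single block, since only one thread executes it:

program taskloop_sketch
    implicit none
    integer :: i, j
    integer, parameter :: n = 1000, m = 500
    real :: acc(0:500)

    acc = 0.0

    !$omp parallel shared(acc)
    !$omp single
    do i = 0, n
        ! Serial per-iteration work can run right here: we are already
        ! on the single thread, so no nested "single" is needed.

        !$omp taskloop grainsize(16)
        do j = 0, m
            acc(j) = acc(j) + 1.0   ! stand-in for the real inner work
        end do
        !$omp end taskloop
        ! The implicit taskgroup at "end taskloop" waits for all tasks.
    end do
    !$omp end single
    !$omp end parallel

    print *, sum(acc)
end program taskloop_sketch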

Why is VTune not catching this? It looks like some of the overhead work for arranging "omp parallel" is executed outside the parallel region mark-up. We need to think about how to take it into account, e.g. by showing "Creation" time for the Serial region in the grid, which we don't do now.

Thank you, Regards, Dmitry

 

Aly__Omar
Beginner

Thank you, Dmitry, for your help and for the discussion :) (Y)

I will try the version with do loops, check the difference in performance (if any), and comment here if the difference is significant, just for information sharing.

 

 

Thanks again,

Omar K. Aly

Aly__Omar
Beginner

Hello Dmitry,

Sorry for asking again; hopefully that is not annoying.

The code below, as suggested by you, doesn't work for me. I don't know why, but it enters the do region and never exits it.

 

Once I add the parallel right before the do, it starts to work.

 

!$omp parallel

do i = 0, 10000
    call subroutinex()
end do

!$omp end parallel

! inside subroutinex:

!$omp do private(..) schedule(dynamic, 5)
do i = 0, 500
    ! multiple subroutine calls
end do
!$omp end do

 

 

Sincerely,

Omar K. Aly
