I seem to be having difficulty utilizing all 8 cores I have access to on a cluster. Essentially, I have three nested loops. The two innermost loops can be worked on in parallel, so I have put an OpenMP directive between the first and the second. It looks as follows:
do ix=1,nx
!$OMP PARALLEL DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END PARALLEL DO
end do
Unfortunately, the code only seems to reach a load of ny (I check that with pbstop +l), while I would like it to reach a load of up to ny*nx. There is a substantial amount of work to be done for each (iy,iz), so I was expecting a load far greater than ny. I have also noticed that the load is lower if I put the iz loop outside of the iy loop and if nz
Thank you for any hints and help.
7 Replies
Quoting - roine.vestman@nyu.edu
I seem to be having difficulty utilizing all 8 cores I have access to on a cluster. Essentially, I have three nested loops. The two innermost loops can be worked on in parallel, so I have put an OpenMP directive between the first and the second. It looks as follows:
do ix=1,nx
!$OMP PARALLEL DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END PARALLEL DO
end do
Unfortunately, the code only seems to reach a load of ny (I check that with pbstop +l), while I would like it to reach a load of up to ny*nx. There is a substantial amount of work to be done for each (iy,iz), so I was expecting a load far greater than ny. I have also noticed that the load is lower if I put the iz loop outside of the iy loop and if nz
I think what you're missing is a collapse clause. Perhaps the following will give you what you want:
do ix=1,nx
!$omp parallel do private(iy,iz) collapse(2)
do iy=1,ny
do iz=1,nz
...
end do
end do
!$omp end parallel do
end do
The description from the OpenMP 3.0 specification reads as follows:
The collapse clause may be used to specify how many loops are associated with the loop construct. ... If more than one loop is associated with the loop construct, then the iterations of all associated loops are collapsed into one larger iteration space which is then divided according to the schedule clause.
That should give you the span of tasks you are trying to achieve.
Thank you for the suggestion. I tried adding collapse(2) but can't get it to work. The error reads:
fortcom: Error: vfi_common_model_main.f90, line 360: Syntax error, found IDENTIFIER 'COLLAPSE' when expecting one of: PRIVATE FIRSTPRIVATE REDUCTION SHARED IF DEFAULT COPYIN NUM_THREADS LASTPRIVATE ...
!$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) &
From which version of Intel Fortran is 'collapse' supported (i.e. from which version is OpenMP 3.0 supported)? I have Intel Fortran version 10.0.023 - do I need to upgrade?
Quoting - roine.vestman@nyu.edu
Thank you for the suggestion. I tried adding collapse(2) but can't get it to work. The error reads:
fortcom: Error: vfi_common_model_main.f90, line 360: Syntax error, found IDENTIFIER 'COLLAPSE' when expecting one of: PRIVATE FIRSTPRIVATE REDUCTION SHARED IF DEFAULT COPYIN NUM_THREADS LASTPRIVATE ...
!$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) &
From which version of Intel Fortran is 'collapse' supported (i.e. from which version is OpenMP 3.0 supported)? I have Intel Fortran version 10.0.023 - do I need to upgrade?
AFAIK you need to update to 11.x to gain support for OpenMP 3.0 with Intel's compilers.
I would also suggest rearranging your code the following way:
!$OMP PARALLEL
do ix=1,nx
!$OMP DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END DO
end do
!$OMP END PARALLEL
This avoids repeatedly creating and tearing down the parallel region, as would happen with your code pattern. Creating/tearing down the parallel region can have a serious impact on performance because of administrative and barrier overhead.
Cheers,
-michael
Quoting - Michael Klemm, Intel
AFAIK you need to update to 11.x to gain support for OpenMP 3.0 with Intel's compilers.
I would also suggest rearranging your code the following way:
!$OMP PARALLEL
do ix=1,nx
!$OMP DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END DO
end do
!$OMP END PARALLEL
This avoids repeatedly creating and tearing down the parallel region, as would happen with your code pattern. Creating/tearing down the parallel region can have a serious impact on performance because of administrative and barrier overhead.
Cheers,
-michael
Michael,
add a private(ix) clause:
!$OMP PARALLEL PRIVATE(ix)
do ix=1,nx
!$OMP DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END DO
end do
!$OMP END PARALLEL
Roine,
Please note that all threads execute each ix from 1 to nx; however, each has its own copy of ix.
Be cautious when adding code between the do ix and do iy loops. You can have code there, but you cannot assume each thread is working on a different slice of the ix iteration space.
Michael's suggestion is good in that it decreases the number of times a thread pool is formed from nx times to 1 time. When you get the newer version of the compiler you can then consider adding Robert's suggestion of collapse(2). Until then you could use omp_get_num_threads and omp_get_thread_num in a single-loop construction to iterate over the combined iy and iz space.
Jim Dempsey
Quoting - jimdempseyatthecove
Michael,
add private(ix) clause
!$OMP PARALLEL PRIVATE(ix)
do ix=1,nx
!$OMP DO ....
do iy=1,ny
do iz=1,nz
...
end do
end do
!$OMP END DO
end do
!$OMP END PARALLEL
Roine,
Please note that all threads execute each ix from 1 to nx; however, each has its own copy of ix.
Be cautious when adding code between the do ix and do iy loops. You can have code there, but you cannot assume each thread is working on a different slice of the ix iteration space.
Michael's suggestion is good in that it decreases the number of times a thread pool is formed from nx times to 1 time. When you get the newer version of the compiler you can then consider adding Robert's suggestion of collapse(2). Until then you could use omp_get_num_threads and omp_get_thread_num in a single-loop construction to iterate over the combined iy and iz space.
Jim Dempsey
Thank you
Quoting - jimdempseyatthecove
Thanks