Intel® Fortran Compiler

When both -parallel and -openmp are specified, how does compiler 11.1.056 parallelize the program?

ks-fujii
Beginner

Hi,

When both -parallel and -openmp are specified, how does compiler 11.1.056 parallelize the program?

Do you have any documentation on this behavior?

In my case, both an auto-parallelization message and an OpenMP message are reported for the same line of the program.

//sample.f

      program sample
      parameter(n=1000)
      real a(n,n),b(n,n)
!$OMP PARALLEL DO
      do j=1,n
         do i=1,n
            a(i,j)=1.0
            b(i,j)=1.0
         enddo
      enddo
      do j=1,n
         do i=1,n
            a(i,j)=a(i,j)*b(i,j)
         enddo
      enddo
      stop
      end

%>ifort -parallel -par-report1 -par-threshold0 -openmp -openmp-report1 sample.f

sample.f(4): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
sample.f(11): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
sample.f(4): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.

Thanks.

ks-fujii

TimP
Honored Contributor III
Your OpenMP PARALLEL DO applies only to the first pair of nested loops. Normally that would prevent auto-parallelization from taking effect on those loops, but of course auto-parallelization still applies to the second pair of nested loops. If it were not for the OpenMP directive, the compiler would be free to fuse the loops, leaving only a single pair of nested loops; it may have done so here. As you know, -par-threshold0 requests auto-parallelization regardless of estimated benefit. In addition, given the popular demand for compilers to remove do-nothing loops, part of this example may have been optimized away entirely. So, although it may be interesting to dig deeper to see what the compiler has done with it, this can't be considered representative of a useful case.
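
For illustration, the fused form would look roughly like this (a sketch of a transformation the compiler is free to apply, not actual compiler output):

//fused.f (illustrative only)

      program fused
      parameter(n=1000)
      real a(n,n),b(n,n)
!     the two original loop nests combined into a single pass;
!     legal here because the second nest only reads values the
!     first nest has already written for the same (i,j)
      do j=1,n
         do i=1,n
            a(i,j)=1.0
            b(i,j)=1.0
            a(i,j)=a(i,j)*b(i,j)
         enddo
      enddo
      stop
      end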
ks-fujii
Beginner

I tried some cases.

CPU: Intel Itanium 2 (Madison 9M), 1.6 GHz

Compiler: Intel Fortran Compiler 11.1.059

//test.f

      parameter(n=1000)
      real a(n,n),b(n,n),c(n,n)
      real*8 dclock
      real stime,etime
      do j=1,n
         do i=1,n
            a(i,j)=1.0
            b(i,j)=1.0
            c(i,j)=0.0
         enddo
      enddo
      stime=dclock()
!$OMP PARALLEL DO
      do i=1,n
         do j=1,n
            do k=1,n
               c(i,j)=c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      etime=dclock()
      write(6,*) "time1 = ", etime-stime

      stime=dclock()
      do i=1,n
         do j=1,n
            do k=1,n
               c(i,j)=c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      etime=dclock()
      write(6,*) "time2 = ", etime-stime
      stop
      end

This program contains two identical triple-nested loops computing a matrix product. The loop order (i,j,k) is not optimal for Fortran's column-major storage, so -O3 will permute it to (j,k,i).
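
For reference, the permuted (j,k,i) order would correspond to a source form like this (a sketch for illustration; the actual transformation is internal to the compiler):

!     permuted matrix-product nest: the stride-1 index i is
!     innermost, matching Fortran's column-major storage
      do j=1,n
         do k=1,n
            do i=1,n
               c(i,j)=c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo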

%>ifort -O3 -parallel -par-report1 -par-threshold0 -openmp -openmp-report1 test.f
test.f(14): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
test.f(6): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
test.f(27): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.
test.f(16): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.

%>setenv OMP_NUM_THREADS 2

%>./a.out

time1 = 0.3183594
time2 = 0.1689453

Both the OpenMP message and the auto-parallelization message are still reported for the same loop, but I think the timings show that the first loop nest is parallelized with OpenMP and the second by auto-parallelization.

Is that correct?

Thanks

ks-fujii

jimdempseyatthecove
Honored Contributor III

ks-fujii,

In my opinion, when auto-parallelization is used in conjunction with OpenMP, the compiler should not auto-parallelize the interior of the OpenMP loops as it did with your line 16. Doing so will generally result in oversubscription of threads and degrade performance. You will also have (paradoxical) issues where a subroutine or function called from within a parallel region contains code that has no OpenMP loops but is a candidate for auto-parallelization. Further, such subroutines and functions could potentially be called both from within an OpenMP parallel region and from outside one. For these situations an "in parallel" test should be made, and then either the non-threaded or the threaded variant of the code should be called.
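
A minimal sketch of such an "in parallel" test, using the standard omp_in_parallel() query (work_serial and work_threaded are hypothetical variants named only for illustration):

      subroutine work(a,n)
      use omp_lib
      integer n
      real a(n)
!     omp_in_parallel() returns .true. when called from inside
!     an active parallel region
      if (omp_in_parallel()) then
!        already threaded: call the serial variant to avoid
!        oversubscribing threads (work_serial is hypothetical)
         call work_serial(a,n)
      else
!        not yet threaded: the threaded variant is safe
!        (work_threaded is hypothetical)
         call work_threaded(a,n)
      endif
      end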

Jim Dempsey

TimP
Honored Contributor III

If an auto-parallel region is created inside an OpenMP parallel region, it appears to create a nested parallel region, where one would expect the inner loop to start additional threads only when OMP_NESTED is in effect and the thread count set, e.g., by OMP_NUM_THREADS has not already been filled. If the compiler is doing this on account of the poor choice of outer loop nesting for the OMP PARALLEL, a diagnostic explaining it would seem desirable.
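
The nesting behavior can be confirmed with the standard OpenMP runtime calls, e.g. (a sketch; compile with -openmp):

      program nestchk
      use omp_lib
!     unless nesting is enabled (OMP_NESTED=true, or a call to
!     omp_set_nested(.true.)), an inner parallel region runs
!     with a team of just one thread
      write(6,*) 'nesting enabled: ', omp_get_nested()
!$OMP PARALLEL
!$OMP SINGLE
      write(6,*) 'outer team size: ', omp_get_num_threads()
!$OMP END SINGLE
!$OMP END PARALLEL
      end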

Perhaps these (odd) examples serve to demonstrate weaknesses of the auto-parallelizer, particularly when used in possible conflict with OpenMP.
