I am trying to parallelize a rather simple Fortran code using OpenMP.
I am using Intel Fortran 9 for Linux (evaluation version, l_fc_p_9.0.021).
The system is a dual-processor Opteron 250 running Red Hat Enterprise Linux 4.
The code consists of 6 sequential blocks, each made of 3 nested DO loops.
When running the code with 2 threads, the gain in performance was poor (20%), although the same code, with the same parallelization directives, scaled almost perfectly on an IBM cluster (nearly 100% gain with 2 processors).
I then modified the code and kept only 2 of the 6 blocks. In that case the gain in performance was nearly 100% when going from 1 to 2 processors.
I simplified the code again, keeping the same structure, and found the same pattern: with 2 blocks the speed-up is good; with more blocks the gain is poor.
It seems to be quite general behaviour.
Could you explain this?
The parallelization directives are:
!$OMP PARALLEL
do n ...
   ! bloc 1
!$OMP DO
   do k ...
      do j ...
         do i ...
            A(i,j,k)=...
            ...
         end do
      end do
   end do
!$OMP END DO
   ! bloc 2
!$OMP DO
   do k ...
      do j ...
         do i ...
            B(i,j,k)=...
            ...
         end do
      end do
   end do
!$OMP END DO
etc...
"IBM cluster" could be a wide variety of things; AIX, linux, Windows, MacOS, PowerPC, Power3/4/5, Xeon, Opteron, Itanium, 32-bit, 64-bit, so I won't attempt to guess about that.
If you have smaller cache on the Opteron, you may be losing cache locality by splitting up loops too fine. If the Opteron is much faster, the overhead of starting and finishing parallel regions may become evident. This hardly exhausts the list of possibilities, but you haven't given much information to go on.
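If the blocks really are independent of one another, one experiment worth trying is to drop the implicit barrier at the end of each worksharing loop (NOWAIT) and pin the distribution with a static schedule, so that each thread keeps working on the same slice of each array in every block and every sweep. A rough self-contained sketch of what I mean (array names, sizes and the arithmetic are made up here, not taken from your code):

program nowait_test
   use omp_lib
   implicit none
   ! placeholders: nmax, A and B stand in for the real sizes and arrays
   integer, parameter :: nmax = 100
   real :: A(nmax,nmax,nmax), B(nmax,nmax,nmax)
   integer :: n, i, j, k
   double precision :: t0, t1

   A = 0.0
   B = 0.0
   t0 = omp_get_wtime()

!$OMP PARALLEL PRIVATE(n,i,j,k)
   ! every thread executes the outer sweep; the work is shared inside each block
   do n = 1, 50
      ! bloc 1: identical static schedules keep each thread on the same k range
!$OMP DO SCHEDULE(STATIC)
      do k = 1, nmax
         do j = 1, nmax
            do i = 1, nmax
               A(i,j,k) = A(i,j,k) + 1.0
            end do
         end do
      end do
!$OMP END DO NOWAIT
      ! bloc 2 (the NOWAIT above is safe because this block only touches B)
!$OMP DO SCHEDULE(STATIC)
      do k = 1, nmax
         do j = 1, nmax
            do i = 1, nmax
               B(i,j,k) = B(i,j,k) + 1.0
            end do
         end do
      end do
!$OMP END DO NOWAIT
      ! one barrier per sweep keeps the n iterations in step
!$OMP BARRIER
   end do
!$OMP END PARALLEL

   t1 = omp_get_wtime()
   print *, 'elapsed seconds: ', t1 - t0
end program nowait_test

Comparing that against the same code with the default barriers would at least tell you how much of the lost scaling is synchronization rather than memory traffic.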
Thank you for the prompt answer, and sorry for not having provided enough information. Here it is; I hope it helps.
1)
The IBM system was an IBM eServer p690+ (SMP, 32 processors, 5.2 Gflop/s, 1.3 GHz; L1 data cache: 32 KB, 2-way set associative; L1 instruction cache: 64 KB, direct mapped; L2 data cache: 3 x 0.5 MB = 1.5 MB, shared by 2 processors).
2)
The system we want to test is an HP ProLiant DL145 server with:
2 AMD Opteron processors, model 250
CPU MHz: 2400
FPU: Integrated
CPU(s) enabled: 2 cores, 2 chips, 1 core/chip
CPU(s) orderable: 1,2
Primary Cache: 64KBI + 64KBD on chip
Secondary Cache: 1024KB(I+D) on chip
L3 Cache: N/A
Other Cache: N/A
2GB of PC2700 DDR SDRAM running at 333MHz
Operating system is Red Hat Enterprise Linux Server 4, 64-bit.
3)
This is the test code :
********CODE TEST************
program test_openmp
!---------------------------------------------------!
   use OMP_LIB
   ....
!----------------Parallel region--------------------!
!$OMP PARALLEL
   do n=1,500*nmax
      !--------------bloc 1
!$OMP DO
      do k=1,nmax
         do j=1,nmax
            do i=1,nmax
               store1=0.00000001*sum1(i,j,k)
               sum1(i,j,k)=sum1(i,j,k)+store1
            end do
         end do
      end do
!$OMP END DO
      !-------------bloc 2
!$OMP DO
      do k=1,nmax
         do j=1,nmax
            do i=1,nmax
               store2=0.00000001*sum2(i,j,k)
               sum2(i,j,k)=sum2(i,j,k)+store2
            end do
         end do
      end do
!$OMP END DO
      !------------bloc 3
      ...
      !-----------bloc 4
      ...
      !-----------bloc 5
      ....
      !-------------bloc 6
      ....
      !-------------------End
   end do ! end do n
!$OMP END PARALLEL
!----------------End parallel region--------------------!
end program test_openmp
********CODE TEST************
I compiled and ran the code as follows. First without parallelization:
ifort test_openmp
./a.out
then I compared with:
ifort -openmp test_openmp
export OMP_NUM_THREADS=2
./a.out
These are the results:
(1 thread = compiled without OpenMP; 2 threads = compiled with -openmp, OMP_NUM_THREADS=2)
1 block  : 1 thread : 15.546 s    2 threads :   7.589 s    gain : 2.0485
2 blocks : 1 thread : 44.5 s      2 threads :  22.379 s    gain : 1.9885
3 blocks : 1 thread : 68.764 s    2 threads :  52.566 s    gain : 1.3081
4 blocks : 1 thread : 91 s        2 threads :  71.186 s    gain : 1.278
5 blocks : 1 thread : 114.064 s   2 threads :  89.173 s    gain : 1.2791
6 blocks : 1 thread : 135.66 s    2 threads : 106.66 s     gain : 1.2719