Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Bad OMP scaling Fortran

Timmo1
Beginner
Hi,

the following code is a simple test to test the scaling of OpenMP. I compiled the source code with -o -openmp. But I think the scaling is very bad.

My questions are:

1. Is CPU_TIME appropriate for measuring the scaling of OpenMP?
2. Can somebody check whether the OMP directives in my code are written correctly?

PROGRAM TEST

IMPLICIT NONE

REAL, SAVE :: TIME_INIT
REAL, SAVE :: TIME_COMPUTE
INTEGER :: TIME_TEMPa
INTEGER :: TIME_TEMPb
INTEGER :: TIME_a(3)
INTEGER :: TIME_b(3)

CALL CPU_TIME(TIME_INIT)
CALL ITIME(TIME_a)
CALL COMPUTE()
CALL ITIME(TIME_b)
CALL CPU_TIME(TIME_COMPUTE)

TIME_TEMPa = (TIME_a(2)*60) + TIME_a(3)
TIME_TEMPb = (TIME_b(2)*60) + TIME_b(3)


WRITE(*,'("+TRACE..........",A,F10.3)') 'TIME COMPUTE(CPU_TIME) ', TIME_COMPUTE-TIME_INIT
WRITE(*,*),"+TRACE..........TIME COMPUTE(ITIME) " , TIME_TEMPa - TIME_TEMPb

END PROGRAM

!**********************************************************************
!* *
!**********************************************************************

SUBROUTINE COMPUTE

IMPLICIT NONE

INTEGER :: MNP = 1000
INTEGER :: MDC = 1000
INTEGER :: MSC = 1000
INTEGER :: ID, IS, IP, IT
REAL :: ACMNP

TYPE DPL_ACTION

REAL, POINTER :: AC1(:,:,:)

END TYPE DPL_ACTION

TYPE(DPL_ACTION), TARGET, SAVE :: DPL_ACTIONS

REAL, POINTER, SAVE :: AC1(:,:,:)

ALLOCATE( DPL_ACTIONS%AC1(MNP,MSC,MDC) )
AC1 => DPL_ACTIONS%AC1


!$OMP PARALLEL SHARED(AC1,MDC,MSC,MNP) PRIVATE(ACMNP,ID,IS,IP)
!$OMP DO

DO ID = 1, MDC

DO IS = 1, MSC

DO IP = 1, MNP

AC1(IP,IS,ID) = IP + IS + ID
ACMNP = AC1(IP,IS,ID)

END DO

END DO

END DO

!$OMP END DO
!$OMP END PARALLEL

END SUBROUTINE COMPUTE


I attached the code for downloading. I hope somebody can give me a hot tip.
Thanks in advance!


Timmo
TimP
Honored Contributor III
With your use of CPU_TIME, you get the total CPU time spent by all the threads. Scaling figures would normally be based on elapsed (wall-clock) time, with the standard OpenMP method being omp_get_wtime().
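A minimal sketch of wall-clock timing (assuming the OMP_LIB module is available, i.e. the code is built with OpenMP):

USE OMP_LIB
DOUBLE PRECISION :: T_START, T_END
T_START = OMP_GET_WTIME()
CALL COMPUTE()
T_END = OMP_GET_WTIME()
WRITE(*,*) '+TRACE..........TIME COMPUTE(OMP_GET_WTIME) ', T_END - T_START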
Use of omp collapse and nontemporal stores could have a large effect on performance in a case like this, where they may be suitable. It's possible that interprocedural analysis removes the loop entirely when you compile without OpenMP, as the compiler can see that nothing depends on whether the loop is executed.
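For example, the loop nest could be parallelized with a collapse clause (a sketch only; collapse requires OpenMP 3.0 or later):

!$OMP PARALLEL DO COLLAPSE(2) SHARED(AC1,MDC,MSC,MNP) PRIVATE(ACMNP,ID,IS,IP)
DO ID = 1, MDC
DO IS = 1, MSC
DO IP = 1, MNP
AC1(IP,IS,ID) = IP + IS + ID
ACMNP = AC1(IP,IS,ID)
END DO
END DO
END DO
!$OMP END PARALLEL DO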
jimdempseyatthecove
Honored Contributor III

Timmo,

The compiler optimization might have noticed that the results of the computations were not used and may have eliminated the code. With optimizations off you should see some scaling. With full optimizations enabled, the store to and load from AC1 might be eliminated.

You also have a benign bug. Try the following (look for ! %%%...)

PROGRAM TEST

IMPLICIT NONE

REAL, SAVE :: TIME_INIT
REAL, SAVE :: TIME_COMPUTE
INTEGER :: TIME_TEMPa
INTEGER :: TIME_TEMPb
INTEGER :: TIME_a(3)
INTEGER :: TIME_b(3)

CALL CPU_TIME(TIME_INIT)
CALL ITIME(TIME_a)
CALL COMPUTE()
CALL ITIME(TIME_b)
CALL CPU_TIME(TIME_COMPUTE)

TIME_TEMPa = (TIME_a(2)*60) + TIME_a(3)
TIME_TEMPb = (TIME_b(2)*60) + TIME_b(3)


WRITE(*,'("+TRACE..........",A,F10.3)') 'TIME COMPUTE(CPU_TIME) ', TIME_COMPUTE-TIME_INIT
WRITE(*,*) "+TRACE..........TIME COMPUTE(ITIME) ", TIME_TEMPb - TIME_TEMPa
! %%%%%%%%%%%%%%%%
! 2ND CALL WRITES OUT SUM OF ARRAY
! WE ARE NOT INTERESTED IN THE SUM
! WE ARE INTERESTED IN INFORMING THE COMPILER THAT THE RESULTS
! OF AC1 ARE USED
CALL COMPUTE()

END PROGRAM

!**********************************************************************
!* *
!**********************************************************************

SUBROUTINE COMPUTE

IMPLICIT NONE

INTEGER :: MNP = 1000
INTEGER :: MDC = 1000
INTEGER :: MSC = 1000
INTEGER :: ID, IS, IP, IT
REAL :: ACMNP

TYPE DPL_ACTION

REAL, POINTER :: AC1(:,:,:)

END TYPE DPL_ACTION

TYPE(DPL_ACTION), TARGET, SAVE :: DPL_ACTIONS

REAL, POINTER, SAVE :: AC1(:,:,:)

! %%%%%%%%%%%%%%%%
IF(ASSOCIATED(AC1)) THEN
! HERE YOU WOULD DO NOTHING
! EXCEPT FOR TESTING WE WRITE SUM AND RETURN
WRITE(*,*) 'SUM=',SUM(AC1)
RETURN
! WHEN NOT TESTING YOU WOULD HAVE HAD A MEMORY LEAK
! ON 2ND AND LATER CALLS
ELSE
ALLOCATE( DPL_ACTIONS%AC1(MNP,MSC,MDC) )
AC1 => DPL_ACTIONS%AC1
ENDIF


!$OMP PARALLEL SHARED(AC1,MDC,MSC,MNP) PRIVATE(ACMNP,ID,IS,IP)
!$OMP DO

DO ID = 1, MDC

DO IS = 1, MSC

DO IP = 1, MNP

AC1(IP,IS,ID) = IP + IS + ID
ACMNP = AC1(IP,IS,ID)

END DO

END DO

END DO

!$OMP END DO
!$OMP END PARALLEL

END SUBROUTINE COMPUTE


Your sample program will be memory bandwidth bound as opposed to compute bound. Memory bandwidth bound programs do not scale well. Depending on the system, 2 or 3 threads can saturate memory bandwidth. Consider replacing

AC1(IP,IS,ID) = IP + IS + ID

with something requiring more computation:

AC1(IP,IS,ID) = SIN(REAL(IP)) + SIN(REAL(IS)) + SIN(REAL(ID))

Jim Dempsey




Timmo1
Beginner
Hi all,

thanks for the nice suggestions. To understand why my software is not scaling well, I still have to ask a few questions. Again I provide a part of the program so that you can follow my ideas.

The main job of the code is to read an FEM mesh, i.e. the element connectivity table and the node coordinates (x and y), which are stored in the data file system.dat and describe the domain of interest.

Subroutines such as NVEKTRI, AREA_SI and SPATIAL_GRID compute the area, the normal vectors and other geometric quantities that are needed for the solution of the problem, which is actually done in main.f90. The structure is: the outer DO loop "IT" represents the time loop, whereas the other two loops call the problem "(MSC * MDC)" times.
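Roughly, the loop structure looks like this (NT, ISC and IDC are only illustrative placeholder names, as is the argument list of CFL_RD; the real code differs in detail):

DO IT = 1, NT ! TIME LOOP
!$OMP PARALLEL DO PRIVATE(ISC,IDC)
DO ISC = 1, MSC
DO IDC = 1, MDC
! EACH (ISC,IDC) PAIR IS INDEPENDENT; IT READS THE SHARED GEOMETRY
! ARRAYS AND WRITES ITS OWN ENTRY OF ITER_EXP
CALL CFL_RD(ISC,IDC)
END DO
END DO
!$OMP END PARALLEL DO
END DO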

The results for each iteration of the loops within the time loop are totally independent of each other, but all of them require the precomputed geometrical data that I have briefly described above. These arrays are shared arrays and are defined in the global datapool module.

The problem is that the code doesn't scale well. Actually it scales really badly, as shown below. I do not understand why ... according to some OpenMP books and Internet resources I have consulted, this should not be the case, since it is said that "reading" shared arrays is not a big problem in OpenMP.

The result of this calculation is at the end of the CFL_RD routine stored in a global array called ITER_EXP(:,:), which again is accessed totally independently by all the threads.

My questions are:

1) How should I define the arrays and how should I allocate the data? I already tried a lot of things, e.g. threadprivate and allocating for each thread by itself, but I did not find any improvement. The scaling of the code became even worse.

2) Can the usage of threadprivate directives in datapool.f90 speed up my code, or have I done something wrong?

3) It seems in that case that, as mentioned in the 2nd comment, the memory access is more expensive than the calculations themselves. However, from my understanding the problem is totally independent for each thread, and therefore there should be a possibility to code it in a way so that the scaling is nearly perfect even for 10000 CPUs. I am afraid that somehow the read access to the global arrays is the bottleneck in the code, but how do I change it? I tried a lot ... nothing has helped ...

4) Do I have some gross mistakes in the code, I mean with respect to OpenMP?

I appreciate any suggestion and would like to thank you for the comments given before ... actually I think that this is a very common problem ...

Cheers

TimP
Honored Contributor III
Which is a common problem? The idea that writing a program for slowness will automatically improve threaded scaling?
If you would follow the standard advice to optimize your program before threading (e.g. by allowing vectorization), you would improve cache locality, which becomes much more important when you engage OpenMP.
Current platforms with more than 4 cores use multiple sockets, each with separate cache. When you require that each one reads its own copy of all the shared arrays, you prevent parallel performance scaling. I don't see that you have any 10000 CPU platform in mind which would support your idea of perfect scaling, but surely it won't happen by requiring 10000 local copies of all of your data.
Timmo1
Beginner
Quoting - tim18
Which is a common problem? The idea that writing a program for slowness will automatically improve threaded scaling?
If you would follow the standard advice to optimize your program before threading (e.g. by allowing vectorization), you would improve cache locality, which becomes much more important when you engage OpenMP.
Current platforms with more than 4 cores use multiple sockets, each with separate cache. When you require that each one reads its own copy of all the shared arrays, you prevent parallel performance scaling. I don't see that you have any 10000 CPU platform in mind which would support your idea of perfect scaling, but surely it won't happen by requiring 10000 local copies of all of your data.
Hi tim,

let me first thank you for spending your time on this question and let me comment on your reply.

1. Which is a common problem?

You questioned whether this is a common problem. My suggestion that this should be a common problem comes from the idea that if one has an independent outer loop, it is somehow intuitive to apply coarse-grained parallelism to the outermost loop in order to keep the parallel coding as simple as possible.

However, this seems not to work out this way ... as you have mentioned.

2. The idea that writing a program for slowness will automatically improve threaded scaling?

You mentioned that the serial code is designed for slowness; glad to know that, since it means there is a lot of room to improve it. However, I have no idea how I could do that.

3. If you would follow the standard advice to optimize your program before threading (e.g. by allowing vectorization), you would improve cache locality, which becomes much more important when you engage OpenMP.

As I have mentioned, we are talking here about an approach for solving a PDE on an unstructured mesh. I have no idea how to code this in a way that allows vectorization, since indexed memory access is one of the problems, with respect to vectorization, when working with unstructured meshes. In contrast, using FDM on structured grids makes vectorization really easy, or better said, inherently possible. It would be really nice if you could give me some example with respect to the source code. The element connectivity table is stored in the array INE, from which the nodal values are extracted in order to do certain computations. Do you have any idea how to vectorize the indexed memory access within the element loop over IE up to MNE?
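The indexed access looks roughly like this (the subscript order of INE and the names NI1, NI2, NI3, AC_LOC and ELEM_VAL are only illustrative; the real code extracts more nodal quantities):

DO IE = 1, MNE
NI1 = INE(IE,1)
NI2 = INE(IE,2)
NI3 = INE(IE,3)
! GATHER NODAL VALUES THROUGH THE INDIRECTION AND COMBINE THEM PER ELEMENT
ELEM_VAL(IE) = (AC_LOC(NI1) + AC_LOC(NI2) + AC_LOC(NI3)) / 3.0
END DO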

4. Current platforms with more than 4 cores use multiple sockets, each with separate cache. When you require that each one reads its own copy of all the shared arrays, you prevent parallel performance scaling. I don't see that you have any 10000 CPU platform in mind which would support your idea of perfect scaling, but surely it won't happen by requiring 10000 local copies of all of your data.

OK, so far so good. It seems that having local copies for each thread is not the way to go. So what remains now?

1. Do fine-grained parallelism within the innermost loops, e.g. over the elements and nodes ... still accessing the shared data from all threads ... and paying the cost of synchronization ...

2. Omit precomputing the geometric quantities that do not change during the computation and evaluate them at runtime instead ... this seems somehow awkward to me ... ?!?!?

3. Actually, I have no more ideas ... perhaps you have some interesting textbook in mind, or some papers or links, whatever ... ?!?!

Thanks again for your kind reply ... however I keep wondering ... mostly about the following ...

- vectorization when working on unstructured meshes
- cache locality for the given problem ...
- accessing shared data by all threads ...

Cheers

Timmo ...



TimP
Honored Contributor III
The indirect memory references, of course, are a big problem for memory locality and efficiency. Still, it seems that it would be much better if you would change the organization of the INE array so as to have locality there, either by reversing the subscripts or by making separate arrays.
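For example, assuming INE is currently declared with the element index first, i.e. something like INE(MNE,3), then declaring it as INE(3,MNE) keeps the three node numbers of each element adjacent in memory (a sketch; your actual declaration may differ):

! ASSUMED CURRENT LAYOUT: THE NODE NUMBERS OF ONE ELEMENT ARE MNE APART
INTEGER :: INE(MNE,3)
N1 = INE(IE,1); N2 = INE(IE,2); N3 = INE(IE,3)

! REVERSED LAYOUT: THE NODE NUMBERS OF ONE ELEMENT ARE ADJACENT
INTEGER :: INE(3,MNE)
N1 = INE(1,IE); N2 = INE(2,IE); N3 = INE(3,IE)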
TimP
Honored Contributor III
Quoting - tim18
The indirect memory references, of course, are a big problem for memory locality and efficiency. Still, it seems that it would be much better if you would change the organization of the INE array so as to have locality there, either by reversing the subscripts or by making separate arrays.
If any of those indirect memory reference operations are the same for all threads, can you arrange to move them ahead of the parallel region, and possibly delay the storage by indirection until you have completed the parallel region?
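Schematically, something like this (a sketch with hypothetical names such as XE, YE, RES, GLOB and SOME_ELEMENT_WORK, assuming the INE(3,MNE) layout from my previous post):

! GATHER THROUGH THE INDIRECTION ONCE, AHEAD OF THE PARALLEL REGION
DO IE = 1, MNE
XE(1:3,IE) = X(INE(1:3,IE))
YE(1:3,IE) = Y(INE(1:3,IE))
END DO

!$OMP PARALLEL DO PRIVATE(IE)
DO IE = 1, MNE
! WORK ONLY ON THE CONTIGUOUS PER-ELEMENT COPIES HERE
RES(IE) = SOME_ELEMENT_WORK(XE(:,IE), YE(:,IE))
END DO
!$OMP END PARALLEL DO

! STORE BY INDIRECTION AFTER THE PARALLEL REGION, IF NEEDED
DO IE = 1, MNE
GLOB(INE(1,IE)) = GLOB(INE(1,IE)) + RES(IE)
END DO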