Re: Intel compiler V11 and OpenMP - problem

FDSUser · 11-21-2008

All,

I'm currently parallelizing a program with OpenMP 3.0 directives, and I have run into a problem with the following subroutine call:

[cpp]    IF (MIXTURE_FRACTION) THEN
    !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(K,J,I,ITMP,Z_VECTOR,CP_SUM,CP_MF,N)
       DO K=1,KBAR
          DO J=1,JBAR
             DO I=1,IBAR
                !IF (SOLID(CELL_INDEX(I,J,K))) CYCLE
                OpenMP_DIVG_005: IF (SOLID(CELL_INDEX(I,J,K))) THEN
                   CALL DO_NOTHING('DIVG_PART1_MIXTURE_FRACTION_RTRM')
                ELSE !OpenMP
                   ITMP = 0.1_EB*TMP(I,J,K)
                   Z_VECTOR = YYP(I,J,K,I_Z_MIN:I_Z_MAX)
                   CALL GET_CP(Z_VECTOR,Y_SUM(I,J,K),CP_MF,ITMP) !!! -> the program stops inside this subroutine with OpenMP; without OpenMP it runs fine
                   IF (N_SPECIES > (I_Z_MAX-I_Z_MIN+1)) THEN
                      CP_SUM = 0._EB
                      DO N=1,N_SPECIES
                         IF (SPECIES(N)%MODE/=MIXTURE_FRACTION_SPECIES) &
                         CP_SUM = CP_SUM + YYP(I,J,K,N)*SPECIES(N)%CP(ITMP)
                      END DO
                      CP_MF = CP_SUM + (1._EB-Y_SUM(I,J,K))*CP_MF
                   ENDIF
                   RTRM(I,J,K) = R_PBAR(K,PRESSURE_ZONE(I,J,K))*RSUM(I,J,K)/CP_MF
                   DP(I,J,K) = RTRM(I,J,K)*DP(I,J,K)
                ENDIF OpenMP_DIVG_005
             ENDDO
          ENDDO
       ENDDO
    !!$OMP END PARALLEL DO
[/cpp]

GET_CP is entered, and the code then stops at the marked line:

[cpp]    SUBROUTINE GET_CP(Z_IN,YY_SUM,CP_MF,ITMP)

    INTEGER, INTENT(IN) :: ITMP
    REAL(EB), INTENT(IN) :: Z_IN(1:I_Z_MAX - I_Z_MIN + 1),YY_SUM
    REAL(EB) ::Z(1:I_Z_MAX - I_Z_MIN + 1),CP_MF,OMYYSUM

    IF (YY_SUM >=1._EB) THEN
       CP_MF = SPECIES(0)%CP(ITMP)
       RETURN
    ELSE
       OMYYSUM = 1._EB-MAX(0._EB,YY_SUM)
       Z = MAX(0._EB,MIN(1._EB,Z_IN))/OMYYSUM  !----> THIS is the line where the code stops
    ENDIF

    CP_MF = (Z2CP_C(ITMP) + DOT_PRODUCT(Z2CP(ITMP,:),Z))/(Y_MF_SUM_C + DOT_PRODUCT(Y_MF_SUM_Z,Z))

    END SUBROUTINE GET_CP
[/cpp]

All variables are correctly initialized and there is no division by zero (OMYYSUM = 1), so everything looks correct. The program compiles, but it then stops when executing the line above. With another compiler (the Sun Fortran compiler) the problem does not occur and the code works fine. Only the executable produced by the Intel Fortran V11 compiler fails on this line.

The error message I get when running the Intel-compiled code is:

[cpp]forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC        Routine            Line        Source             
.                  4001C410  Unknown               Unknown  Unknown
libiomp5.so        400BBE12  Unknown               Unknown  Unknown
libpthread.so.0    400D64FB  Unknown               Unknown  Unknown
libc.so.6          401C3E5E  Unknown               Unknown  Unknown[/cpp]

FDSUser · 11-21-2008

I forgot one detail: if I compile the code without the -openmp flag, it runs perfectly, so the problem seems to be related to the OpenMP environment.

jimdempseyatthecove · 11-21-2008

Try adding AUTOMATIC to the local declarations in GET_CP, i.e. change

REAL(EB)::Z(1:I_Z_MAX-I_Z_MIN+1),CP_MF,OMYYSUM

to

REAL(EB), AUTOMATIC::Z(1:I_Z_MAX-I_Z_MIN+1),CP_MF,OMYYSUM

Jim Dempsey

FDSUser · 11-21-2008

I have tried this, but it has no effect; the problem occurs at the same line. Furthermore, this cannot be the solution: if the variables are implemented correctly in the serial version of the code, nothing should need to change in the OpenMP version. If I had to check every variable, parallelization would be very difficult, because the code has more than 50,000 lines and more than 500 variables.

If it is helpful, I could submit the complete code.

TimP · 11-21-2008

The -openmp option should already make local arrays dynamic (stack-allocated), even if they are not automatic in the standard Fortran sense, so the lack of change is to be expected. The problem usually cited with Fortran automatic arrays is the lack of error diagnosis: if the allocation fails, which might happen simply because the stack size limit is reached at this point, you have no satisfactory way to catch it. Hence the usual recommendation to use ALLOCATABLE arrays with STAT= checking. This applies equally to the serial version, but the OpenMP version will consume more stack.
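A minimal sketch of the ALLOCATABLE-with-STAT= pattern, applied to a GET_CP-like workspace so that a failed allocation comes back as an error code instead of an unexplained crash. The module and routine names, the kind EB defined here and the weight vector W are hypothetical stand-ins, not FDS code; the routine also assumes YY_SUM < 1 (GET_CP handles YY_SUM >= 1 in a separate branch).

```fortran
! Hypothetical sketch (not FDS code): GET_CP-style workspace as an
! ALLOCATABLE local with STAT= checking instead of an automatic array.
module cp_workspace_sketch
  implicit none
  integer, parameter :: eb = kind(1.0d0)   ! assumed stand-in for FDS's EB

contains

  ! Normalizes z_in by (1 - yy_sum) and returns dot_product(w, z).
  ! Assumes yy_sum < 1; ierr /= 0 signals that the workspace allocation failed.
  subroutine weighted_z(z_in, yy_sum, w, res, ierr)
    real(eb), intent(in)  :: z_in(:), yy_sum, w(:)
    real(eb), intent(out) :: res
    integer,  intent(out) :: ierr
    real(eb), allocatable :: z(:)   ! unsaved local, allocated and freed on each call

    res = 0.0_eb
    allocate(z(size(z_in)), stat=ierr)
    if (ierr /= 0) return            ! let the caller report the failure

    z = max(0.0_eb, min(1.0_eb, z_in)) / (1.0_eb - max(0.0_eb, yy_sum))
    res = dot_product(w, z)
    deallocate(z)
  end subroutine weighted_z

end module cp_workspace_sketch
```

Inside the loop, the caller would check IERR after each call and stop with a message, which gives the diagnosis Tim mentions that an automatic array cannot provide.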

FDSUser · 11-21-2008

I have now debugged every line involved in this problem and found where it fails. I don't think it is a stack-size or programming problem; I think it's a compiler problem:

The error starts in the first code fragment I posted:

[cpp]    IF (MIXTURE_FRACTION) THEN
    !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(K,J,I,ITMP,Z_VECTOR,CP_SUM,CP_MF,N)
       DO K=1,KBAR
          DO J=1,JBAR
             DO I=1,IBAR
                !IF (SOLID(CELL_INDEX(I,J,K))) CYCLE
                OpenMP_DIVG_005: IF (SOLID(CELL_INDEX(I,J,K))) THEN
                   CALL DO_NOTHING('DIVG_PART1_MIXTURE_FRACTION_RTRM')
                ELSE !OpenMP
                   ITMP = 0.1_EB*TMP(I,J,K)
                   Z_VECTOR = YYP(I,J,K,I_Z_MIN:I_Z_MAX)[/cpp]

The error occurs at the last line. Z_VECTOR is correctly allocated, but in another module, not in the module where this code fragment runs; in my test case it has size 2, so Z_VECTOR(1:2) is allocated. YYP is also correctly allocated, and I_Z_MIN = 1 and I_Z_MAX = 2, so the section of YYP and Z_VECTOR have the same size and must match. The problem is that the assignment Z_VECTOR = YYP(...) is not carried out, so Z_VECTOR has no values.

I printed YYP(I,J,K,I_Z_MIN:I_Z_MAX) with WRITE(*,*), and that works perfectly: two values appear on the screen. After the line Z_VECTOR = YYP(...) I printed Z_VECTOR with WRITE(*,*), and there the program stops with the error from my first post. Stack size is not the problem: I have 2 GB of stack and at most 100 MB are used, so I think it is a compiler problem.

To make sure that the sizes and values of Z_VECTOR and YYP(...) match, I also tried Z_VECTOR(I_Z_MIN:I_Z_MAX) = YYP(...,I_Z_MIN:I_Z_MAX) instead of Z_VECTOR = YYP(...), but without success. If I compile without the -openmp flag, both forms run fine and no problem occurs. I_Z_MIN and I_Z_MAX are calculated at the start of the program and never changed afterwards, so they cannot be the problem either.
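Since the crash happens at Z_VECTOR = YYP(...), one way to turn it into a readable diagnostic is to check the allocation status and size of the destination before copying. The helper below is hypothetical, not part of the original code, and it redefines EB as double precision; in the real program the kind would come from the program's own definitions. Inside the loop it could be called as CALL CHECKED_COPY(YYP(I,J,K,I_Z_MIN:I_Z_MAX), Z_VECTOR, IERR), with IERR printed when it is nonzero.

```fortran
! Hypothetical diagnostic helper: copy src into an allocatable dst only
! after checking that dst is allocated and has the right size.
module safe_copy_sketch
  implicit none
  integer, parameter :: eb = kind(1.0d0)   ! assumed stand-in for EB

contains

  ! ierr = 1: dst not allocated (e.g. an uninitialized private copy)
  ! ierr = 2: size mismatch;  ierr = 0: copy done
  subroutine checked_copy(src, dst, ierr)
    real(eb), intent(in)                 :: src(:)
    real(eb), allocatable, intent(inout) :: dst(:)
    integer,  intent(out)                :: ierr

    if (.not. allocated(dst)) then
      ierr = 1
    else if (size(dst) /= size(src)) then
      ierr = 2
    else
      dst = src
      ierr = 0
    end if
  end subroutine checked_copy

end module safe_copy_sketch
```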

I tried the code with two other compilers, using their OpenMP flags: XLF 12.1.0.0 on AIX and the current Sun Express compiler on Linux. Both produce code that runs perfectly. With the Intel Fortran 11 compiler the code fails, even though I do not use any optimizations (I compile with -O0), so I think this is not a programming problem but a compiler problem. My OS is Ubuntu 8.04, if that helps with debugging. If the code would be helpful, I can email it.

Thanks for your help!

jimdempseyatthecove · 11-22-2008

Tim and FDSUser,

I have an application that (at least with earlier versions of IVF) consistently exhibited OpenMP problems where a subroutine has local arrays like this:

subroutine foo(x)
real(8) :: x
real(8) :: TOSVX1(3), TOSVX2(3), TOSVX3(3)
...

These local arrays are created as if SAVE were on the declaration; that is, the compiler generates a single static copy of the TOSVXn(3) arrays. Adding AUTOMATIC

real(8), automatic:: TOSVX1(3), TOSVX2(3), TOSVX3(3)

forces the arrays to be allocated on the stack.

Without AUTOMATIC, the threads overwrite each other's data in these temporary arrays.

I could not determine whether this is a compiler bug or an options conflict, but I do know that including AUTOMATIC leaves no ambiguity about my intention that the arrays be local.

Jim Dempsey
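AUTOMATIC is an Intel extension. A standard Fortran way to state the same intent is to declare the procedure RECURSIVE: its local variables without SAVE (and without initializers, which imply SAVE) get a separate instance per invocation, so concurrent calls from different threads do not share them. A minimal sketch, with a hypothetical body for foo since the original is elided:

```fortran
! Sketch: RECURSIVE as a standard-Fortran way to keep local arrays
! per invocation (the body of foo is hypothetical; the original is elided).
module recursive_locals_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

contains

  recursive subroutine foo(x, y)
    real(dp), intent(in)  :: x
    real(dp), intent(out) :: y
    ! No SAVE and no initializer (an initializer would imply SAVE),
    ! so every invocation, and hence every thread, has its own copies.
    real(dp) :: tosvx1(3), tosvx2(3), tosvx3(3)

    tosvx1 = x
    tosvx2 = 2.0_dp * tosvx1
    tosvx3 = tosvx1 + tosvx2
    y = sum(tosvx3)                     ! 3*(x + 2x) = 9x
  end subroutine foo

end module recursive_locals_sketch
```

Calling foo from an !$OMP PARALLEL DO loop then gives each thread its own TOSVX1..3 regardless of compiler options.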

jimdempseyatthecove · 11-22-2008

>>The Z_VECTOR variable is correctly allocated in another module,

Then Z_VECTOR cannot be PRIVATE (unless it is also automatic).

What's happening is this: your Z_VECTOR is allocated in another module, i.e. the array descriptor for Z_VECTOR resides in that module, and allocating Z_VECTOR initializes the array descriptor foo::Z_VECTOR. By making Z_VECTOR an OpenMP PRIVATE variable, you instantiate an uninitialized array descriptor named Z_VECTOR in the scope of each thread, within the context of the parallel region. When compiling without OpenMP (and hence without the PRIVATE), you use the module's copy of Z_VECTOR.

If each thread needs a separate copy of Z_VECTOR, and you cannot stack-allocate it or do not want the overhead of allocating and deallocating it each time you enter and leave the parallel region, then either place Z_VECTOR in THREADPRIVATE data, or generate the appropriate Z_VECTOR reference to pass to the subroutine, e.g. use a pointer to an array of Z_VECTORs indexed by a unique thread sequence number. Note that with nested OpenMP parallelism you cannot use omp_get_thread_num() to obtain a unique thread number; generate your own and place it in THREADPRIVATE data.

Jim Dempsey
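A minimal sketch of the thread-indexed alternative, assuming no nested parallelism and no NUM_THREADS clause larger than omp_get_max_threads(): instead of privatizing an allocatable, one shared 2-D work array is allocated before the parallel region, and each thread uses only the column selected by omp_get_thread_num(). The routine, its arguments and the row-sum computation are hypothetical; the !$ sentinel lines let the code compile without OpenMP as well, with a single slot 0.

```fortran
! Hypothetical sketch: one shared work array, one column per thread,
! instead of a PRIVATE copy of a module-level allocatable.
module thread_workspace_sketch
  !$ use omp_lib
  implicit none
  integer, parameter :: eb = kind(1.0d0)   ! assumed stand-in for EB

contains

  ! For each cell n, stages yyp(n,:) in this thread's column of zwork and
  ! stores its sum in out(n). ierr /= 0 means the work array was not allocated.
  subroutine rowwise_sum(yyp, out, ierr)
    real(eb), intent(in)  :: yyp(:,:)      ! (cell, species)
    real(eb), intent(out) :: out(:)
    integer,  intent(out) :: ierr
    real(eb), allocatable :: zwork(:,:)    ! (species, thread slot)
    integer :: n, nthreads, tid

    nthreads = 1
    !$ nthreads = omp_get_max_threads()    ! only compiled with OpenMP
    allocate(zwork(size(yyp, 2), 0:nthreads-1), stat=ierr)
    if (ierr /= 0) return

    !$omp parallel do private(tid) shared(zwork, yyp, out)
    do n = 1, size(yyp, 1)
      tid = 0
      !$ tid = omp_get_thread_num()
      zwork(:, tid) = yyp(n, :)            ! touch only this thread's column
      out(n) = sum(zwork(:, tid))
    end do
    !$omp end parallel do

    deallocate(zwork)
  end subroutine rowwise_sum

end module thread_workspace_sketch
```

With nested parallel regions omp_get_thread_num() is only unique within the current team, so a self-generated id kept in THREADPRIVATE data would be needed to select the column instead.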

FDSUser · 11-23-2008

Jim,

after some changes today, that's what I found too. I passed the values of YYP directly to the subroutine GET_CP without going through Z_VECTOR, and it works. Allocating Z_VECTOR in the same MODULE as the DO loops also helps. However, I cannot find anything in the OpenMP 3.0 specification that defines this case. The only related item I found is example A.30.2.f, but I think it differs from my construct, because my Z_VECTOR is allocated only once, not twice as in that example. I will ask this question on the OpenMP forum; maybe someone can explain how this should work and be implemented according to the specification.
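The workaround of passing the YYP values straight to GET_CP can be sketched as follows: the section YYP(N,:) is passed as an actual argument, and Z inside the GET_CP-like routine is an automatic array, so each call gets its own storage and no module-level or PRIVATE allocatable is involved. The coefficient names C0, C, D0, D and the 2-D (cell, species) layout are hypothetical simplifications of Z2CP_C, Z2CP, Y_MF_SUM_C, Y_MF_SUM_Z and the 4-D YYP; the routine assumes YY_SUM < 1.

```fortran
! Hypothetical sketch: pass the array section directly instead of copying
! it into a shared or PRIVATE module-level Z_VECTOR first.
module direct_section_sketch
  implicit none
  integer, parameter :: eb = kind(1.0d0)   ! assumed stand-in for EB

contains

  ! Stand-in for GET_CP: normalizes Z and forms the coefficient ratio.
  pure subroutine get_cp_like(z_in, yy_sum, c0, c, d0, d, cp)
    real(eb), intent(in)  :: z_in(:), yy_sum, c0, c(:), d0, d(:)
    real(eb), intent(out) :: cp
    real(eb) :: z(size(z_in))               ! automatic array: per-call storage

    z  = max(0.0_eb, min(1.0_eb, z_in)) / (1.0_eb - max(0.0_eb, yy_sum))
    cp = (c0 + dot_product(c, z)) / (d0 + dot_product(d, z))
  end subroutine get_cp_like

  subroutine fill_cp(yyp, y_sum, c0, c, d0, d, cp)
    real(eb), intent(in)  :: yyp(:,:), y_sum(:), c0, c(:), d0, d(:)
    real(eb), intent(out) :: cp(:)
    integer :: n

    !$omp parallel do
    do n = 1, size(yyp, 1)
      ! The section yyp(n,:) goes straight to the routine.
      call get_cp_like(yyp(n, :), y_sum(n), c0, c, d0, d, cp(n))
    end do
    !$omp end parallel do
  end subroutine fill_cp

end module direct_section_sketch
```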

Thanks for your help!