Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Compiling Hydrodynamic Model with ifort on Intel Quad-Core

Brandon_Capso
Beginner
I have been trying to use ifort to compile a hydrodynamic model for an Intel quad-core processor. The program has OpenMP directives that can be activated/deactivated, and I have activated them correctly; however, the program segfaults shortly after initiating a run when compiled for 4 processors. Note that the program compiles and runs successfully with ifort for a single processor (with the OpenMP directives removed).

I know the OpenMP directives in the code are OK, because I have successfully compiled a very fast executable that runs on 4 processors with a trial version of the Portland Group Fortran compiler, with no changes to the code or directives. Below are the ifort compiler options used for the compile and link steps, respectively (I have a "compile" and a "link" script used to build the proper modules/subroutines).

ifort -openmp -c -module program
ifort -openmp -threads -o program

As I mentioned above, I know my compile and link scripts are OK, as the program compiles successfully for 4 processors using the trial pgf90, and the only differences between the scripts for ifort and pgf90 are the compiler call and options. The options I am using for pgf90 are:

pgf90 -Mnosgimp -mp -c -module program
pgf90 -mp -o program

I have also downloaded a trial of Intel Threading Building Blocks, but the program still segfaults during the first loop, from what appears to be a memory allocation issue. Note that the processor load seems to be "all over the place" when the program starts, before it crashes.

My question here is: what does pgf90 do automatically that I need to tell ifort to do manually? Any help would be greatly appreciated, as I don't feel like spending $600+ on pgf90 when I already have an ifort license. I would be willing to purchase Intel Threading Building Blocks if necessary.

Best Regards,

BC
TimP
Honored Contributor III
pgf90 builds normally have a built-in affinity, similar to what you get with the -par-affinity option of ifort. Normally, instead of using that, you would use the KMP_AFFINITY environment variable, as -par-affinity gives you no run-time options.
Defaults for stack size limits are probably more generous with pgf90. The individual thread stack size limit is controlled by the kmp_set_stacksize function call or the KMP_STACKSIZE environment variable; the default limit is 2 MB for 32-bit, 4 MB for 64-bit.
The shell stack size limit (e.g. ulimit) also needs attention.
If bad behavior is associated with affinity, it's likely there are threading bugs. You might try an evaluation of Intel Thread Checker.

Brandon_Capso
Beginner

Thank you for your response! This gives me a few more things to try... I'm sure that there are no issues with stack size... I set KMP_STACKSIZE and all necessary ulimit settings to maximum, as I have a lot of memory to play with.

I haven't played with affinity yet, but I will try the variables and see what happens. If it continues to malfunction, I will get the thread checker and post the output here. I appreciate your time; I can't even count how much valuable information I have read from your posts over the past couple of months. Thanks again!

Brandon

TimP
Honored Contributor III
There's no way setting KMP_STACKSIZE "to maximum" will work, unless your idea of "maximum" is sane (4 times the default for your still undisclosed choice of OS?).
sblueknigt
Beginner

I should have been more specific... by maximum I mean as high as I would ever need. I'm running Fedora 12 Linux, kernel 2.6.31.9-174.fc12.i686.PAE, on an Intel Xeon quad-core (12 GB RAM). I also should have mentioned that the program runs fine (albeit much slower) when I remove the OMP directives and compile with ifort.

I got the threading analysis tools and was able to get some information, which led me right to the loops following the OpenMP directives in the Fortran code, where read-write and write-write data races were occurring.

There are two modules in the Fortran code that are parallelized with OMP directives: one that calculates gridded output parameters and writes them to a binary file (which executes first, and therefore fails first), and another that does all of the wave propagation, scatter/dispersion operations, etc.

See below... only the code from the first module is attached.

Note that changing the affinity environment variable does not affect the output from the thread checker. Also, note that the run fails shortly after it starts, when it encounters the first parallel loop. Do you see anything wrong with the code? There is definitely a problem in the second section of code (not attached) as well: if I remove the OMP directives on the first section, the second section fails for what I suspect are similar reasons. Thanks for your help!

Below is the output from the Intel thread checker...

All eight diagnostics are data-race errors reported against "w3iogomd.f90":

1. Write -> Write data race: memory write of isea at line 286 conflicts with a prior memory write of isea at line 286 (output dependence).
2. Read -> Write data race: memory write of isea at line 286 conflicts with a prior memory read of isea at line 287 (anti dependence).
3. Write -> Write data race: memory write of factor at line 287 conflicts with a prior memory write of factor at line 287 (output dependence).
4. Read -> Write data race: memory write of factor at line 287 conflicts with a prior memory read of factor at line 292 (anti dependence).
5. Write -> Write data race: memory write of fkd at line 308 conflicts with a prior memory write of fkd at line 310 (output dependence).
6. Read -> Write data race: memory write of fkd at line 308 conflicts with a prior memory read of fkd at line 316 (anti dependence).
7. Read -> Write data race: memory write of fkd at line 310 conflicts with a prior memory read of fkd at line 316 (anti dependence).
8. Write -> Write data race: memory write of fkd at line 308 conflicts with a prior memory write of fkd at line 308 (output dependence).

And here is the Fortran code; the statements flagged by the thread checker are the ISEA, FACTOR, and FKD assignments below.

! 2. Integral over discrete part of spectrum ------------------------ *
!
DO IK=1, NK
!
! 2.a Initialize energy in band
!
AB = 0.
ABX = 0.
ABY = 0.
ABXX = 0.
ABYY = 0.
ABXY = 0.
!
! 2.b Integrate energy in band
!
DO ITH=1, NTH
!$OMP PARALLEL DO PRIVATE(JSEA)
DO JSEA=1, NSEAL
AB (JSEA) = AB (JSEA) + A(ITH,IK,JSEA)
ABX(JSEA) = ABX(JSEA) + A(ITH,IK,JSEA)*ECOS(ITH)
ABY(JSEA) = ABY(JSEA) + A(ITH,IK,JSEA)*ESIN(ITH)
ISEA = JSEA
FACTOR = MAX ( 0.5 , CG(IK,ISEA)/SIG(IK)*WN(IK,ISEA) )
ABXX(JSEA) = ABXX(JSEA) + ((1.+EC2(ITH))*FACTOR-0.5) * &
A(ITH,IK,JSEA)
ABYY(JSEA) = ABYY(JSEA) + ((1.+ES2(ITH))*FACTOR-0.5) * &
A(ITH,IK,JSEA)
ABXY(JSEA) = ABXY(JSEA) + ESC(ITH)*FACTOR * A(ITH,IK,JSEA)
END DO
END DO
!
! 2.c Finalize integration over band and update mean arrays
!
!$OMP PARALLEL DO PRIVATE(JSEA,ISEA,FACTOR)
DO JSEA=1, NSEAL
ISEA = JSEA
FACTOR = DDEN(IK) / CG(IK,ISEA)
EBD(IK,JSEA) = AB(JSEA) * FACTOR
ET (JSEA) = ET (JSEA) + EBD(IK,JSEA)
EWN(JSEA) = EWN(JSEA) + EBD(IK,JSEA) / WN(IK,ISEA)
ETR(JSEA) = ETR(JSEA) + EBD(IK,JSEA) / SIG(IK)
ETX(JSEA) = ETX(JSEA) + ABX(JSEA) * FACTOR
ETY(JSEA) = ETY(JSEA) + ABY(JSEA) * FACTOR
FKD = MAX ( 0.001 , WN(IK,ISEA) * DW(ISEA) )
IF ( FKD .LT. 6. ) THEN
FKD = FACTOR / SINH(FKD)**2
ABR(JSEA) = ABR(JSEA) + AB(JSEA) * FKD
ABA(ISEA) = ABA(ISEA) + ABX(JSEA) * FKD
ABD(ISEA) = ABD(ISEA) + ABY(JSEA) * FKD
UBR(JSEA) = UBR(JSEA) + AB(JSEA) * SIG(IK)**2 * FKD
UBA(ISEA) = UBA(ISEA) + ABX(JSEA) * SIG(IK)**2 * FKD
UBD(ISEA) = UBD(ISEA) + ABY(JSEA) * SIG(IK)**2 * FKD
END IF
ABXX(JSEA) = MAX ( 0. , ABXX(JSEA) ) * FACTOR
ABYY(JSEA) = MAX ( 0. , ABYY(JSEA) ) * FACTOR
ABXY(JSEA) = ABXY(JSEA) * FACTOR
SXX(ISEA) = SXX(ISEA) + ABXX(JSEA)
SYY(ISEA) = SYY(ISEA) + ABYY(JSEA)
SXY(ISEA) = SXY(ISEA) + ABXY(JSEA)
EBD(IK,JSEA) = EBD(IK,JSEA) / DSII(IK)
END DO
!
END DO

_________________________

Thanks again!


BC

TimP
Honored Contributor III

It looks like you meant to make those variables private. There's always a chance that, with optimization, the compiler would in effect make private copies, but there's no reason to count on it.

sblueknigt
Beginner

I'm not sure exactly what you mean by this. Are you saying that when I was using the trial pgf90, the compiler likely made private copies of the variables, and that's why it worked? I think I am missing something about how this works; I thought the problem was with the OpenMP directives. Aren't the variables declared as private already?

Can I use ifort options/environment variables to fix this? If not, how exactly would I go about compiling the code with ifort to run on 4 processors?

I have a feeling that your next reply will fill a huge piece of the puzzle for me...

TimP
Honored Contributor III

You did declare some of those local variables in your PRIVATE clause in one loop, but you missed others, causing them to default to shared.

If you want the compiler to force you to designate everything in the parallel region as shared or private, add DEFAULT(NONE) to the OMP PARALLEL directive.

Since compiler optimization may keep a copy of such a variable in a register local to each thread, it's possible the potential conflicts don't materialize in all cases. If they do, possibilities include incorrect results on one or more threads, performance stalls, or nothing significant.

The DO loop index is private by default, but other variables, even simple copies of the private variables, default to shared, in accordance with the OpenMP standard.
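
For instance, with made-up names (J, TMP, N, X, and Y are not from your code), a loop with DEFAULT(NONE) would look something like this; the compiler then refuses to build the region until every variable has been classified:

!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(J,TMP) SHARED(N,X,Y)
      DO J=1, N
        TMP  = 2.*X(J)        ! every thread writes TMP, so it must be PRIVATE
        Y(J) = Y(J) + TMP     ! N, X, Y can be SHARED; each thread touches only its own J
      END DO
!$OMP END PARALLEL DO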

peterklaver
Beginner

Specifically, in section 2.b you need to add ISEA and FACTOR to the PRIVATE declaration along with JSEA. In section 2.c, you must also include FKD in the PRIVATE declaration along with the other three. As each thread is working with its own value of JSEA in any given iteration, it will need to have its own private copy of any variable whose value is directly dependent on the value of JSEA. Arrays like A, ABX and ABY, on the other hand, can be shared because each thread only works on its own part of the array. Using DEFAULT(NONE), as Tim suggests, would force you to think about whether variables can be shared, or must be made private, which is pretty useful at this stage.
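
In other words, only the directive lines need to change; the loop bodies stay exactly as you posted them (a sketch, not a guarantee that these are the only clauses you will end up needing):

!$OMP PARALLEL DO PRIVATE(JSEA,ISEA,FACTOR)
      DO JSEA=1, NSEAL
!        ... body of section 2.b exactly as posted ...
      END DO

!$OMP PARALLEL DO PRIVATE(JSEA,ISEA,FACTOR,FKD)
      DO JSEA=1, NSEAL
!        ... body of section 2.c exactly as posted ...
      END DO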

FYI, in my still-nascent experience with parallel programming I have seen that race conditions are not unlike array bounds errors, in that they may or may not cause the program to crash. Sometimes the program executes to completion without complaint but leaves you with meaningless results. In this case the thread checker called out the problems for you, which is fortunate.

HTH
sblueknigt
Beginner

The variables ISEA and FACTOR cannot simply be added to the PRIVATE declaration (the program still segfaults immediately), but I think you sent me in the right direction: I needed to explore the FIRSTPRIVATE and LASTPRIVATE declarations. I tried going through and making all variables FIRSTPRIVATE to see what would happen, and the program did not segfault right away like before. I am still a bit unclear as to where/when to use one or the other, so I have basically been making "educated guesses" based on the types of variables, etc. Do you think pgf90 has some kind of built-in algorithm to determine which variables in the PRIVATE clauses need to be FIRSTPRIVATE or LASTPRIVATE?

I now have the program running on 4 processors using only OpenMP directives (without using ifort's auto-parallelizer); however, I do not have the FIRSTPRIVATE/LASTPRIVATE variables declared correctly yet, as the data generated by the model is still wrong (it builds extremely large values during the first 1-2 hours of the simulation). Below are 3 different examples/sections from the Fortran code where the OpenMP directives are inserted; what am I missing here?

!$OMP PARALLEL
!$OMP DO LASTPRIVATE(JSEA,ISEA,ILOW,ICEN,IHGH) PRIVATE(EL,EH,DENOM)
DO JSEA=1, NSEAL
ISEA = JSEA
ILOW = MAX ( 1 , IKP0(JSEA)-1 )
ICEN = MAX ( 1 , IKP0(JSEA) )
IHGH = MIN ( NK , IKP0(JSEA)+1 )
EL = EBD(ILOW,JSEA) - EBD(ICEN,JSEA)
EH = EBD(IHGH,JSEA) - EBD(ICEN,JSEA)
DENOM = XL*EH - XH*EL
FP0(ISEA) = FP0 (ISEA) * ( 1. + 0.5 * ( XL2*EH - XH2*EL ) &
/ SIGN ( MAX(ABS(DENOM),1.E-15) , DENOM ) )
END DO
!$OMP END PARALLEL

---------------------------------------------------------------

!$OMP PARALLEL DO FIRSTPRIVATE(ISPEC,FIELD) SCHEDULE(DYNAMIC,1)
DO ISPEC=1, NSPEC
IF ( IAPPRO(ISPEC) .EQ. IAPROC ) THEN
CALL W3GATH ( ISPEC, FIELD )

CALL W3XYP3 ( ISPEC, FACX, FACY, DTG, MAPSTA, &
MAPFS, FIELD, VGX, VGY )
CALL W3SCAT ( ISPEC, MAPSTA, FIELD )
END IF
END DO

------------------------------------------------------------------

IF ( FLCTH .OR. FLCK ) THEN
DO ITLOC=ITLOCH+1, NTLOC
!$OMP PARALLEL DO LASTPRIVATE(JSEA,ISEA,IX,IY) SCHEDULE(DYNAMIC,1)
DO JSEA=1, NSEAL

ISEA = JSEA
IX = MAPSF(ISEA,1)
IY = MAPSF(ISEA,2)
IF ( MAPSTA(IY,IX) .EQ. 1 ) THEN
CALL W3KTP3 ( ISEA, FACTH, FACK, CTHG0(IY), &
CG(:,ISEA), WN(:,ISEA), DW(ISEA), &
DDDX(IY,IX), DDDY(IY,IX), CX(ISEA), &
CY(ISEA), DCXDX(IY,IX), DCXDY(IY,IX), &
DCYDX(IY,IX), DCYDY(IY,IX), VA(:,JSEA) )

END IF
END DO
END DO
END IF
!

________________________________

Thanks again!

peterklaver
Beginner

In your post that presented code snippet 2.b there was a clear, unambiguous case (as the thread profiler pointed out) of the variables ISEA and FACTOR not being "thread safe". If this were the only thing wrong, then making them PRIVATE (assuming that they were declared with a SAVE attribute in the same program unit containing the parallel region) would fix the problem. Making them FIRSTPRIVATE appears unnecessary, because both ISEA and FACTOR have values assigned to them inside the parallel region before they are subsequently used; LASTPRIVATE would matter if ISEA and FACTOR were used outside the parallel region with whatever value they ended up at.

If, as you say, just making them PRIVATE still gives you bogus results, it's likely that there is something else going on. Now you have provided three more parallel regions; I suppose there are still more. In the first example, I don't see the point in making that first group of variables LASTPRIVATE unless the values are going to be used outside the parallel region after the looping is done. They absolutely need to be PRIVATE, but not necessarily LASTPRIVATE. In the second example, I can see why FIELD might need to be FIRSTPRIVATE, because it is being used in the subroutine calls and must therefore have a value assigned to it; if the subroutines will not modify it, however, then it doesn't need to be PRIVATE.
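
To illustrate the distinction with made-up variables (X, Y, and TMP are not from your model):

      X = 10.0                     ! assigned before the parallel region
!$OMP PARALLEL DO PRIVATE(TMP) FIRSTPRIVATE(X) LASTPRIVATE(Y)
      DO I=1, N
         TMP = X + REAL(I)         ! FIRSTPRIVATE: each thread's private X starts at 10.0
         Y   = TMP                 ! LASTPRIVATE: after the loop, Y holds the value from I = N
      END DO
!$OMP END PARALLEL DO
      PRINT *, Y                   ! referencing Y here is safe only because it is LASTPRIVATE

If Y were never referenced after the loop, plain PRIVATE would do.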

The last example also has subroutine calls. You have to be very careful, I have found, when calling subroutines from parallel regions, that the subroutines do not have any locally declared variables that need to be thread safe. In my experience, anything that needs to be thread safe in this situation has to be:

  1. Declared in the program unit that calls the routine, with SAVE attribute;
  2. Made PRIVATE;
  3. Passed in the argument list; and
  4. Declared in the subroutine with the appropriate INTENT.

Otherwise you open yourself up to problems (although, as tim18 points out, the problems don't always materialize). Perhaps pgf90 took care of this somehow, but I'm pretty sure ifort doesn't. My guess is there are a variety of unsafe declarations in this code. You might want to go back to the thread profiler and try to get each parallel region behaving properly, one at a time. It may also be helpful to get a sense of where exactly the calculations are going awry; having worked quite a bit with EFDC and ECOM, however, I understand that the problem is not trivial. Good luck!
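
P.S. A rough sketch of that pattern (MYCALC, VAL, and SCRATCH are made-up names, not routines or arrays from your model):

!$OMP PARALLEL DO PRIVATE(JSEA,SCRATCH)
      DO JSEA=1, NSEAL
         CALL MYCALC ( JSEA, VAL(JSEA), SCRATCH )   ! SCRATCH is a per-thread temporary
      END DO
!$OMP END PARALLEL DO

      SUBROUTINE MYCALC ( JSEA, V, SCRATCH )
         INTEGER, INTENT(IN)    :: JSEA
         REAL,    INTENT(INOUT) :: V
         REAL,    INTENT(OUT)   :: SCRATCH
         ! No SAVE'd or DATA-initialized locals here: a SAVE'd local would be
         ! shared by every thread and is not thread safe.
         SCRATCH = REAL(JSEA)
         V       = V + SCRATCH
      END SUBROUTINE MYCALC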

Grant_H_Intel
Employee

You mention in a note above that the PGI compiler runs the executable, but did you check the generated numerical results against the results of the serial version? If there are data races (caused by incorrect privatizing of variables) the behavior of the program is impossible to predict. Sometimes, it may seg fault, and sometimes it may just give wrong answers. It depends entirely on the timing of the generated code, the relative timing of the threads, and where variables are allocated.

It is pretty unlikely that the PGI compiler is actually producing correct code automatically if the variables are not classified properly. It is much more likely that the code is sometimes either crashing or producing incorrect answers for both compilers if the program has data races. If Intel Thread Checker shows no data races and the program still crashes, then the problem may be stack sizes, memory allocation, or variable initialization problems (including the SAVE attribute for subroutine variables). Because different OpenMP compilers treat these aspects differently, the symptoms of a crash may vary even if the program is free of data races. I hope this helps!

- Grant
