Compiled Fortran program occasionally doesn't run full loop

backsoon · ‎03-25-2009

Hi there,
I have some issues with Fortran code under Linux and hope somebody can help me with that (and naturally that I'm in the right forum and didn't overlook any possible solutions). I'm not really that familiar with Fortran programming, nor am I a Linux pro, but the program I'm using has been compiled successfully under Windows (MS Development Studio, I think) and runs stably. Now I'm trying to transfer it to Linux (Kubuntu 8.04) and compile it with ifort 11.0 on an Intel P8400. My system is usually up to date.

After some initial problems (dos2unix, ... yes, it's all not that easy), it actually does compile without error messages, only saying that several loops and partial loops have been vectorized. The program itself is quite simple. I can't post the code, though, because it has ~11,000 lines and is copyrighted by another group. Basically, it reads data from several input files, performs some rather simple calculations and writes the results to some output files. It is being used to calculate agricultural parameters for several years (this is the main loop, and the number of years can be set in an options file).

That is where the problem is at the moment: if I run the program once, it's all fine. However, I have to run it several hundred thousand times in a batch/shell script (bash), while another program changes some of the input files and further input files are exchanged in the current directory. Currently, I'm trying to run it for 5 years, so the main loop is carried out five times. It usually does this, but after a while it starts, every now and then, to carry it out only 1-3 times or not at all.

I had a similar problem initially, which was due to not running the executable with "sudo ...", which is obviously necessary to allow the program to create and change files. Sometimes (after several runs) it also gives a SIGSEGV error (severe error (174)..). I read that this can be due to the stack size limit, so I set "ulimit -a unlimited". Still doesn't work. I also read that in older versions of ifort there was a problem with arrays (which are the major type of input data), so I tried compiling it with "ifort -heap-arrays -axP". That also didn't work. My current version of the executable is compiled without any options.

I hope somebody can figure out from this rough and maybe slightly confusing description where the problem might be. It could quite likely have to do with my file permissions (sudo...) or my shell, but I thought it would fit best here.
Thanks in advance. Please tell me, if you need further information.

P.S.: Just ran it again to get the correct error message for the SIGSEGV error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
Xep 000000000044FD45 Unknown Unknown Unknown
Xep 000000000041DDD2 Unknown Unknown Unknown
Xep 000000000040348C Unknown Unknown Unknown
libc.so.6 00007F8235877466 Unknown Unknown Unknown
Xep 0000000000403389 Unknown Unknown Unknown

This time I compiled using "ifort -xW -fp-model precise -V".

TimP · ‎03-25-2009

The -traceback option is needed for any confidence to be placed in the stack trace. However, the indication of a fault in libc.so might be consistent with a problem in data file access.
The supposition that certain cases happened to work in the Windows build but not in the Linux build is not good evidence of program correctness. Have you run the problem cases in a build with -check set?
Running with sudo is neither ideal nor an "obvious" move; presumably it means you don't have correct permissions on certain data files, or they are shared incorrectly among programs. It's possible that a fault in the data files may crash the program. You should have each file operation include iostat= processing (or equivalent) so that you get a message of your own choice each time such a problem occurs.
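
As a rough way to follow up on the permissions point without sudo, the sketch below lists files in a working directory that the current user cannot read or write. The function name list_unwritable is hypothetical, and the assumption that the program's data files all sit in one directory is taken from the description above, not from the actual setup.

```shell
#!/usr/bin/env bash
# List regular files in a directory that the current user cannot read or write.
# list_unwritable is a hypothetical helper; the directory defaults to ".".
list_unwritable() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip directories and unmatched globs
        if [ ! -r "$f" ] || [ ! -w "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    # The directory itself must be writable to create new output files.
    [ -w "$dir" ] || printf '%s (directory not writable)\n' "$dir"
}
```

Files created during an earlier run under sudo are owned by root, which is one plausible reason a later run without sudo cannot overwrite them; changing ownership back with chown would then be a more targeted fix than running everything as root.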

jimdempseyatthecove · ‎03-25-2009

Since you are running through a shell (in a loop or in a long series of steps), after program crash, can you resume the shell at the correct step, and will the program run or crash?

If the program runs (continues), then inspect your shell script to ensure that it is coordinated properly with the application; in particular, that you are not starting the next step before the preceding step has completed, i.e. before all of the data has been flushed to disk.

If the program crashes with the next step, then the data set written is corrupted or was written incorrectly.

On some systems the file delete/rename/relink may be deferred, so this could be getting in the way when running different processes from a shell (as opposed to running within a process).

Jim Dempsey
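
A minimal sketch of the coordination check, assuming the steps are ordinary foreground commands: bash already waits for each foreground command to exit before starting the next one (only commands started with & run in the background), so the more useful addition is to stop the batch as soon as one step returns a non-zero exit status. The name run_step is hypothetical, and it is an assumption that the Fortran executable exits with a non-zero status after a run-time error.

```shell
#!/usr/bin/env bash
# Run one step of the batch and report it if it fails.
# run_step is a hypothetical helper name, not part of the original script.
run_step() {
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        printf 'step failed (exit %d): %s\n' "$status" "$*" >&2
        return "$status"
    fi
}

# Example usage with the two executables named in this thread:
#   run_step ./modawec   || exit 1
#   run_step ./Xepic0509 || exit 1
```

Stopping at the first failure keeps the input files of the failing step in place, which makes it possible to rerun just that step and see whether the failure is reproducible.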

backsoon · ‎03-25-2009

Thanks for the fast responses. Regarding the suggestions from

tim18:

I've now compiled using "ifort -traceback -check ...". The executable no longer performs any calculations but gives the following errors on each run:

forrtl: error (63): output conversion error, unit 51, file /home/krz/MyProject/Xepic/EPIC_OUT.OUT
Image PC Routine Line Source
Xepic0509 00000000005D41A1 Unknown Unknown Unknown
Xepic0509 00000000005D3175 Unknown Unknown Unknown
Xepic0509 00000000005826DA Unknown Unknown Unknown
Xepic0509 00000000005559A2 Unknown Unknown Unknown
Xepic0509 00000000005551D1 Unknown Unknown Unknown
Xepic0509 000000000057FF27 Unknown Unknown Unknown
Xepic0509 000000000042A678 MAIN__ 1939 Epic0509.for
Xepic0509 000000000040348C Unknown Unknown Unknown
libc.so.6 00007FF04BBE5466 Unknown Unknown Unknown
Xepic0509 0000000000403389 Unknown Unknown Unknown
forrtl: severe (408): fort: (3): Subscript #1 of the array KDF1 has value 0 which is less than the lower bound of 1

Image PC Routine Line Source
Xepic0509 00000000005D41A1 Unknown Unknown Unknown
Xepic0509 00000000005D3175 Unknown Unknown Unknown
Xepic0509 00000000005826DA Unknown Unknown Unknown
Xepic0509 00000000005559A2 Unknown Unknown Unknown
Xepic0509 00000000005548BE Unknown Unknown Unknown
Xepic0509 000000000052E919 nftbl_ 8711 Epic0509.for
Xepic0509 0000000000540631 inifp_ 7859 Epic0509.for
Xepic0509 0000000000429D6E MAIN__ 1935 Epic0509.for
Xepic0509 000000000040348C Unknown Unknown Unknown
libc.so.6 00007FF04BBE5466 Unknown Unknown Unknown
Xepic0509 0000000000403389 Unknown Unknown Unknown

EPIC_OUT.OUT is one of the output files (I didn't have any problems with that one before). Regarding the array KDF1: in the source code it looks like there is a line to prevent element "0" from being addressed:

...
IF(NDF.EQ.0)GO TO 1
DO 264 L=1,NDF
IF(KDF(L).EQ.JX(7))GO TO 392
264 CONTINUE
1 NDF=NDF+1
KDF(NDF)=JX(7)
KDF1(JX(7))=NDF
...

Or am I wrong there?
This sudo thing is surely not very comfortable, but it seems to be necessary on Ubuntu to allow a program to overwrite files... I have, though, changed the permissions of all input and output files to -rwxrwxrwx, so I wouldn't expect a problem there. I'm actually waiting for access to a server, where I might have a better environment and be able to check whether it all depends on my current machine or ifort installation.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

for the suggestions from jimdempseyatthecove:

I was guessing that the shell script might "run too fast" (say: not write the files fast enough, before performing the next step). Is this actually possible? The script file looks more or less like this (might look a little awkward, but couldn't figure out a better way...):

...
cp ~/MyProject/Data/Soil/EPICSoilWise/4885.sol ./TEMP.SOL
cp ~/MyProject/Data/Climate/CRU_TS_2.10/Monthly/pre/Land/127245.txt ./TEMP.PCP
cp ~/MyProject/Data/Climate/CRU_TS_2.10/Monthly/tmx/Land/127245.txt ./TEMP.TMX
cp ~/MyProject/Data/Climate/CRU_TS_2.10/Monthly/tmn/Land/127245.txt ./TEMP.TMN
cp ~/MyProject/Data/Climate/CRU_TS_2.10/Monthly/wet/Land/127245.txt ./TEMP.WTD
./modawec <==== 1st executable. creates input files for 2nd one, but doesn't seem to make trouble
awk 'NR==4 {sub($1, "32.25") sub($2, "-116.75") sub($3, "332.00"); print}1' TEMP.SIT > sitetemp
uniq sitetemp > unisite
rm sitetemp
mv unisite TEMP.SIT
awk 'NR==5 {sub($3, ".06"); print}1' TEMP.SIT > sitetemp
uniq sitetemp > unisite
rm sitetemp
mv unisite TEMP.SIT
awk 'NR==1, NR==3 {print} NR==4 {sub($8, "1.00"); print} NR==5 {sub($3, "130.00"); print} NR==6, EOF {print}' epiccont.dat > contemp
mv contemp epiccont.dat
./Xepic0509 <===== 2nd executable, that "crashes" every now and then
... (repeats some thousand times)
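
One detail in the script above may be worth a second look. In the first two awk commands, line 4 (or 5) is printed twice, once by the explicit print and once by the trailing 1, and uniq is then used to drop the duplicate. uniq also collapses any other adjacent identical lines in TEMP.SIT, which could silently shorten the file. A sketch of an alternative that prints every line exactly once is shown below; set_line4_fields is a hypothetical name, and the substitution itself is kept the same as in the original script.

```shell
#!/usr/bin/env bash
# Replace the first three fields on line 4 of a file; print every line once.
# set_line4_fields is a hypothetical helper; the new values are arguments.
set_line4_fields() {
    local file="$1" v1="$2" v2="$3" v3="$4"
    awk -v a="$v1" -v b="$v2" -v c="$v3" '
        NR == 4 { sub($1, a); sub($2, b); sub($3, c) }   # edit line 4 in place
        { print }                                        # print each line once
    ' "$file"
}

# Example usage (writes to a temporary file, then replaces the original):
#   set_line4_fields TEMP.SIT 32.25 -116.75 332.00 > sitetemp && mv sitetemp TEMP.SIT
```

As in the original, sub() treats the old field value as a regular expression and may change the width of the line, so if the Fortran code reads these lines with fixed-column formats the column positions should be checked after the edit.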

The second executable doesn't really crash or hang; it just doesn't perform the whole procedure. Logging the output into a file looks about like this:

RUN#= EPIC_OUT SIT#= 200 WP1#= 1 SOL#= 999 OPS#= 5

YEAR 1 OF 5

YEAR 2 OF 5

YEAR 3 OF 5

YEAR 4 OF 5

YEAR 5 OF 5

RUN#= EPIC_OUT SIT#= 200 WP1#= 1 SOL#= 999 OPS#= 5

YEAR 1 OF 5

YEAR 2 OF 5

RUN#= EPIC_OUT SIT#= 200 WP1#= 1 SOL#= 999 OPS#= 5

YEAR 1 OF 5

YEAR 2 OF 5

YEAR 3 OF 5

YEAR 4 OF 5

YEAR 5 OF 5

So the 1st and 3rd outputs are correct, whereas in the 2nd one the loop wasn't completed. Because the output is written continuously to the output files (e.g. year1.txt .... year5.txt), it's also not possible to resume anything. There are just some lines missing in between, which messes up all the data.
Is there a way to configure a shell to wait for each step to complete (I'm currently using bash)? I could not directly find a solution for that in the bash manpage.
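
To locate the incomplete runs in a long log, something like the following sketch could be used. It assumes the log has exactly the layout of the excerpt above (a "RUN#=" header followed by "YEAR k OF n" lines); find_incomplete_runs is a hypothetical name.

```shell
#!/usr/bin/env bash
# Report runs in a log whose last "YEAR k OF n" line has k < n,
# or that have no YEAR line at all. Runs are numbered in log order.
# find_incomplete_runs is a hypothetical helper; the log layout is assumed.
find_incomplete_runs() {
    awk '
        function report() {
            if (run > 0 && (total == 0 || last != total))
                printf "run %d stopped at year %d of %d\n", run, last, total
        }
        /^ *RUN#=/                     { report(); run++; last = 0; total = 0 }
        /^ *YEAR +[0-9]+ +OF +[0-9]+/  { last = $2; total = $4 }
        END                            { report() }
    ' "$1"
}
```

Applied to the three runs shown above, this reports only the second one. Knowing which run numbers fail makes it possible to rerun exactly those input combinations by hand.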

jimdempseyatthecove · ‎03-25-2009

I would guess that JX(7) is 0 when the error occurs.

Jim Dempsey

backsoon · ‎03-26-2009

Yes, you're right, thanks for the hint. I confused KDF with KDF1 when I said it's being prevented from becoming 0. I'll try to trace that back. But still I'm using the exact same files as with the Windows version, so it can't really be an error in the input files, but must be some problem with either compiling or executing the program...

TimP · ‎03-26-2009

Quoting - backsoon

I'm using the exact same files as with the Windows version, so it can't really be an error in the input files

I don't understand that comment. You implied earlier that many of the files were written while the system is running.
Among the common ways for a file to break when copied from Windows to linux would be automatic removal of carriage returns.
The low level run time functions derive from totally different source code, supported by organizations with different points of view, and different handling of exceptions.
The possibility of results differing erratically when reading a file which another program still has open for write was mentioned already.
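
As a quick check of the carriage-return point, the sketch below lists which of the given files still contain at least one carriage return, so that the input files can be compared for consistent line endings. list_cr_files is a hypothetical name; whether the program actually depends on the line endings is an open question here.

```shell
#!/usr/bin/env bash
# Print the names of files that contain at least one carriage return (CR).
# list_cr_files is a hypothetical helper name.
list_cr_files() {
    local cr
    cr=$(printf '\r')                 # a literal CR character
    grep -l -- "$cr" "$@" 2>/dev/null
    return 0                          # "no file matched" is not an error here
}

# Example usage:
#   list_cr_files ./*.SOL ./*.SIT ./*.dat
```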

backsoon · ‎03-30-2009

Thanks again for all the input. As far as I can judge at the moment, the program seems to compile and run stably now, after re-installing ifort and re-importing all source files from Windows to Linux. So it seems I messed something up during all my trials, copying and changing of options. I'm sure your suggestions will still come in handy in case I get further issues when transferring it to a server.
Regarding the "different points of view...": that's where I'm stuck right now, trying to figure out the correct floating-point precision settings. The results differ only a little between my Win32 and Linux executables, but these small differences lead to very different final results after several calculations. I'll open another thread for this topic, though, in case I can't figure it out myself.