John_D_12
Beginner
1,056 Views

Possible compiler bug with multithreaded file access

Hello,

 

There seems to be an Intel Fortran compiler bug introduced sometime around version 19.1.0.

 

It's a bit difficult to pin down, but the problem occurs in our electronic structure code Elk (http://elk.sourceforge.net/) with direct-access files when many threads are writing to the same file, but not at the same time.

Here is a snippet of the code that writes the file:

!$OMP CRITICAL(u122)
open(122,file=trim(fname),form='UNFORMATTED',access='DIRECT',action='WRITE', &
recl=recl)
write(122,rec=ik) vkl(:,ik),nmatmax,nstfv,nspnfv,evecfv
close(122)
!$OMP END CRITICAL(u122)

 

And the code that reads the file:

!$OMP CRITICAL(u122)
open(122,file=trim(fname),form='UNFORMATTED',access='DIRECT',action='READ', &
recl=recl)
read(122,rec=ik) vkl_,nmatmax_,nstfv_,nspnfv_,evecfv
close(122)
!$OMP END CRITICAL(u122)

 

The named OMP CRITICAL sections ensure that only one thread is reading or writing at any one time. Unit 122 is never used anywhere else in the code, just in these two subroutines.

Now the bug. It occurs when I run the code with several threads.

The error is:

forrtl: severe (554): direct I/O not consistent with OPEN options

 

The Intel Fortran versions which are affected (and to which I have access) are 19.1.0 and 19.1.1.

 

Intel Fortran versions 18.0.0, 18.0.5, 19.0.4 and 19.0.5 work fine. GFortran also works fine.

 

To test this, you'll have to download Elk, compile it and run 'make test'. Most tests pass but several crash because of this problem.

 

(Thanks to Pavlov Nikita for pointing out the bug in the first place.)

 

Regards,

John.

(Max-Planck Institute, Germany.)

 

16 Replies
jimdempseyatthecove
Black Belt
1,046 Views

Not that this is related to the error message...

In your read, you use evecfv, shouldn't this be evecfv_? (or is evecfv PRIVATE?)

FWIW I found this:

 

severe (554): Direct I/O not consistent with OPEN options

FOR$IOS_F6203. A REC= option was included in a statement that transferred data to a file that was opened with the ACCESS='SEQUENTIAL' option.

 

Do you have any files that are opened elsewhere that use SEQUENTIAL access?

If so, you might want to insert some ASSERT code to assure it didn't snag unit 122.
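One way such a check could look is sketched below. This is only an illustration, not code from Elk: the helper name assert_unit_free and the scratch file are hypothetical, and in the real code the call would have to sit inside the same CRITICAL section as the OPEN so that no other thread can connect the unit in between.

```fortran
program check_unit
implicit none
integer, parameter :: u = 122   ! unit number taken from the thread
logical :: opened_
character(len=16) :: access_

! the unit must not be connected before we open it
call assert_unit_free(u)
! scratch file used here only so that the sketch is self-contained
open(u,status='SCRATCH',form='UNFORMATTED',access='DIRECT',recl=8)
inquire(unit=u,opened=opened_,access=access_)
print *,'opened ',opened_,' access ',trim(access_)
close(u)
! after the close the unit should be free again
call assert_unit_free(u)

contains

! hypothetical helper: stop with an error if the unit is already connected
subroutine assert_unit_free(unit)
integer, intent(in) :: unit
logical :: is_open
character(len=16) :: acc
inquire(unit=unit,opened=is_open,access=acc)
if (is_open) then
  write(*,'(A,I0,A,A)') 'unit ',unit,' already connected, access=',trim(acc)
  error stop 1
end if
end subroutine assert_unit_free

end program check_unit
```

If the assertion ever fires just before the OPEN of unit 122, something else has connected that unit, and the reported access mode shows how.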

Jim Dempsey

John_D_12
Beginner
1,034 Views

Hi Jim,

 

Reading evecfv is the intention of the code. The other variables have the underscore to avoid conflict with their global equivalents.

 

Unit 122 is not used anywhere else in the code. Also, this code has been working fine for the past decade or so with several generations of Intel compilers. The error appears to be a recent development.

 

Regards,

John.

 

Steve_Lionel
Black Belt Retired Employee
1,029 Views

Which compile options are you using? 

All I can think of is that the CRITICAL section isn't doing what it is supposed to, and that you end up with this code executing in more than one thread at the same time.  Regardless, reporting here using a snippet isn't going to be productive - I suggest you open a ticket at https://supporttickets.intel.com/servicecenter?lang=en-US and see if you can provide a complete reproducible example.

jimdempseyatthecove
Black Belt
1,016 Views

>>Unit 122 is not used anywhere else in the code

The point I was trying to make is "You know what the code is supposed to be doing"

All I asked was to programmatically verify that it is not being used.

Examples of potential points of error:

1) You are using (elsewhere) a variable (not parameter, not literal) that you assume is something other than 122.

2) You are using NEWUNIT, and for some reason internal to IVF its assigned unit number (supposed to be negative) somehow corrupts non-NEWUNIT unit numbers. (bug in IVF).

The supposition of a problem with critical can be tested by encapsulating the open and close "protected" critical sections with an OpenMP lock and unlock of a global shared variable. (Also note that if this is the issue, you now have a workaround.)
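A minimal sketch of that test follows, under stated assumptions: the lock variable lock122, the file name LOCKTEST.OUT and the one-integer record are all made up for illustration and are not taken from Elk. The original CRITICAL section is kept and simply wrapped in an OpenMP lock from omp_lib.

```fortran
program lock_sketch
use omp_lib
implicit none
integer(omp_lock_kind) :: lock122   ! hypothetical shared lock guarding unit 122
integer, parameter :: nrec = 8
integer :: ik, recl_, val, nerr

! record length in whatever units the processor expects for RECL=
inquire(iolength=recl_) ik
call omp_init_lock(lock122)

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ik)
do ik=1,nrec
! the lock encloses the original critical section
  call omp_set_lock(lock122)
!$OMP CRITICAL(u122)
  open(122,file='LOCKTEST.OUT',form='UNFORMATTED',access='DIRECT', &
   action='WRITE',recl=recl_)
  write(122,rec=ik) ik
  close(122)
!$OMP END CRITICAL(u122)
  call omp_unset_lock(lock122)
end do
!$OMP END PARALLEL DO

call omp_destroy_lock(lock122)

! serial read-back to confirm that every record was written
nerr=0
open(122,file='LOCKTEST.OUT',form='UNFORMATTED',access='DIRECT',recl=recl_)
do ik=1,nrec
  read(122,rec=ik) val
  if (val /= ik) nerr=nerr+1
end do
close(122,status='DELETE')
print *,'mismatched records: ',nerr
end program lock_sketch
```

If the failure disappears with the lock in place, that would point at the CRITICAL section; if it persists, mutual exclusion is probably not the problem.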

Jim Dempsey

John_D_12
Beginner
1,004 Views

All file open statements in Elk have an explicit unit number. NEWUNIT is not used anywhere in the code.

 

The fact that all the code works on most versions of Intel Fortran and all versions of GFortran suggests that this is a compiler problem.

 

I'll see if I can create a small test program which reproduces the error.

 

Thanks,

John.

 

John_D_12
Beginner
975 Views

The compile options are

ifort -O3 -ip -qopenmp -traceback

 

Unfortunately, try as I might, I can't produce a self-contained example.

 

I do have some more information though. When I insert:

 

!$OMP CRITICAL(u122)
open(122,file=trim(fname),form='UNFORMATTED',access='DIRECT',action='READ',recl=recl)

!**************
inquire(122,action=action_,form=form_,name=name_,named=named_,opened=opened_,access=access_)
print *
print *,'action ',action_
print *,'form ',form_
print *,'name ',trim(name_)
print *,'named ',named_
print *,'opened ',opened_
print *,'access ',access_
!**************

read(122,rec=ik) vkl_,nmatmax_,nstfv_,nspnfv_,evecfv
close(122)
!$OMP END CRITICAL(u122)

 

... then every so often, I get

 

 action READ                            
 form UNFORMATTED                     
 name /cobra/u/jdewhurs/elk/src/EVECFV.OUT
 named  T
 opened  T
 access DIRECT                          
 
 action READ                            
 form UNFORMATTED                     
 name /cobra/u/jdewhurs/elk/src/EVECFV.OUT
 named  T
 opened  T
 access SEQUENTIAL                      
forrtl: severe (554): direct I/O not consistent with OPEN options
Image              PC                Routine            Line        Source             
elk                00000000011DA68B  Unknown               Unknown  Unknown
elk                00000000011FA83C  Unknown               Unknown  Unknown
elk                000000000046DFCC  getevecfv_                 81  getevecfv.f90
elk                000000000041BF0E  rhomagv_                   46  rhomagv.f90
libiomp5.so        00002B01CD110CC3  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002B01CD096283  Unknown               Unknown  Unknown
libiomp5.so        00002B01CD09524E  Unknown               Unknown  Unknown
libiomp5.so        00002B01CD11119C  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B01CD3ED6DA  Unknown               Unknown  Unknown
libc-2.22.so       00002B01CD8F527D  clone                 Unknown  Unknown

 

As you can see, the access to the file has been randomly switched to SEQUENTIAL despite being opened in the previous line with DIRECT. As I mentioned before, unit 122 is only ever used with DIRECT access.

 

It appears that the latest versions of Intel Fortran (19.1.0 and 19.1.1) are opening unit 122 unbeknownst to the rest of the code and not respecting the CRITICAL statement. I also tried removing the name of the CRITICAL section (u122) but that didn't work either.

 

Regards,

John.

 

andrew_4619
Valued Contributor III
951 Views

A long shot, but maybe the stored data variables for the open are getting corrupted by some unrelated problem. Try setting the "access" as a variable just before the open and then write that also. That will, in my mind, confirm whether it is a problem in the Fortran RTL or not.

 

Steve_Lionel
Black Belt Retired Employee
940 Views

Also try adding -threads. I don't think this ought to be necessary, but it can't hurt to try. This adds a call from the main program to have the run-time library protect itself against multithread access. (I assume the main program is in Fortran.)

I also have a sneaking suspicion that OPEN on a connected unit is involved somehow. If I had to guess, the bug is in the run-time library and not the compiler. I suggest opening a ticket with Intel support and give them what you have so far.

 

John_D_12
Beginner
939 Views

I stored the string 'DIRECT' in a variable and used that with 'access' in the open statement (Is this what you meant?).

 

However the problem remains.

 

jimdempseyatthecove
Black Belt
937 Views

Similar to Andrew's post "unrelated problem"

The cause of this anomaly may be due to memory corruption. Compile your program with full compile time and runtime diagnostics. This may catch a programming error (incorrect interface, array bounds exceeded, etc...). Note, the diagnostics will not catch all such errors.

One particular problem (RE: Serial works, Parallel has problems) that can show up (or at least used to be a problem) was using PRIVATE(unallocatedArray) as opposed to FIRSTPRIVATE(unallocatedArray). The problem used to be that the contents of the private array descriptor were not initialized.

Jim Dempsey

John_D_12
Beginner
928 Views

I tried recompiling the code with -threads but the problem persisted. (The entire code is in Fortran.)

 

I also followed your suggestion and opened a ticket with Intel support.

 

Thanks, John.

andrew_4619
Valued Contributor III
922 Views

Yes, that is what I meant. Was the value of the variable still "direct" when the open said it was sequential and crashed? I think filing a ticket is the way to go.

Cohen__Ronald
Beginner
657 Views

I wonder if there is any progress with this. I have the same problem with the same code.

severe (554): direct I/O not consistent with OPEN options

 

I think I did verify that it is not a problem with the run-time library. If I compile after

. /home/beegfs/intel/compilers_and_libraries_2019.5.281/linux/bin/compilervars.sh

giving

ifort -v
ifort version 19.0.5.281

 

The code runs fine, even if I try running under the new compiler environment:

ifort -v
ifort version 19.1.2.254

 

 

But compiling under the newest compiler gives the error reported above.

Is there a workaround yet to use the new compiler, or a bug fix coming?

 

Ron Cohen

 

 

John_D_12
Beginner
641 Views

I opened a ticket with Intel Support and they confirmed that it is indeed a compiler bug.

 

The versions affected are 19.1.0, 19.1.1, 19.1.2 and 19.1.3.

 

I'll update this thread as soon as a fix or workaround is available. In the meantime, you'll have to use version 19.0.5 or earlier.

 

John.

 

Barbara_P_Intel
Moderator
620 Views

Regarding the status of this compiler bug, the compiler developer root caused the problem and is testing the solution.  As you know it is an inconsistent failure.  Expect a fix in a future release.

 

NikitaPavlov
Beginner
263 Views

I also tested the Elk code and found that this bug is fixed in Intel oneAPI release (ifort version 2021.1).