Solved: Skipping records in an unformatted sequential file with mixed record types

Jacek_J_ · ‎02-06-2016

Hello,

I am having trouble deploying the Telemac code (http://opentelemac.org) with Intel Fortran Compiler 16. All routines that use an old trick to skip lengthy records in an unformatted sequential file, in order to reach the place where one wants to read, produce a read error when compiled with an optimisation level higher than -O0. The trick is this: instead of reading thousands of numbers, read just one number from the record, let the next READ skip to the next record, and so on until you get where you want. This trick (a typical old Fortran way of doing things...) stopped working with optimised Intel 16.

I have written a short program containing the original Telemac routine (skipgeo) and a routine that is a simple workaround allocating large enough buffers (skipgeo_improved). The code works well with Intel Fortran 14 and with gfortran (gcc 4.8.4), but yields a read error with optimised Intel 16, caught by a wrapper routine (lit) for the Fortran READ. If you use -warn all -check all and/or -O0, everything goes well.
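For readers who have not seen the trick, here is a minimal sketch of the three ways of getting past a long record. It is not the Telemac code: the program name, the record length n and the scratch file are assumptions made purely for illustration. Variant (a) is the partial read used in skipgeo, variant (b) is an empty READ, and variant (c) reads the whole record into a large enough buffer, in the spirit of skipgeo_improved.

```fortran
program skip_demo
  implicit none
  integer, parameter :: n = 100000   ! hypothetical length of the "lengthy" record
  integer :: u, i, ios
  real :: big(n), first, marker
  real, allocatable :: buf(:)

  big = [(real(i), i = 1, n)]

  ! Scratch file: one long record followed by one short record.
  open (newunit=u, form='unformatted', access='sequential', &
        status='scratch', action='readwrite')
  write (u) big
  write (u) 42.0

  ! (a) The old trick: read one value; the rest of the record is skipped
  !     and the next READ starts at the following record.
  rewind (u)
  read (u, iostat=ios) first
  if (ios /= 0) error stop 'partial read failed'
  read (u, iostat=ios) marker
  if (ios /= 0 .or. marker /= 42.0) error stop 'wrong position after partial read'

  ! (b) A READ with an empty input list skips the whole record.
  rewind (u)
  read (u, iostat=ios)
  if (ios /= 0) error stop 'empty read failed'
  read (u, iostat=ios) marker
  if (ios /= 0 .or. marker /= 42.0) error stop 'wrong position after empty read'

  ! (c) Workaround: read the full record into a buffer that is large enough.
  allocate (buf(n))
  rewind (u)
  read (u, iostat=ios) buf
  if (ios /= 0) error stop 'full read failed'
  read (u, iostat=ios) marker
  if (ios /= 0 .or. marker /= 42.0) error stop 'wrong position after full read'

  close (u)
  print *, 'all three variants reached the short record'
end program skip_demo
```

All three variants should leave the file positioned at the short record; a run-time library with the problem described here would be expected to stop in variant (a).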

I wonder if this is not an optimisation bug. I have not found in the documentation anything about changes in unformatted sequential file treatment.

Please find included code and the file to be read (big endian).

Looking forward to your reactions, best regards,

jaj

mecej4 · ‎02-06-2016

Your program contains at least these errors: In subroutine SKIPGEO, arrays XBID, W and IBID are declared with actual size=1, but these arrays are passed as arguments to LIT, where these variables are declared to have size=NVAL, and NVAL may be as high as 72. Similarly for the character variable CBID. Reading more than one element of any of these arrays from the file, e.g., reading XBID(2), would cause array overrun and possibly cause memory corruption.

Whether this actually happens or not would be known only from a detailed examination of the program behavior. In general, optimized buggy code has unpredictable run time behavior.
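One way to rule out this kind of overrun is to let the wrapper see the real size of the actual argument. The following is only a sketch: lit_checked is a hypothetical variant of LIT with an assumed-shape dummy, not the Telemac routine, and the scratch file stands in for the geometry file.

```fortran
module lit_sketch
  implicit none
contains
  ! Hypothetical variant of LIT: the assumed-shape dummy w(:) carries the
  ! size of the actual argument, so an oversized request can be rejected.
  subroutine lit_checked(unit, w, nval, ios)
    integer, intent(in)    :: unit, nval
    real,    intent(inout) :: w(:)
    integer, intent(out)   :: ios
    if (nval < 0 .or. nval > size(w)) then
      ios = huge(ios)   ! positive value flags a caller error; the file is not touched
      return
    end if
    read (unit, iostat=ios) w(1:nval)
  end subroutine lit_checked
end module lit_sketch

program lit_demo
  use lit_sketch
  implicit none
  integer :: u, ios
  real :: w(1)

  open (newunit=u, form='unformatted', status='scratch', action='readwrite')
  write (u) 1.0, 2.0, 3.0
  rewind (u)

  call lit_checked(u, w, 72, ios)   ! rejected: w holds only one element
  if (ios == 0) error stop 'oversized request was not rejected'

  call lit_checked(u, w, 1, ios)    ! accepted: reads the first value of the record
  if (ios /= 0 .or. w(1) /= 1.0) error stop 'read failed'

  print *, 'w(1) =', w(1)
  close (u)
end program lit_demo
```

Because the module gives the routine an explicit interface, a size-1 actual array can never be silently treated as an array of NVAL elements.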

jimdempseyatthecove · ‎02-06-2016

It may be an optimization bug, possibly where the optimizer mistakenly removed what it thought was dead code. As a potential workaround and confirmation of this, try the following:

After you READ the one number, insert a statement

IF(ISNAN(TheOneNumber)) PRINT *,"Not supposed to happen"

This will ensure that the optimizer does not assume TheOneNumber (whatever its name) is never used, which would make the code that generates its value subject to removal.

Jim Dempsey

jimdempseyatthecove · ‎02-06-2016

mecej,

Thanks for examining the code for blatant errors. If the code still exhibits this symptom after it has been corrected, then he can try my diagnostic.

Jim Dempsey

Jacek_J_ · ‎02-06-2016

Hello,

@mecej4: Thank you for your interest. The routine skipgeo is not supposed to -read- the lengthy records (positions marked with (1) to (4)), but to -skip- them by reading just one number in a record and moving on to the next record when the next READ is executed. Hence the short declarations of the dummy arrays you mention. Please note that in all LIT calls for the records marked (1) to (4), NVAL=1. There are absolutely no memory overruns. Please note this is veteran legacy code, in service since ca. 1985...

Best regards, jaj

Jacek_J_ · ‎02-06-2016

Hello,

@Jim Dempsey. Please note that I am talking about Fortran READ statement errors. The record in LIT is read with

READ(CANAL,END=100,ERR=101)(W(J),J=1,NVAL)

so it jumps to the label given by ERR. Please run the code; you should get output like the following (on other platforms the error may already occur at (1)):

jaj@neo:~/prog/telemac/v6p3r2/work/litanie$ ./generr_intel
opening the geometry file
skipping geometry improved
(1)
(2)
(3)
(4)
skipping geometry original
(1)
(2)
(3)
(4)
LIT : ABNORMAL END OF FILE
ONE INTENDED TO READ
A RECORD OF 1 VALUES
OF TYPE : R4
ON LOGICAL UNIT : 10

PLANTE: PROGRAM STOPPED AFTER AN ERROR
2

Best regards, jaj

jimdempseyatthecove · ‎02-07-2016

Jacek,

The system I am using to inspect your program is Windows 7 Pro x64 with IVF V16 update 1.

The program reads the first header record, then experiences an EOF on the next read. In examining the contents of the geo_wesel.slf file using a hex dump it appears that the records were written using Big-endian format. When I change your open statement:

!*OPEN (inp, FILE='geo_wesel.slf', FORM='unformatted', STATUS='unknown', ACTION='read') 
  OPEN (inp, FILE='geo_wesel.slf', FORM='unformatted', STATUS='unknown', ACTION='read', CONVERT='BIG_ENDIAN')

The output becomes:

 opening the geometry file
 skipping geometry improved
 (1)
 (2)
 (3)
 (4)
 skipping geometry original
 (1)
 (2)
 (3)
 (4)
 closing the geometry file

Jim Dempsey

Jacek_J_ · ‎02-07-2016

Hello,

@Jim Dempsey : Thank you for your time. Yes, the file is big endian, as given in the compilation instructions in the comments at the very beginning of the code included:

! please note the file geo_wesel.slf is written in big endian
! it is a sequential unformatted file with records of mixed type
! called "Telemac Serafin format"
! => use export F_UFMTENDIAN=big for reading correctly!
! => or set appropriate compiler flags:
!
! ifort -convert big_endian generr.f90 -o generr_intel
! gfortran -fconvert=big-endian generr.f90 -o generr_gfortran
!
! (the error occurs as well when using little endian files)
!
! NOTICE: 
! ifort -warn all -check all -convert big_endian generr.f90 -o generr_intel
! ifort -O0 -convert big_endian generr.f90 -o generr_intel
! deliver correctly running executables... Optimisation problem?

Please note we encounter this problem when reading little endian files as well; conversion is not the solution and is not part of this problem. Reading with the wrong conversion also produces a READ error, but at the very beginning of the file and not after skipping longer records first.

Anyway, thank you for your interest... I would be very thankful if you could tell me what optimisation level you used when getting the result you quote. If it is higher than O0, then we know that the compiler for Windows may be optimising in a different way than the one for Linux. The compiler I use is:

ifort (IFORT) 16.0.1 20151021

Looking forward to your answer,

Best regards,

jaj

mecej4 · ‎02-07-2016

Jacek, I think that the problem may be closer to a Fortran RTL problem than an optimizer problem, in that I can reliably generate the premature end-of-file with the test code given below on your data file (sequential, big-endian, unformatted, variable record sizes), even with the /Od option on Windows, using the 32- and 64-bit 16.0.1 compilers. The bug is not seen with the 11.1.070 compiler, Lahey and Gfortran, all of which use the same unformatted file format.

The test code is a greatly stripped-down version of your code. The correct output should be:

 W =    3678.738
Normal end

but, because of the bug, the actual output is:

forrtl: severe (24): end-of-file during read, unit 10, file s:\lang\Jacek\geo_wesel.slf
Image              PC        Routine            Line        Source
libifcoremd.dll    5FF819A2  Unknown               Unknown  Unknown
libifcoremd.dll    5FFBE60F  Unknown               Unknown  Unknown
gen.exe            00341295  _GENERR_ip_SKIPGE          25  gen.f90
gen.exe            003410B8  _MAIN__                     7  gen.f90
...

PROGRAM generr

  IMPLICIT NONE

  OPEN (10, FILE='geo_wesel.slf', FORM='unformatted', STATUS='OLD', &
        ACTION='read', CONVERT='BIG_ENDIAN') 
  CALL skipgeo ()

  CLOSE(10) 
  STOP 'Normal end'

CONTAINS

   SUBROUTINE SKIPGEO ()

      REAL W(1)
      INTEGER IB(10),I
!
      REWIND 10
      do i=1,6
         read(10)
      end do
      CALL LIT(IB,1)
      read(10)w(1)
      read(10)w(1)
      write(*,*)'W = ',w(1)
      RETURN
   END SUBROUTINE SKIPGEO
      
   SUBROUTINE LIT (I, NVAL)
      
      INTEGER, INTENT(IN)             :: NVAL
      INTEGER, INTENT(INOUT)          :: I(NVAL)
!
      read(10)i(1:nval)
      return
   END SUBROUTINE LIT

END PROGRAM generr

Furthermore, replacing the call to LIT() by the equivalent line

     read(10)ib(1)

makes the bug disappear. Similarly, if I compile the source using the 16.0.1 compiler and then link the OBJ file to the runtime library of the 14.0.4.237 compiler, the bug goes away.

Jacek_J_ · ‎02-07-2016

Dear Mecej4,

thank you very much for your time and work. I can reproduce your results, erroneous with Intel 16.0.1 and correct with gfortran based on gcc 4.8.4 on my Linux laptop

Linux neo 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Sadly, if you suspect Fortran RTL, this is very bad news indeed. I've also had correct runs with Intel 14 in the past, so it seems that your workaround with compiling with Intel 16 and linking with 14 libraries is thoroughly consistent...

So what should I do now? Submit this bug to some appropriate "complaints booth" at Intel? Wait for a new compiler release with fingers crossed? ,^)

Confused, but thankful for your analysis, with best regards, jaj

mecej4 · ‎02-07-2016

Reporting the bug here should be sufficient. The Intel personnel usually respond within a day or two (not counting weekends). Since we now have a short reproducer, and reproducing the bug does not seem to be affected by the optimization level, it should be easy for them to see that there is a problem and file a bug report.

On the other hand, the fix for the bug may not become available until one or two compiler updates have been released.

mecej4 · ‎02-07-2016

Here is an even shorter reproducer, and a little-endian unformatted input file to go with it.

PROGRAM generr

  IMPLICIT NONE
  integer :: ib(1),wi
  OPEN (10, FILE='geo_wesel.lit', FORM='unformatted', STATUS='OLD', &
        ACTION='read') 
  REWIND 10
  call lit(ib,1)   ! replacing by "read(10)ib(1)" makes bug go away
  write(*,'(A8,2x,Z8)')'IB(1) = ',ib
  read(10)wi
  write(*,'(A8,2x,Z8)')'W = ',wi

CONTAINS      

   SUBROUTINE LIT (V, n)
      integer, intent(in) :: n
      INTEGER, INTENT(OUT)  :: V(n)

      read(10)v(1:n)

      return
   END SUBROUTINE LIT

END PROGRAM generr

IFort 14.0.4.237 (Windows) output:

IB(1) =    1000000
    W =   DE009B44

IFort 16.0.1 (Windows) output:

IB(1) =    1000000
    W =   6B22D045

Steven_L_Intel1 · ‎02-07-2016

Thanks for the smaller test cases. I will send this on to the developers.

Jacek_J_ · ‎02-08-2016

Hello,

@ Steve Lionel : Thank you for your interest and passing the case to the developers.

The very essence of this error is the different behaviour of READ depending on whether we read the written records completely, applying the correct array lengths, or just skip a record by reading one number and passing on to the next record. In the latter case we encounter READ errors or a premature(?) end-of-file. Note this has nothing to do with the endianness of the input file.

Please note as well that although the conclusions of mecej4 (thank you for your help) might be perfectly right (Fortran RTL?), his/her way of reading the provided Telemac input file might be confusing while searching for errors, because the behaviour might depend on the type of the variable to be read or on whether it is an array or not(?). The original input file structure description is given in comments in the routine skipgeo, and reading with exactly matching array lengths is implemented in skipgeo_improved.

Looking forward to the Intel developers' reactions, best regards, jaj

Steven_L_Intel1 · ‎02-08-2016

Escalated as issue DPD200381641. Another data point - If I build with the 15.0 compiler, I get an "internal error" in the run-time library.

Steven_L_Intel1 · ‎03-02-2016

I expect this to be fixed in Parallel Studio XE 2016 Update 3. The underlying problem was that the run-time library was incorrectly positioning the file when only part of a very long record was read, due to a bug in the buffering implementation.
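To make the positioning issue concrete, here is a small sketch of what the run-time library has to get right when it skips the unread part of a record. It assumes the common layout in which every record of an unformatted sequential file is framed by a 4-byte length marker before and after the data (the default in ifort and gfortran for records below 2 GiB); the program name and file name are hypothetical. The file is written sequentially and then reopened with stream access to walk the markers by hand.

```fortran
program walk_records
  use iso_fortran_env, only: int32
  implicit none
  integer :: u, ios, nrec, offset
  integer(int32) :: head, tail
  real :: payload(5000)

  payload = 1.0

  ! Write three records of different length as a sequential unformatted file.
  open (newunit=u, file='walk_demo.bin', form='unformatted', &
        access='sequential', status='replace', action='write')
  write (u) 7
  write (u) payload
  write (u) 3.5
  close (u)

  ! Reopen as a byte stream and hop from record to record.
  ! ASSUMPTION: 4-byte leading and trailing length markers.
  open (newunit=u, file='walk_demo.bin', form='unformatted', &
        access='stream', status='old', action='read')
  offset = 1          ! stream positions are 1-based
  nrec = 0
  do
    read (u, pos=offset, iostat=ios) head
    if (ios /= 0) exit                       ! end of file reached
    read (u, pos=offset + 4 + head, iostat=ios) tail
    if (ios /= 0 .or. tail /= head) error stop 'record markers do not match'
    nrec = nrec + 1
    print '(a,i0,a,i0,a)', 'record ', nrec, ': ', head, ' bytes'
    offset = offset + head + 8               ! skip data plus both markers
  end do
  close (u, status='delete')
end program walk_records
```

With default 4-byte INTEGER and REAL, the three records should be reported as 4, 20000 and 4 bytes. After a partial READ, the next record begins at the offset computed in the last line of the loop, regardless of how much of the record was actually transferred.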

Jacek_J_ · ‎03-18-2016

Dear Steve,

sorry for not answering, I was on strictly non-internet holidays... Thank you for solving the problem, and please pass on my thanks to the developers' team ,^) As far as I understand, it remains for me to wait for the new update... Could you tell me what approximate release date one can realistically assume?

Best(!) regards, Jacek

Steven_L_Intel1 · ‎03-18-2016

I think May 2016.

Jacek_J_ · ‎06-30-2016

Hello,

I am extremely sorry, but with Parallel Studio Update 3 the problem remains as before. Disappointed...

Best regards, Jacek

Steven_L_Intel1 · ‎06-30-2016

Indeed, it seems that part of the fix didn't get into 16.0.3. It is fixed in 17.0 Beta (I tried it) and should also be fixed in 16.0.4.