Unformatted File I/O

emr150 · 05-05-2008

I have a most perplexing problem with some code that we just released. We have a file that was written using the following code:

K = 9
OPEN ( UNIT=98, FILE='my.data', FORM='UNFORMATTED', STATUS='NEW')
WRITE ( 98 ) ( ( AN_ARRAY(I,J), I=0,K ), J=-1,15 )
CLOSE (UNIT=98, STATUS='KEEP')

...and read using the following code:

K = 9
OPEN ( UNIT=119, FILE='my.data', FORM='UNFORMATTED', STATUS='OLD', READONLY )
READ ( 119 ) ( ( AN_ARRAY(I,J), I=0,K ), J=-1,15 )

Relatively simple, and it has worked for us for years. However, one of our customers reported that they get a read error on their system when the READ statement executes:

forrtl: severe (22): input record too long, unit 119, file /some_path/my.data

The confusing part is that the same program and the same binary file (residing on an NFS mount) run fine on another system in the same lab, as well as on our machines here at the development lab.

Some configuration details:
Machine that fails:

RHEL 3, kernel 2.6.14.5, libc 2.3.2, gcc 3.2.3 20030502, Intel P4 (family 15, model 2, stepping 9) 2.80 GHz, ifort 8.1.029 20050702

Machine that works:

RHEL 4, kernel 2.6.9-42.EL, libc 2.3.4, gcc 3.4.6 20060404, Intel P4 (family 15, model 2, stepping 9) 3.00 GHz

It should be noted that this same code has previously worked on RHEL 3 -- I'm still trying to track down what the difference is. The code is compiled on the machine that fails -- the machine that works does not have the compiler installed.

Compile flags: -132 -nbs -align dcommons -static-libcxa -nus -zero -save -xN -axN -fp_port -c -O0 -prec_div -no_cpprt -fpstkchk -ccdefault fortran -fpe0 -convert native

Link flags: -static-libcxa -Wl,-d -Wl,--sort-common

We also tried compiling with the "-convert native" flag removed, with no effect.

Thanks in advance,
Eric

Steven_L_Intel1 · 05-05-2008

My guess is that there is an NFS problem. You don't need -convert native; that's the default.

Two things to try. First, look at the environment variables to make sure that the user has not set any of those that change Fortran unformatted data conversion. Second, ask the user to run "od -t x4" on the file and compare the output to what you see with the same file.
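
As a rough companion to the "od -t x4" check, the sketch below reads the first four bytes of the file and interprets them as both a little-endian and a big-endian integer. It assumes the usual ifort layout for sequential unformatted files, in which each record is framed by a 4-byte length marker, and it assumes AN_ARRAY holds 4-byte elements (the thread does not show its declaration). The function and constant names are illustrative only.

```python
import struct

# Elements written by the implied DO loop: I = 0..9 (10 values), J = -1..15 (17 values).
ELEMENT_COUNT = 10 * 17
# Assumption: AN_ARRAY holds 4-byte elements; the thread does not show its declaration.
ASSUMED_ELEMENT_BYTES = 4
EXPECTED_RECORD_BYTES = ELEMENT_COUNT * ASSUMED_ELEMENT_BYTES


def marker_views(marker):
    """Return the 4-byte marker read as (little-endian, big-endian) unsigned ints."""
    if len(marker) != 4:
        raise ValueError("a record marker is expected to be exactly 4 bytes")
    return struct.unpack("<I", marker)[0], struct.unpack(">I", marker)[0]


def describe_leading_marker(path):
    """Read the first 4 bytes of the file and describe both interpretations."""
    with open(path, "rb") as handle:
        little, big = marker_views(handle.read(4))
    lines = []
    for label, value in (("little-endian", little), ("big-endian", big)):
        verdict = "matches" if value == EXPECTED_RECORD_BYTES else "does not match"
        lines.append(
            f"{label}: {value} bytes ({verdict} the expected {EXPECTED_RECORD_BYTES})"
        )
    return lines
```

Under these assumptions, calling describe_leading_marker("my.data") on a file written on a little-endian machine should report a match only for the little-endian reading. A runtime told to treat the file as big-endian would see a very different record length, which would be consistent with an "input record too long" error.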

Ron_Green · 05-05-2008

Along the same lines as Steve's suggestion, a fast check: run 'md5sum' on the file on both systems. Do the checksums match?

I was also wondering whether the file was created and read on the same computer, or whether it had been moved between systems.
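
Where md5sum is not at hand, the same comparison can be sketched in a few lines of Python using the standard hashlib module. The helper names below are illustrative; the hex digest returned should be comparable to the first field that md5sum prints for the same file.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Hash the file at the given path."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


# Small in-memory demonstration: identical bytes give identical digests.
copy_a = io.BytesIO(b"same bytes")
copy_b = io.BytesIO(b"same bytes")
digests_match = md5_of_stream(copy_a) == md5_of_stream(copy_b)
```

Running md5_of_file on each system and comparing the two strings answers the question of whether the bytes are identical, though not whether both runtimes interpret those bytes the same way.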

emr150 · 05-05-2008

Regarding -convert native, we specified it explicitly as a possible fix, in case one of the other options was somehow overriding the default.

And while it could potentially be an NFS problem, one of the first things I did was run "md5sum my.data" on both systems and compare the hashes, which were identical. Even when the file is copied to a local path, the read still fails. I'm ready to resign myself to the fact that it's something system-related (that I unfortunately can't do anything about) rather than code- or compiler-related, but I figured I'd check in here first.

Another note, since I see another reply was posted: the file is created on our development machines here, and then distributed as a binary data file to the client. The md5sum of the file here and on both machines there match.

Thanks again,
Eric

Ron_Green · 05-05-2008

Eric,

I tested on a similar EL3 system, kernel 2.4.21-37, with the same gcc as the user. I believe this is RHEL3U6.

I didn't have the exact 8.1 build, so I used an older one and a newer one: 8.1.022 and 8.1.036. I can't reproduce the error; the code fragment runs just fine, as expected, with your options except -xN -axN, which my older compiler didn't like.

So like you, I am suspicious of the RHEL3 system. I don't have any older RHEL3 systems of that era to test against.

ron

Ron_Green · 05-06-2008

Eric,

One parting shot. I was stewing over this issue last night. We are sure the file is the same and uncorrupted. The error the user is seeing indicates that the read is attempting to fetch too much data. I think you mentioned that you release source code and allow the users to compile on their target platform. Well, if it isn't the data file, it must be the source code OR the system. Of the two, I suspect the source code. Ask the user to run md5sum on his sources.

Working in support, we see time and time again customers swearing that the code ran on X but doesn't run on Y. And they swear it's "the same code". After much probing we often find the customer changed something small in the code, claiming "it should not affect the results". So I'd be politely suspicious of your user. You might put a few print statements before the read and have him execute that code.

good hunting

ron

emr150 · 05-06-2008

Thanks to all for your help on this. I revisited Steve's original suggestion, started from scratch, and looked at everything related to format conversion. My original investigation turned up nothing, but on a second search today I found that the F_UFMTENDIAN variable was set to "big". As it turns out, another organization in a different room uses this same system over the network, and they were experimenting with turning this environment variable on and off in the global bashrc file. So my session that had been open on the one machine for days didn't exhibit the issue, while the new session I had opened on the compiling machine did (because it sourced the modified bashrc file). If I "unset F_UFMTENDIAN", all works well.

Apparently another case of the left hand not knowing what the right hand is doing.

Thanks again.
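
For anyone hitting the same symptom, a check like the one that resolved this thread can be sketched as a small script run in a fresh login session. F_UFMTENDIAN is the variable identified above; the FORT_CONVERT prefix is included on the assumption that the installed runtime also honors that family of variables. The function names are illustrative only.

```python
import os

# Variable-name prefixes to look for. F_UFMTENDIAN is the one found in this thread;
# FORT_CONVERT (as in FORT_CONVERTn) is an assumption about the installed runtime.
CONVERSION_PREFIXES = ("F_UFMTENDIAN", "FORT_CONVERT")


def conversion_settings(environ):
    """Return the entries of environ whose names start with a known prefix."""
    return {
        name: value
        for name, value in environ.items()
        if name.startswith(CONVERSION_PREFIXES)
    }


def report(environ=None):
    """Build human-readable lines describing any conversion-related variables."""
    found = conversion_settings(os.environ if environ is None else environ)
    if not found:
        return ["no unformatted-conversion variables are set in this session"]
    return [f"{name}={value}" for name, value in sorted(found.items())]
```

Printing the lines from report() in both an old and a newly opened session would show whether the two sessions see different settings, which is what distinguished the working and failing cases here.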