Open and Read BIG-endian written unformatted binary data on LITTLE-endian machine

mad-matts · ‎02-09-2010

Hello,

I am trying to read a binary data file that was written as FORTRAN 77 unformatted (and, I guess, sequential) on a 64-bit BIG-endian machine as follows:

First record: Header consisting of four 8-byte reals followed by three 4-byte integers. Like every record, the header is preceded by 4 bytes and followed by 4 bytes.

Following records: 3*384*384 reals (8-byte), preceded and followed by 4 bytes

3*384*384 reals (8-byte), preceded and followed by 4 bytes

and so on

In total there are 258 records, the first one is the header, the following ones as described above.

However, the machine on which I am now trying to read the file is a 64-bit LITTLE-endian machine (SGI Altix), and I think I am not opening or reading the file correctly.

The lines

open(20,file='TestFile.bin',status='old',form='unformatted',access='direct',recl=4)
read(20,rec=1) head

where head is a variable containing four 8-byte reals and three 4-byte integers, gave numbers I did not expect at all, while

open(30,file='TestFile.bin',status='old',form='unformatted')
read(30) header

resulted in a run-time error.

Unfortunately, I also found the following on the web (http://wapedia.mobi/en/Endianness#6.):

"Fortran sequential unformatted files created with one endianness usually cannot be read on a system using the other endianness because Fortran usually implements a record (defined as the data written by a single Fortran statement) as data preceded and succeeded by count fields, which are integers equal to the number of bytes in the data. An attempt to read such file on a system of the other endianness then results in a run-time error, because the count fields are incorrect."

Is there any possibility to read a binary data file with the properties described above on a little endian machine anyway ?

Thanks

Steven_L_Intel1 · ‎02-09-2010

There are two issues you need to address.

The first is the big-endian record length and values. Add to your OPEN:

,CONVERT="BIG_ENDIAN"

This will tell Intel Fortran to read the record lengths and the data big-endian. Unfortunately, you are reading into a RECORD or derived type variable "head", which disables the data conversion (the record length will be handled properly). To deal with this, you need to replace "head" with the names of scalar or array variables of the proper type. This can even be the components of "head". So let's say that HEAD has two components, the REAL*8 array R and the INTEGER*4 array I. You could write:

READ (30) head%r,head%i

For the rest of the data, read arrays and not record/derived-type variables.

There is no need to try to fake it with direct access.

mad-matts · ‎02-16-2010

Thank you !

I think I got it running.

However, I got another question:

Let the data file content look as follows:

XXXX hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh hhhh hhhh hhhh XXXX

XXXX aaaaaaaa bbbbbbbb cccccccc dddddddd ......... zzzzzzzz XXXX

XXXX aaaaaaaa bbbbbbbb cccccccc dddddddd .......... zzzzzzzz XXXX

.

XXXX aaaaaaaa bbbbbbbb cccccccc dddddddd .......... zzzzzzzz XXXX

Now I am interested in all the cccccccc values, and ONLY in those.

Is there a possibility of reading only all cccccccc in FORTRAN without reading the whole data file line by line (since it is very big) ? This is possible for example in MATLAB. However I didn't find a possibility so far to do that in FORTRAN.

Thanks for any hints !

Steven_L_Intel1 · ‎02-16-2010

I can't think of a way to do this in Fortran other than reading record by record.

Hirchert__Kurt_W · ‎02-16-2010

It seems unlikely to me that any language reads all the ccccccc data without reading the entire file, although some may syntactically hide this fact, and many (including Fortran) allow you to immediately discard all the data other than the ccccccc data.

You mention reading "line by line", a phrase that normally applies to textual (i.e., formatted) files. If it really is a formatted file, it would be possible to write a format such that a single READ statement could read the entire file extracting only the ccccccc data.

However, since this is a follow-up to your earlier question about unformatted files, my assumption is that you really were asking about reading the file "record by record". This is necessary, but it doesn't have to be particularly expensive or hard. The reading part of your code could look something like the following:

[fortran]read(u)   ! skip the header record
do i=1,n  ! assuming you know the size of ccccccc
  read(u) (dummyvar,j=1,2),ccccccc(i)
end do[/fortran]

As you can see, you don't need to explicitly read data that appears in the record after the data you want, so in the header record (where I assumed you wanted nothing) you can use an empty list. You do have to explicitly read the data in the record before the data you want, but you read it into variables you use over and over again, so you don't have to waste significant memory on the unwanted data. [In your case, you do spend a few extra CPU cycles converting that leading data from big-endian to little-endian, but that is a relatively cheap operation, so I don't think that will significantly affect your performance.] I used an implied-DO for the data to be skipped to make it easy to switch which data you extract, but if you know you will always want the third number in each record, you could avoid the loop with something like "read(u) dummyvar,dummyvar,ccccccc(i)".

mad-matts · ‎02-17-2010

OK.

Thanks a lot for the information !!!

(and of course "line by line" means "record by record")

One problem is still remaining:

The prefix and postfix of each unformatted record in the data file consist of 4 bytes containing the length of the record.

However, the machine on which I now need to read the data seems to expect 8-byte pre- and postfixes for each record (at least that is what it produces when I write an unformatted file on that machine), and thus it does not seem to be able to read the file using sequential access. Direct access is not possible since the header record length differs from the data record length.

How can I address this problem ?

Steven_L_Intel1 · ‎02-17-2010

The other machine must be using g77. Intel Fortran doesn't have an option to use 8-byte record lengths. (It handles records larger than 2GB in another way, which gfortran also implements.)

You could use ACCESS='STREAM' and read the record lengths yourself with a file opened 'BIG_ENDIAN'. You'd then have to process the beginning and ending record lengths properly.
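As a rough sketch of that idea, the program below walks a file record by record with stream access, reading the leading and trailing 4-byte lengths itself and checking that they agree. The file name, unit number and sample contents are invented for illustration; the sample is written with explicit 4-byte big-endian lengths so that the result does not depend on which record-marker size the compiler uses for sequential files. Stream access and POS= are Fortran 2003 features, so an older compiler may reject them.

```fortran
program walk_records_stream
  use, intrinsic :: iso_fortran_env, only: real64, int32, int64
  implicit none
  integer, parameter :: u = 40
  integer(int32) :: len_before, len_after
  integer(int64) :: p
  integer :: ios, nrec
  real(real64)   :: hr(4) = [1.0_real64, 2.0_real64, 3.0_real64, 4.0_real64]
  integer(int32) :: hi(3) = [384_int32, 257_int32, 390_int32]

  ! Build a sample: a 44-byte header record and one 24-byte data record,
  ! each wrapped in explicit 4-byte big-endian length fields.
  open(u, file='sample_stream.bin', form='unformatted', access='stream', &
       status='replace', convert='big_endian')
  write(u) 44_int32, hr, hi, 44_int32
  write(u) 24_int32, [5.0_real64, 6.0_real64, 7.0_real64], 24_int32
  close(u)

  open(u, file='sample_stream.bin', form='unformatted', access='stream', &
       status='old', convert='big_endian')
  p = 1          ! stream positions are 1-based byte offsets
  nrec = 0
  do
    read(u, pos=p, iostat=ios) len_before
    if (ios /= 0) exit                          ! end of file
    read(u, pos=p + 4 + len_before) len_after   ! trailing length field
    if (len_after /= len_before) stop 'record lengths disagree'
    nrec = nrec + 1
    print *, 'record', nrec, 'starts at byte', p + 4, 'length', len_before
    p = p + 4 + len_before + 4                  ! jump to the next record
  end do
  close(u)
end program walk_records_stream
```

The payload itself is never read here; once the start of a record is known, an ordinary READ with POS= can pick out whatever part of it is needed.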

mad-matts · ‎02-17-2010

Thanks.

That's exactly what I tried.

However, on my local machine it is working, but on the parallel machine I need to process the data on, I get a compile time error. Maybe the non-standard stream access is unknown there (?)

When using

open(file_id,..... access='stream',convert='BIG_ENDIAN')

read(file_id,POS=5) head0

to start reading data at the 5th byte, the compiler complained about a syntax error at the POS in the read statement (lower case does not change anything).

It didn't work with gfortran, ifort, or g77 (of course). I also tried both -assume byterecl and -assume nobyterecl as compiler flags.

On my local machine I didn't get any problems with this syntax using gfortran....

mad-matts · ‎02-17-2010

and concerning the pre- and postfixes:

I wrote an unformatted sequential sample file (both little endian) both on my local MacBook and on the SGI Altix. However, if I compare the header in hex format I get the following:

MacBook

0000000 002c 0000 0000 0000 c345 4002 0000 0000
0000016 4000 409f 0000 0000 0000 3ff0 0000 0000
0000032 0000 4000 0180 0000 0101 0000 0186 0000
0000048 002c 0000
0000052

SGI Altix

0000000 002c 0000 0000 0000 0000 0000 c345 4002
0000016 0000 0000 4000 409f 0000 0000 0000 3ff0
0000032 0000 0000 0000 4000 0180 0000 0101 0000
0000048 0186 0000 002c 0000 0000 0000
0000060

As can be seen, the Altix file has 4 extra bytes in the prefix (bytes 5-8: 0000 0000) and in the postfix (bytes 57-60: 0000 0000).

I used gfortran to compile the fortran program that created the sample file.

TimP · ‎02-17-2010

Perhaps you could use a current version of gfortran to read the file and write out a copy in their current format.

mad-matts · ‎02-17-2010

I would like to avoid doing stuff anywhere else than on the super computer since one data file is already about 3.3GB and the whole set of data is around 3TB.

jimdempseyatthecove · ‎02-18-2010

Presumably at some point in time files may ingress or egress from your system. As an earlier post illustrated, the unformatted record separators (record byte counts) could be 4 or 8 bytes (other sizes, e.g. 2 or 16?). In other words, neither the size of the record byte count field nor its endianness is known in advance.

Therefore, open the file in stream mode once.
Read 4 bytes
Determine if little or big endian
Fix byte count (assuming 4 byte and .lt. 4GB records)
Advance assumed byte count over assumed record size
Read assumed trailing byte count
See if matches original 4 bytes from record
if not, read 2 additional bytes
discard (shift out) 1st two of second assumed header while shifting in the 2 new bytes
...

i.e. locate 2nd record byte count field using 1st byte count field

The separation of these two byte count fields, minus the (swapped if necessary) value of the byte count, will tell you the byte size of the byte count field.

Now you have a) size of byte count field, and b) indication if byte count field is little or big endian, and c) a sanity check pattern (unswapped byte count) for record length verification.

With this information you should be able to import files from anywhere.
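A simplified variant of this procedure can be sketched as follows. Instead of shifting in two bytes at a time, it simply tries the four combinations of 4- or 8-byte count fields and little- or big-endian order, and accepts the first one for which the trailing count field matches the leading one. The file name, unit number and helper function decode are made up for illustration, and a chance match is possible in principle, so treat the result as a strong hint and not as proof.

```fortran
program detect_record_markers
  use, intrinsic :: iso_fortran_env, only: int8, int32, int64, real64
  implicit none
  integer, parameter :: u = 50
  integer(int8)  :: head(8), tail(8)
  integer(int64) :: n, fsize
  integer :: msize, e, ios
  logical :: big, found

  ! Sample: one 24-byte record with 4-byte big-endian count fields.
  open(u, file='sample_detect.bin', form='unformatted', access='stream', &
       status='replace', convert='big_endian')
  write(u) 24_int32, [1.0_real64, 2.0_real64, 3.0_real64], 24_int32
  close(u)

  ! Reopen WITHOUT conversion and look at the raw bytes.
  open(u, file='sample_detect.bin', form='unformatted', access='stream', &
       status='old')
  inquire(unit=u, size=fsize)
  read(u, pos=1) head
  found = .false.
  search: do msize = 4, 8, 4
    do e = 0, 1
      big = (e == 1)
      n = decode(head(1:msize), big)
      if (n < 0 .or. 2*msize + n > fsize) cycle        ! implausible length
      read(u, pos=1_int64 + msize + n, iostat=ios) tail(1:msize)
      if (ios /= 0) cycle
      if (all(tail(1:msize) == head(1:msize))) then    ! sanity check passed
        found = .true.
        exit search
      end if
    end do
  end do search
  close(u)

  if (found) then
    print *, 'count field bytes:', msize, ' big-endian:', big, ' length:', n
  else
    print *, 'no consistent record layout found'
  end if

contains

  ! Assemble an unsigned byte sequence into a 64-bit integer.
  pure function decode(b, big_endian) result(v)
    integer(int8), intent(in) :: b(:)
    logical, intent(in) :: big_endian
    integer(int64) :: v
    integer :: k, idx
    v = 0
    do k = 1, size(b)
      idx = merge(k, size(b) - k + 1, big_endian)   ! most significant byte first
      v = ior(ishft(v, 8), iand(int(b(idx), int64), 255_int64))
    end do
  end function decode
end program detect_record_markers
```

For the sample file this should settle on 4-byte, big-endian count fields with a first record length of 24.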

After you have this working for this particular database, you might consider generalizing the program such that you supply not only an input file spec and output file spec (unless reading into RAM), but also the name of a file containing what looks like the contents of a stripped-down user-defined type (of your design). Example:

"4REAL(8), 4INTEGER(4)"

Or some other abstract description of your design

"REAL(8)::(1000:2000), INTEGER(4)::(4)"

Jim Dempsey

mad-matts · ‎02-19-2010

I finally got it:

open(fid ..... access='stream', convert='BIG_ENDIAN')

read(fid,POS=X)

does what I want it to do. However, it didn't work with gcc version 4.1.2, which was the default on the machine I use. It does work with gcc version 4.3.4.
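For anyone finding this later, here is a minimal sketch of how the position can be computed when only the third value of each data record is wanted. The sizes are scaled down and the file name, unit number and values are invented; for the real file nvals would be 3*384*384 and nrec would be 257, assuming 4-byte record length fields throughout.

```fortran
program read_third_value
  use, intrinsic :: iso_fortran_env, only: real64, int32, int64
  implicit none
  integer, parameter :: u = 60
  integer, parameter :: marker = 4               ! bytes per record length field
  integer, parameter :: head_bytes = 4*8 + 3*4   ! four reals + three integers = 44
  integer, parameter :: nvals = 5                ! values per data record (sample)
  integer, parameter :: nrec = 3                 ! number of data records (sample)
  integer(int64) :: p, rec_bytes
  real(real64)   :: c(nrec), vals(nvals)
  integer :: k, j

  ! Build a small big-endian sample with explicit 4-byte length fields.
  open(u, file='sample_pos.bin', form='unformatted', access='stream', &
       status='replace', convert='big_endian')
  write(u) int(head_bytes, int32), [(real(j, real64), j = 1, 4)], &
           [1_int32, 2_int32, 3_int32], int(head_bytes, int32)
  do k = 1, nrec
    vals = [(real(10*k + j, real64), j = 1, nvals)]
    write(u) int(8*nvals, int32), vals, int(8*nvals, int32)
  end do
  close(u)

  ! Jump straight to the third 8-byte value of each data record.
  rec_bytes = marker + 8_int64*nvals + marker
  open(u, file='sample_pos.bin', form='unformatted', access='stream', &
       status='old', convert='big_endian')
  do k = 1, nrec
    p = 1 + (marker + head_bytes + marker) &   ! skip the header record
          + (k - 1)*rec_bytes                  ! skip earlier data records
    p = p + marker + 2*8                       ! skip length field and two values
    read(u, pos=p) c(k)
  end do
  close(u)

  print *, c   ! third value of each record: 13, 23, 33
end program read_third_value
```

Keeping the position in a 64-bit integer matters for the real data, since byte offsets in a file of several GB do not fit in a default 32-bit integer.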

mad-matts · ‎02-19-2010

Thanks for your help !!