Hi!
I am encountering an I/O performance drop with Intel Fortran compiler version 13 and onward. The MPI code performs a simple read from the mesh file in the following way:
READ(IUNIT,IOSTAT=IOSTAT) &
((FOO,ISIZ=1,NS),IDIM=1,DL-1), &
((FOO,ISIZ=1,SL-1), &
(DATA(ISIZ-SL+1,IDIM-DL+1),ISIZ=SL,SU), &
(FOO,ISIZ=SU+1,NS), &
IDIM=DL,DU), &
((FOO,ISIZ=1,NS),IDIM=DU+1,ND)
where FOO and DATA(:,:) are real variables and NS, DL, SL, ... are predefined parameters. What this really does is read the entire data file and store the subzone assigned to that specific processor. NS and ND are therefore globally fixed, while SL, SU, DL, DU differ from processor to processor. All processors access the same file. Ideally, the entire read operation should complete in a single system call.
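For anyone who wants to experiment with this pattern outside the MPI code, here is a minimal, self-contained sketch using made-up sizes (NS=8, ND=8, and a subzone SL=3..6, DL=3..6 standing in for one rank's partition); it writes a sample unformatted file first so the READ has something to consume:

```fortran
PROGRAM SUBZONE_READ
  IMPLICIT NONE
  INTEGER, PARAMETER :: NS = 8, ND = 8              ! global extents (assumed)
  INTEGER, PARAMETER :: SL = 3, SU = 6, DL = 3, DU = 6  ! this rank's subzone (assumed)
  REAL(8) :: FOO, DATA(SU-SL+1, DU-DL+1), FULL(NS, ND)
  INTEGER :: ISIZ, IDIM, IUNIT, IOSTAT

  ! Write a sample unformatted file to read back.
  FULL = RESHAPE([(REAL(ISIZ, 8), ISIZ = 1, NS*ND)], [NS, ND])
  OPEN(NEWUNIT=IUNIT, FILE='mesh.dat', FORM='UNFORMATTED', &
       ACCESS='SEQUENTIAL', STATUS='REPLACE', ACTION='WRITE')
  WRITE(IUNIT) FULL
  CLOSE(IUNIT)

  ! Read the whole record; FOO absorbs the skipped values,
  ! DATA keeps only this rank's subzone.
  OPEN(NEWUNIT=IUNIT, FILE='mesh.dat', FORM='UNFORMATTED', &
       ACCESS='SEQUENTIAL', STATUS='OLD', ACTION='READ')
  READ(IUNIT, IOSTAT=IOSTAT) &
       ((FOO, ISIZ=1,NS), IDIM=1,DL-1), &
       ((FOO, ISIZ=1,SL-1), &
        (DATA(ISIZ-SL+1, IDIM-DL+1), ISIZ=SL,SU), &
        (FOO, ISIZ=SU+1,NS), &
        IDIM=DL,DU), &
       ((FOO, ISIZ=1,NS), IDIM=DU+1,ND)
  CLOSE(IUNIT)

  PRINT *, DATA(1,1)   ! equals FULL(SL,DL): the subzone's first element
END PROGRAM
```

The implied-do loops account for exactly NS*ND values, matching the record written, so the READ consumes the record completely.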
The symptom is that read performance is very poor when the code is compiled with Intel Fortran 13 or a higher version, while the overhead was negligible when I compiled with Intel version 12. As I found from strace, this READ statement fetched the entire dataset with a single read operation under Intel 12, as desired. With Intel 13 and higher, on the other hand, the runtime reads every 8-byte real value separately. I wonder whether a run-time option related to the size of the read buffer (or something similar) was introduced in Intel compiler version 13.
Has anyone experienced the same trouble, or does anyone know a remedy?
Thank you in advance.
- Jeff
Do you set buffered_io?
Adding the "-assume buffered_io" flag at compile time solved the problem. Thanks, Tim, for your advice. I will leave a short summary in the next post for anyone who experiences the same issue.
By the way, was there any change to the I/O handling between Intel Fortran 12 and 13?
Best regards,
Jeff
Issue:
The file read operation slowed down drastically with Intel Fortran 13 and higher (in comparison to Intel Fortran 12).
Read Operation in the Code:
The file open statement in my MPI code is as follows:
OPEN(UNIT=IUNIT,FILE=DSET%FNAME,STATUS='OLD',ACTION='READ',&
ACCESS='SEQUENTIAL',FORM='UNFORMATTED',IOSTAT=IOSTAT)
The code attempts to read the entire file with a single READ statement:
READ(IUNIT,IOSTAT=IOSTAT) &
((FOO,ISIZ=1,NS),IDIM=1,DL-1), &
((FOO,ISIZ=1,SL-1), &
(DATA(ISIZ-SL+1,IDIM-DL+1),ISIZ=SL,SU), &
(FOO,ISIZ=SU+1,NS), &
IDIM=DL,DU), &
((FOO,ISIZ=1,NS),IDIM=DU+1,ND)
where FOO and DATA(:,:) are real variables and NS, DL, SL, ... are predefined parameters. Each MPI process accesses the same file and stores its valid data set in DATA(:,:), while unneeded data are read into the FOO variable (and thus ignored). The total read size (determined by NS and ND in the read statement above) is fixed across processors (i.e., all processors read the entire file but keep only their own subset). SL, SU, DL, DU define the starting and ending indices of the partition assigned to each processor and thus differ from processor to processor.
Symptom:
Tracing with the strace command reveals that the entire dataset is read with a single system call under Intel 12. With Intel 13 and higher, on the other hand, the runtime reads every double-precision value separately.
Solution:
Following the comment by Tim Prince and the web article (http://kiwi.atmos.colostate.edu/rr/tidbits/intel/macintel/doc_files/source/extfile/optaps_for/fortran/optaps_prg_io_f.htm), I recompiled the code with the buffered_io option ("-assume buffered_io"). The change in total run time (most of which is spent on the file read) is:
At Intel 12 (without buffered_io option): 0m7.442s
At Intel 15 (without buffered_io option): 3m15.751s
At Intel 15 (with buffered_io option): 0m5.283s
The values will vary with file size, number of MPI processes, file system, etc., but the result above clearly demonstrates that I/O performance can improve significantly with the buffered I/O option turned on (from Intel 13 and above).
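For completeness, Intel Fortran also exposes buffering controls other than the compile-time flag; as I understand it from the documentation, the BUFFERED='YES' specifier on OPEN (an Intel extension, not standard Fortran) requests buffering for a single unit, and the FORT_BUFFERED environment variable can enable it at run time. A sketch of the per-unit form, applied to the OPEN statement above:

```fortran
! Intel-specific extension; check the documentation for your
! compiler version before relying on it.
OPEN(UNIT=IUNIT, FILE=DSET%FNAME, STATUS='OLD', ACTION='READ', &
     ACCESS='SEQUENTIAL', FORM='UNFORMATTED', BUFFERED='YES', &
     IOSTAT=IOSTAT)
```

The per-unit form may be preferable when only one heavily read unit needs buffering and you do not want to change the behavior of every unit in the program.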
Conclusion:
If you experience a slowdown in I/O operations, try enabling the buffered I/O option and measure the change in I/O performance.
We have made several changes to the way I/O buffering is done over the past couple of versions. It's a tradeoff between performance and memory use. The release notes went into quite a bit of detail on this - I suggest you read them.
