Intel® Fortran Compiler

Large file I/O (F90 with ifort) -- degradation with increased file size

mpbro
Beginner
I have an application that needs to do a lot of I/O on large files (>1 TB/run), so unsurprisingly I'm worried about speed. I wrote a test program that simply allocates an array of varying size, writes that array to /tmp, and reads it back. The program is attached below:

!-----------------------------------------------------------------------
program IOSpeed

  use system_time_mod   ! my own timing module (front-end for system_clock())
  use ifport            ! Intel portability library; provides system()

  implicit none

  integer :: l, n, n1, n2, n3, iostat, sysstat
  real :: fs
  type(timer) :: tr, tw
  real, dimension(:,:,:), allocatable :: data

  call system_time_init()

  !---------------------------------------------------------------------
  ! Loop over increasing file size
  !---------------------------------------------------------------------
  sysstat = system("touch /tmp/test")
  do l = 1, 10

    ! Array size doubles each pass (file sizes are powers of two,
    ! up to 2048 MB of 4-byte reals)
    n  = 512000*(2**l)
    n1 = 512
    n2 = 1000
    n3 = n/n1/n2

    allocate( data(n1,n2,n3) )
    sysstat = system("rm /tmp/test")
    open(1, file='/tmp/test', form='BINARY', status='NEW', iostat=iostat)

    ! Time a single write of the whole array
    call start_timer(tw)
    write(1) data
    call stop_timer(tw)

    rewind(1)

    ! Time a single read of the whole array
    call start_timer(tr)
    read(1) data
    call stop_timer(tr)

    ! File size in MB (4 bytes per default real)
    fs = (1.0*n*4)/1024000

    ! Report file size, write rate, read rate (MB/sec) to stderr
    write(0,*) fs, fs/tw%telapsed, fs/tr%telapsed

    close(1)
    deallocate(data)
  end do
  !---------------------------------------------------------------------

  call exit(0)

end program IOSpeed
!-----------------------------------------------------------------------

A few caveats about the code:
1) I'm using the Intel Fortran compiler, hence the use of ifport to enable system() calls.
2) I'm using a timing module of my own creation (just a front-end for system_clock()); a minimal stand-in appears below.
3) The BINARY form is not standard Fortran (to my knowledge).
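Here is that minimal stand-in for system_time_mod, so the test is self-contained. It is a sketch with the same interface, not my actual module, but per caveat 2 the real one is just a system_clock() front-end like this:

!-----------------------------------------------------------------------
! Minimal stand-in for system_time_mod (sketch; wraps system_clock()).
module system_time_mod
  implicit none
  integer :: count_rate = 0
  type timer
    integer :: c0 = 0           ! clock count at start_timer
    real    :: telapsed = 0.0   ! elapsed seconds at stop_timer
  end type timer
contains
  subroutine system_time_init()
    integer :: dummy
    call system_clock(dummy, count_rate)   ! query the clock rate once
  end subroutine system_time_init
  subroutine start_timer(t)
    type(timer), intent(inout) :: t
    call system_clock(t%c0)
  end subroutine start_timer
  subroutine stop_timer(t)
    type(timer), intent(inout) :: t
    integer :: c1
    call system_clock(c1)
    t%telapsed = real(c1 - t%c0)/real(count_rate)
  end subroutine stop_timer
end module system_time_mod
!-----------------------------------------------------------------------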

I'm using the following ifort flags:

-assume bscc -assume byterecl -fpp -mtune=pentium4 -O3 -static-intel -vms -w -WB -threads

If you run the code, it prints, for each file size, the file size followed by the write and read I/O speeds in MB/sec.

My machine:
1) Intel quad-core, 64-bit, 4 GB shared memory
2) Fedora Core 7
3) /tmp is ext3 (I converted another disk to ext2 (no journaling) and see the same behavior)

Here is a GNUPLOT figure of the performance, for three separate runs.

http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize.png

You will notice a fairly predictable performance pattern for each file size. The sampling along the x-axis is non-uniform; data points occur at powers of 2: 2 MB, 4 MB, ..., 2048 MB. Write performance declines precipitously at 256 MB, whereas read performance declines precipitously at 1024 MB.

I won't be surprised if this is a system issue, but just in case it's something that can be improved with compiler optimization or programming habits, I posted here.

Thank you for your time...

Morgan Brown
mpbro
Beginner
I discovered a curious fact: I re-ran this test on a machine with only 2 GB of memory (versus 4 GB) and found that the precipitous decline in read and write I/O rates comes at about half the file size on the 2 GB machine as compared with the 4 GB machine.

4 GB machine (replotted with lines and points):
http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize_brown01.png

2 GB machine:
http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize_brown02.png

Hmm....


TimP
Honored Contributor III
Did you find that buffered_io didn't help? If you have a RAID controller (possibly requiring the Windows driver), does that help?
mpbro
Beginner
I just recompiled with -assume buffered_io and it did not help.

I have a RAID5-enabled disk, and I notice that the base I/O rate is about double that of a regular local disk. But I do see the same dropoff.

Here is a thread on comp.lang.fortran where a poster has reproduced the behavior; thus it's probably not an ifort issue, per se.


oh_moose
Beginner
One effect you are seeing comes from the cache and the fact that its size is limited. Linux will cache some operations in memory, even ones that affect the integrity of the file system, but that depends on the file system and the options you use to mount the disk.

I presume you have a utility called blockdev on your Linux system.

Try blockdev --getra /dev/hda to get the current number of readahead sectors; it is probably set to 256. Increase it with blockdev --setra 16384 /dev/hda (replace the device name with yours).

Readahead will of course only improve read operations, although there are also plenty of read operations involved when you write a file (updates of the file system, allocation tables). Play with the number a bit. Try 8192 and 32768. But do not make the number too large. Optimize the number for each disk based on the application. File transfers from a disk with lots of large data benefit from a larger number, while the disk with the operating system works better with a smaller number.

Try to optimize the ratio of the buffer size in your Fortran READ and the readahead buffer size (maybe ratio 1:2).

If you use a RAID system, then check the number of blocks per group (see documentation of mkfs.ext3).

Try alternative journaling file systems (JFS, ReiserFS, maybe xfs).

For a real-world application you would probably want to use asynchronous I/O.
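For instance, something like this with the standard Fortran 2003 asynchronous I/O, if your ifort version supports it (untested sketch, reusing the unit and array from Morgan's test):

! Untested sketch: start the transfer, overlap it with other work,
! then WAIT until it completes (Fortran 2003 ASYNCHRONOUS specifier).
open(1, file='/tmp/test', form='UNFORMATTED', asynchronous='YES', &
     status='REPLACE', iostat=iostat)
write(1, asynchronous='YES') data   ! may return before the data hits disk
! ... do other computation here ...
wait(1)                             ! block until the pending write completes
close(1)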

Looking forward to seeing some new plots.


jimdempseyatthecove
Honored Contributor III

Morgan,

I suggest you experiment with RECL=nnn on the OPEN(s), then write/read stripes of the array, as opposed to using the default RECL and writing the whole array in "one" operation. Often the default values give less than stellar performance.
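For example (untested sketch: direct access, one n1 x n2 plane per record; RECL counts bytes here because the test builds with -assume byterecl, otherwise ifort counts 4-byte units):

! Untested sketch of striped direct-access I/O.
! (k is a new loop index; add "integer :: k" to the declarations.)
open(1, file='/tmp/test', form='UNFORMATTED', access='DIRECT', &
     recl=n1*n2*4, status='REPLACE', iostat=iostat)
do k = 1, n3
   write(1, rec=k) data(:,:,k)   ! write one plane per record
end do
do k = 1, n3
   read(1, rec=k) data(:,:,k)    ! read it back the same way
end do
close(1)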

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Morgan,

A second hint: extending a file with each sequential write often takes much longer than computing the final file size, writing a "junk" last record, rewinding, then writing into preallocated disk space.
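With the direct-access variant above, that could look like this (untested sketch; note that on ext3 seeking past the end may produce a sparse file rather than truly allocated blocks, so this needs testing):

! Untested sketch: write the last record first so the file reaches its
! final size before the main loop, then fill in the real data.
open(1, file='/tmp/test', form='UNFORMATTED', access='DIRECT', &
     recl=n1*n2*4, status='REPLACE', iostat=iostat)
write(1, rec=n3) data(:,:,n3)    ! extend the file to its final size
do k = 1, n3
   write(1, rec=k) data(:,:,k)
end do
close(1)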

Jim Dempsey

oh_moose
Beginner
How about the OPEN option INITIALSIZE? The VMS Fortran documentation says "VMS only". Intel Fortran for Linux does not complain about this option, but I cannot tell whether it is a dummy on Linux or not.
(On VMS I also find EXTENDSIZE quite useful.)


Steven_L_Intel1
Employee
INITIALSIZE and EXTENDSIZE seem to be ignored in Intel Fortran. These are tied to RMS options that don't exist on Windows and Linux. I see that they are silently accepted but not documented.
oh_moose
Beginner
Linux does support posix_fallocate and fallocate. If Morgan's performance problem is related to block allocation (this needs to be tested), and if those routines improve performance for large files, then it would be nice to have these two features in the Linux version of the compiler. Even if the final file size is not known at the time of the Fortran OPEN call, a large value for EXTENDSIZE could help for large files. But I assume some extra code would be required to set the actual file size at the Fortran CLOSE call, since Linux does not keep the actual and allocated file sizes separate (VMS does).
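In the meantime one can call posix_fallocate directly from Fortran via ISO_C_BINDING. An untested sketch (the O_* constants and off_t = c_long assume 64-bit Linux; check your system headers):

! Untested sketch: preallocate /tmp/test with posix_fallocate, then
! open it from Fortran as usual and write into already-allocated space.
program prealloc_test
   use iso_c_binding
   implicit none
   interface
      function c_open(path, flags, mode) bind(c, name="open")
         import :: c_int, c_char
         character(kind=c_char), dimension(*), intent(in) :: path
         integer(c_int), value :: flags, mode
         integer(c_int) :: c_open
      end function c_open
      function posix_fallocate(fd, offset, len) bind(c, name="posix_fallocate")
         import :: c_int, c_long
         integer(c_int), value :: fd
         integer(c_long), value :: offset, len
         integer(c_int) :: posix_fallocate
      end function posix_fallocate
      function c_close(fd) bind(c, name="close")
         import :: c_int
         integer(c_int), value :: fd
         integer(c_int) :: c_close
      end function c_close
   end interface
   integer(c_int), parameter :: O_WRONLY = 1, O_CREAT = 64    ! Linux values
   integer(c_int) :: fd, rc

   fd = c_open("/tmp/test"//c_null_char, ior(O_WRONLY, O_CREAT), 420_c_int) ! mode 0644
   rc = posix_fallocate(fd, 0_c_long, 2147483648_c_long)      ! reserve 2 GB
   rc = c_close(fd)
end program prealloc_test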
