Unformatted sequential read - slow performance

John_Sturton · ‎10-08-2009

I have a large (600MB) unformatted sequential file created by Microsoft Power Station V4.00. Reading this file back with Intel Fortran is around four times slower than reading the same file with Microsoft Power Station V4.00.

The source code:

open(10,file=c_FileName,form='unformatted',status='old')
read(10,end=101,err=102)r_Radius,i_Direction
i_Count = 0
do 10
read(10,end=101,err=102)r_X,r_Y,r_Z
i_Count = i_Count+1
if (mod(i_Count,1000000) .eq. 0) then
print *,i_Count,r_X,r_Y,r_Z
endif
10 continue

The compile line:

ifort -assume:byterecl -integer-size:32 -iface:cvf -warn:argument_checking -warn:truncated_source -fpscomp:ioformat -fp:source -MD -assume:buffered_io read_ud.f

So my program prints to the console every 100000 reads. What is "odd" is that if I interrupt the program with Ctrl-Break, I get a Windows "Application Error" dialog displayed, but the program continues to run in the background, and the program print statements are now generated around 4 times more rapidly.

So I guess my questions are:

1) Is this slow unformatted read performance a known issue?

2) Are there ways to improve it?

3) How come the program suddenly accelerates when I hit Ctrl-Break?

Cheers,

John Sturton

TimP · ‎10-08-2009

As unformatted files aren't portable, I'm not entirely surprised about strange behaviors. Do you know whether your problem is associated with the print statement; for example, if you redirect to a file, does it speed up? If not, have you tried making an updated copy of the file written by a recent ifort? If necessary, you might write a simple program to read the old file and write out a new one.

John_Sturton · ‎10-08-2009

Thanks for your suggestions. However,

1) Redirecting the prints does not speed up the execution

2) Running the program on an updated copy of the original data file (created by reading the original and writing a new file) does not speed up the execution.

BTW, the compiler version is:

Intel Visual Fortran Compiler Professional for applications running on IA-32, Version 11.1 Build 20090511 Package ID: w_cprof_p_11.1.035

Ron_Green · ‎10-08-2009

Interesting. Now you are doing the timing on the same computer, right? There are big differences introduced by the OS, because ultimately the read/write calls go down to operating system service calls. So Vista may be quite a different beast than Win 98. And the disk controller logic will of course differ between machines, as could the disks themselves.

ron

Ron_Green · ‎10-08-2009

After thinking about this a bit more, I have a few other comments.

First, you are reading a non-native file format in ifort. 'unformatted' data files have the data stored in binary format for sure, however there are record markers and file marks that need to delineate the data. There is no standard for these marks, so each compiler has its own 'native' record and file marks.
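
To make the record markers concrete, here is a minimal sketch that writes one small sequential record and then re-opens the same file with Fortran 2003 stream access to look at the raw bytes. It assumes the compiler brackets each record with a 4-byte length field before and after the payload (the usual native layout, but not guaranteed by the standard) and that default reals are 4 bytes. The file name markers_demo.dat is made up for the demo.

```fortran
program show_markers
  implicit none
  integer, parameter :: i4 = selected_int_kind(9)   ! 4-byte integer kind
  integer(i4) :: i_Head, i_Tail
  real        :: r_X, r_Y, r_Z

  ! Write a single record holding three default reals.
  open(10, file='markers_demo.dat', form='unformatted', status='replace')
  write(10) 1.0, 2.0, 3.0
  close(10)

  ! Re-open the same file as a raw byte stream (Fortran 2003 feature).
  open(10, file='markers_demo.dat', form='unformatted', access='stream', &
       status='old')
  ! ASSUMPTION: layout is [4-byte length][payload][4-byte length].
  read(10) i_Head, r_X, r_Y, r_Z, i_Tail
  close(10, status='delete')

  print *, 'leading marker :', i_Head
  print *, 'payload        :', r_X, r_Y, r_Z
  print *, 'trailing marker:', i_Tail
end program show_markers
```

If that assumption holds, both markers hold the payload length in bytes, so a record of three reals carries 8 bytes of bookkeeping for 12 bytes of data, and the runtime has to process it on every read.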

The data file you are reading was created by another compiler and thus does not use the same storage format as that used by ifort. You are using -fpscomp so that ifort can read this non-native file. So you've added this compatibility layer of software that is slowing down the reads (and writes). And you're reading in small chunks, which means you're going through this extra layer of software translation for each read - and you have a boatload of reads going on. You could amortize this cost with large block reads if the data was laid out differently in the file. The extra software overhead would diminish to noise if the data was written as one record and you re-read that as one record, for example.

Also, you have that whole mod() and print going on in the loop. Why don't you yank that test/print code out of there and put a call to CPU_TIME() before and after the loop, to get a clean timing on just the read time without also including all the if tests, the calls to MOD() and potentially the writes?
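
A minimal sketch of that timing harness, assuming the same file layout as the original program (one header record, then one record per X/Y/Z triple). The file name points.dat is a placeholder, and SYSTEM_CLOCK is added alongside CPU_TIME as an extra, since CPU time alone does not show time spent waiting on the disk.

```fortran
program time_reads
  implicit none
  character(len=*), parameter :: c_FileName = 'points.dat'  ! placeholder name
  real    :: r_Radius, r_X, r_Y, r_Z
  integer :: i_Direction, i_Count, i_Stat
  real    :: r_Cpu0, r_Cpu1
  integer :: i_Tick0, i_Tick1, i_Rate

  open(10, file=c_FileName, form='unformatted', status='old', action='read')
  read(10) r_Radius, i_Direction        ! header record, as in the original

  call cpu_time(r_Cpu0)
  call system_clock(i_Tick0, i_Rate)

  ! Bare read loop: no MOD test and no PRINT inside the timed region.
  i_Count = 0
  do
     read(10, iostat=i_Stat) r_X, r_Y, r_Z
     if (i_Stat /= 0) exit              ! end of file or read error
     i_Count = i_Count + 1
  end do

  call cpu_time(r_Cpu1)
  call system_clock(i_Tick1)
  close(10)

  print *, 'records read :', i_Count
  print *, 'CPU seconds  :', r_Cpu1 - r_Cpu0
  print *, 'wall seconds :', real(i_Tick1 - i_Tick0) / real(i_Rate)
end program time_reads
```

A large gap between the wall-clock and CPU figures would point at waiting (disk or locks) rather than at computation inside the runtime library.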

ron

John_Sturton · ‎10-08-2009

As mentioned in reply to a previous suggestion I have re-created a "native" ifort copy of the original MPS data file, so my program is now running on a "native" file-format file. And in reply to a previous question, yes, my timing tests are all run on the same PC (XP-32bit).

The reason for the mod() and print calls inside the loop is so that I can quickly see whether any modifications I make to the source or compilation options have any effect on the execution time - with the Intel compiler I get a print every 2 seconds, with MPS it is 5 or 6 times that rate (two to three per second). I can of course remove the test from the loop and stare forlornly at my blank screen whilst my hard disk churns away, but the end result is the same - ifort-V11 is many times slower than MPS-V4.

And yes, if the data had been written differently it could no doubt be read back in a more efficient manner. However, the data is there in its current format, and this data has to be re-read and processed.
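
One possible way to keep the file in its current format and still read it in large chunks is to bypass the record layer with Fortran 2003 stream access and step over the record markers by hand. The sketch below is only an illustration under stated assumptions: it assumes the native layout puts a 4-byte length before and after each record, that default reals and integers are 4 bytes, and that the compiler supports access='stream' and INQUIRE(SIZE=). It builds its own small demo file (stream_demo.dat, a made-up name) so it can be run as is.

```fortran
program stream_bulk
  implicit none
  integer, parameter :: i4 = selected_int_kind(9)   ! 4-byte integer kind
  integer, parameter :: n_Rec = 2500, n_Chunk = 1000  ! demo sizes only
  integer(i4), allocatable :: i_Raw(:,:)
  integer(i4) :: i_Mark(2)
  real    :: r_Radius, r_Sum
  integer :: i_Direction, i, n, n_Left, i_Size

  ! Build a demo file in the same shape as the real one:
  ! one header record, then one record per X/Y/Z triple.
  open(10, file='stream_demo.dat', form='unformatted', status='replace')
  write(10) 1.5, 1
  do i = 1, n_Rec
     write(10) real(i), real(i), real(i)
  end do
  close(10)

  open(10, file='stream_demo.dat', form='unformatted', access='stream', &
       status='old')
  inquire(unit=10, size=i_Size)          ! file size in bytes (assumed)

  ! Header: marker + real + integer + marker = 16 bytes (ASSUMED layout).
  read(10) i_Mark(1), r_Radius, i_Direction, i_Mark(2)

  ! Each triple: marker + 3 reals + marker = 20 bytes = 5 four-byte words.
  n_Left = (i_Size - 16) / 20
  allocate(i_Raw(5, n_Chunk))
  r_Sum = 0.0
  do while (n_Left > 0)
     n = min(n_Chunk, n_Left)
     read(10) i_Raw(:, 1:n)              ! one large read for n records
     if (any(i_Raw(1, 1:n) /= 12)) stop 'unexpected record marker'
     do i = 1, n
        ! Words 2..4 of each column are the bit patterns of X, Y, Z.
        r_Sum = r_Sum + transfer(i_Raw(2, i), 1.0)
     end do
     n_Left = n_Left - n
  end do
  close(10, status='delete')

  print *, 'records:', (i_Size - 16) / 20, ' sum of X:', r_Sum
end program stream_bulk
```

The marker check is there because the layout is an assumption; if it fails on the real file, the record format is different and this approach would need adjusting.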

Ron_Green · ‎10-08-2009

And for the native ifort-written data: did you remove the -fpscomp and -assume:buffered_io options? Does the buffered IO help in this case?

It is possible that MPS is faster for the case presented here. Every Fortran Runtime library has its own sweet spots for record sizes vs performance. I don't know that it's a bug, and we don't have MPS in house to test this. If possible, I would recommend blocking the data into larger records. That is sure to help the speed of both MPS and ifort.
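
As a rough sketch of what blocking could look like, the one-off converter below reads the existing one-triple-per-record file and rewrites it with many triples per record. The file names and the block size n_Block are placeholders, not values from this thread, and the output layout (a count record followed by a data record) is just one possible choice.

```fortran
program reblock
  implicit none
  integer, parameter :: n_Block = 100000                       ! assumed block size
  character(len=*), parameter :: c_In  = 'points.dat'          ! placeholder name
  character(len=*), parameter :: c_Out = 'points_blocked.dat'  ! placeholder name
  real, allocatable :: r_Buf(:,:)
  real    :: r_Radius
  integer :: i_Direction, i_Stat, n

  allocate(r_Buf(3, n_Block))
  open(10, file=c_In,  form='unformatted', status='old',     action='read')
  open(11, file=c_Out, form='unformatted', status='replace', action='write')

  ! Copy the header record unchanged.
  read(10)  r_Radius, i_Direction
  write(11) r_Radius, i_Direction

  do
     ! Collect up to n_Block triples from the small records.
     n = 0
     i_Stat = 0
     do while (n < n_Block)
        read(10, iostat=i_Stat) r_Buf(:, n+1)
        if (i_Stat /= 0) exit            ! end of file (or error) ends input
        n = n + 1
     end do
     ! Write them out as a count record plus one large data record.
     if (n > 0) then
        write(11) n
        write(11) r_Buf(:, 1:n)
     end if
     if (i_Stat /= 0) exit
  end do

  close(10)
  close(11)
end program reblock
```

Reading the blocked file back would then take two reads per block, `read(11) n` followed by `read(11) r_Buf(:, 1:n)`, instead of one runtime call per triple.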

John_Sturton · ‎10-09-2009

Yes, I removed the -fpscomp and -assume:buffered_io options. As far as I know the buffered IO option only affects writing to files, so this effectively had no effect on the reading performance.

How can one explain the fact that hitting Ctrl-Break causes the reading to accelerate? Is the executable divided into two threads, one of which is monitoring the Keyboard (and consuming processor ticks) and one of which is doing the file I/O? Hitting Ctrl-Break then "interrupts" the Keyboard thread and leaves the FileI/O thread to carry on alone?

Indeed, it does seem that the slower file I/O performance is somehow linked to the multi-threaded nature of the Intel-produced executable (note: MPS is single-threaded). If I re-compile the program with Intel and specify the -ML (static single thread) linker switch (linking against the copy of LIBC.LIB supplied with MS-VC6), then the resultant executable does run much quicker, with I/O performance comparable to that of the MPS-produced executable.

Ron_Green · ‎10-09-2009

John,

I think you indeed did find the cause of the slowness. Especially if your system is a single core system. The context switching between the event monitoring (keyboard, mouse) thread(s) and the computational thread could indeed be at the root of this. And yes, multithreaded libraries are typically a little slower since there are locks and protections to prevent data buffer corruptions, race conditions, etc with the idea that multiple threads in your application could potentially be using the same IO library calls simultaneously. There is overhead for this extra protection, and with small reads or writes this additional overhead shows itself pretty dramatically. Again, for larger blocked reads and writes those extra milliseconds for extra code paths disappear into the noise. But for small IO and hundreds of thousands of calls, any extra code in each call really adds up.

Intel currently is shipping 80% of its processors as dual-core or better, and I believe it's next year where that figure will be 100%. Some customers wonder how to use those cores, but I think you've just demonstrated how. Although many codes are not parallelized, they still can benefit from another core or 3 by running those background threads on real cores and not time-sharing a single core. Or you could run 2 copies of your application simultaneously. But again, for traditional serial code there's nothing faster than good old single-threaded, static libraries.

If you have a newer version of the 11.x compiler, you may want to explore the asynchronous IO feature of Fortran 2003. This allows read/writes to be done asynchronously in background while your foreground computation continues. We used this all the time in the 'good old days' on the Vax (not the F2003 version, but a DEC extension).
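
For reference, a minimal sketch of the Fortran 2003 syntax involved. It writes a small demo file (async_demo.dat, a made-up name) and reads it back with an asynchronous READ followed by WAIT. Whether the transfer really overlaps with computation depends on the compiler and runtime; the standard allows an implementation to carry it out synchronously.

```fortran
program async_demo
  implicit none
  integer, parameter :: n = 1000               ! demo size only
  real, asynchronous :: r_Buf(3, n)            ! buffer used by async I/O
  real    :: r_Src(3, n)
  integer :: i_Id

  ! Create a small file holding one large record.
  r_Src = 1.0
  open(10, file='async_demo.dat', form='unformatted', status='replace')
  write(10) r_Src
  close(10)

  ! Re-open the file with asynchronous transfers allowed on the unit.
  open(10, file='async_demo.dat', form='unformatted', status='old', &
       asynchronous='yes')

  ! Start the read; i_Id identifies this pending transfer.
  read(10, asynchronous='yes', id=i_Id) r_Buf

  ! ... computation that does not touch r_Buf could run here ...

  ! Block until the transfer has completed before using r_Buf.
  wait(10, id=i_Id)
  print *, 'sum of buffer:', sum(r_Buf)
  close(10, status='delete')
end program async_demo
```

This pattern pays off mainly with large records, so it fits best with blocked data rather than with millions of 12-byte reads.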

As for multithreaded libraries: Microsoft is advocating using shared, multithreaded libraries going forward. This of course because of the whole multicore processor ecosphere.

I'm glad you got to the root of this.

ron

John_Sturton · ‎10-12-2009

I'm afraid I can't agree with you that the cause of the problem is primarily the multi-threaded nature of the executable, and this for two main reasons:

1) My machine is a dual-core machine

2) Performing the same file I/O operation on the same file in a multi-thread C++ program (compiled with MS-C++ V9) is 5 to 6 times quicker.

I really don't want this thread to "go cold" - it seems that there *is* a performance issue with the Intel Fortran compiler's implementation of sequential unformatted reads, and this is likely to be a major headache for us.