I have a large (600MB) unformatted sequential file created by Microsoft Fortran PowerStation V4.00. Reading this file back with Intel Fortran is around four times slower than reading the same file with Microsoft Fortran PowerStation V4.00.
The source code:
      open(10,file=c_FileName,form='unformatted',status='old')
c     header record, then one (x,y,z) triplet per record
      read(10,end=101,err=102) r_Radius, i_Direction
      i_Count = 0
      do 10
        read(10,end=101,err=102) r_X, r_Y, r_Z
        i_Count = i_Count + 1
c       print a progress line every 1,000,000 records
        if (mod(i_Count,1000000) .eq. 0) then
          print *, i_Count, r_X, r_Y, r_Z
        endif
   10 continue
c     (labels 101 and 102 - end-of-file and error handling - not shown)
The compile line:
ifort -assume:byterecl -integer-size:32 -iface:cvf -warn:argument_checking -warn:truncated_source -fpscomp:ioformat -fp:source -MD -assume:buffered_io read_ud.f
So my program prints a line to the console every 1,000,000 reads. What is odd is that if I interrupt the program with Ctrl-Break, a Windows "Application Error" dialog is displayed, but the program continues to run in the background, and its print statements now appear around four times more rapidly.
So I guess my questions are:
1) Is this slow unformatted read performance a known issue?
2) Are there ways to improve it?
3) Why does the program suddenly accelerate when I hit Ctrl-Break?
Cheers,
John Sturton
Thanks for your suggestions. However:
1) Redirecting the prints does not speed up the execution.
2) Running the program on an updated copy of the original data file (created by reading the original and writing a new file) does not speed up the execution.
BTW, the compiler version is:
Intel Visual Fortran Compiler Professional for applications running on IA-32, Version 11.1 Build 20090511 Package ID: w_cprof_p_11.1.035
First, you are reading a non-native file format in ifort. 'Unformatted' data files do store the data in binary, but there are record markers and file marks that delineate the data. There is no standard for these marks, so each compiler has its own 'native' record and file marks.
The data file you are reading was created by another compiler and thus does not use the same storage format as ifort. You are using -fpscomp so that ifort can read this non-native file, so you've added a compatibility layer of software that slows down the reads (and writes). And you're reading in small chunks, which means you go through this extra layer of translation for each read - and you have a boatload of reads going on. You could amortize this cost with large block reads if the data were laid out differently in the file. The extra software overhead would diminish to noise if, for example, the data were written as one record and read back as one record.
Also, you have that whole mod() test and print going on in the loop. Why don't you yank that test/print code out of the loop and put a call to CPU_TIME() before and after it? That gives a clean timing of just the reads, without the IF tests, the calls to MOD(), and potentially the writes.
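For illustration, a bare-bones timing harness along those lines might look something like this (the file name is a placeholder and the 101/102 labels are arbitrary; it assumes the header-plus-triplets record layout shown in the original post):
      program time_reads
      implicit none
      character(len=256) :: c_FileName
      real :: r_Radius, r_X, r_Y, r_Z, t0, t1
      integer :: i_Direction, i_Count
c     placeholder file name - substitute the real one
      c_FileName = 'data.unf'
      open(10,file=c_FileName,form='unformatted',status='old')
      read(10,end=101,err=102) r_Radius, i_Direction
      i_Count = 0
      call cpu_time(t0)
      do
        read(10,end=101,err=102) r_X, r_Y, r_Z
        i_Count = i_Count + 1
      end do
  101 call cpu_time(t1)
      print *, 'records:', i_Count, '  cpu seconds:', t1 - t0
      close(10)
      stop
  102 print *, 'read error after', i_Count, 'records'
      end program time_reads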
ron
As mentioned in reply to a previous suggestion, I have re-created a "native" ifort copy of the original MPS data file, so my program is now running on a native-format file. And in reply to a previous question: yes, my timing tests are all run on the same PC (XP, 32-bit).
The reason for the mod() and print calls inside the loop is so that I can quickly see whether any modifications I make to the source or the compilation options have any effect on the execution time - with the Intel compiler I get a print every 2 seconds, while with MPS it is 5 or 6 times that rate (two to three per second). I can of course remove the test from the loop and stare forlornly at my blank screen whilst my hard disk churns away, but the end result is the same - ifort V11 is many times slower than MPS V4.
And yes, if the data had been written differently it could no doubt be read back more efficiently. However, the data exists in its current format, and it has to be re-read and processed.
And for the native ifort-written data - did you remove the -fpscomp and -assume:buffered_io options? Does the buffered I/O help in this case?
It is possible that MPS is faster for the case presented here. Every Fortran runtime library has its own sweet spots for record size versus performance. I don't know that it's a bug, and we don't have MPS in house to test against. If possible, I would recommend blocking the data into larger records. That is sure to help the speed of both MPS and ifort.
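As a rough sketch of what blocking into larger records could look like (the block size, unit number, and file name here are purely illustrative, not taken from this thread), writing many triplets per record lets a single READ pull in a whole block at once:
      program blocked_io
      implicit none
      integer, parameter :: n_block = 100000
      real :: r_x(n_block), r_y(n_block), r_z(n_block)
      integer :: i
c     fill the arrays so the example is self-contained
      do i = 1, n_block
        r_x(i) = real(i)
        r_y(i) = 2.0*real(i)
        r_z(i) = 3.0*real(i)
      end do
c     one large record instead of n_block small ones
      open(20,file='blocked.unf',form='unformatted',status='replace')
      write(20) r_x, r_y, r_z
      close(20)
c     and one record read brings the whole block back
      open(20,file='blocked.unf',form='unformatted',status='old')
      read(20) r_x, r_y, r_z
      close(20)
      end program blocked_io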
Yes, I removed the -fpscomp and -assume:buffered_io options. As far as I know the buffered I/O option only affects writing to files, so removing it had no effect on the reading performance.
How can one explain the fact that hitting Ctrl-Break causes the reading to accelerate? Is the executable divided into two threads, one of which is monitoring the keyboard (and consuming processor ticks) and one of which is doing the file I/O? Hitting Ctrl-Break then "interrupts" the keyboard thread and leaves the file-I/O thread to carry on alone?
Indeed, it does seem that the slower file I/O performance is somehow linked to the multithreaded nature of the Intel-produced executable (note: MPS is single-threaded). If I re-compile the program with Intel and specify the -ML (static, single-threaded) linker switch, linking against the copy of LIBC.LIB supplied with MS VC6, the resulting executable runs much quicker, with I/O performance comparable to that of the MPS-produced executable.
I think you have indeed found the cause of the slowness, especially if your system is a single-core system. The context switching between the event-monitoring (keyboard, mouse) thread(s) and the computational thread could indeed be at the root of this. And yes, multithreaded libraries are typically a little slower, since there are locks and protections to prevent data buffer corruption, race conditions, etc., on the assumption that multiple threads in your application could be using the same I/O library calls simultaneously. There is overhead for this extra protection, and with small reads or writes that overhead shows itself pretty dramatically. Again, for larger blocked reads and writes those extra code paths disappear into the noise. But for small I/O and hundreds of thousands of calls, any extra code in each call really adds up.
Intel currently ships 80% of its processors as dual-core or better, and I believe next year that figure will be 100%. Some customers wonder how to use those cores, but I think you've just demonstrated how: although many codes are not parallelized, they can still benefit from another core or three by running those background threads on real cores rather than time-sharing a single core. Or you could run two copies of your application simultaneously. But again, for traditional serial code there's nothing faster than good old single-threaded, static libraries.
If you have a newer version of the 11.x compiler, you may want to explore the asynchronous I/O feature of Fortran 2003. This allows reads and writes to be done asynchronously in the background while your foreground computation continues. We used this all the time in the 'good old days' on the VAX (not the F2003 version, but a DEC extension).
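A rough sketch of what the F2003 asynchronous syntax looks like (the unit number and file name are placeholders; check whether your compiler version actually supports the feature before relying on it):
      program async_sketch
      implicit none
      real :: r_X, r_Y, r_Z
      integer :: id, ios
      open(10,file='data.unf',form='unformatted',status='old',
     &     asynchronous='yes')
      read(10,asynchronous='yes',id=id) r_X, r_Y, r_Z
c     ... do other work here while the read proceeds ...
      wait(10,id=id,iostat=ios)
      if (ios .eq. 0) print *, r_X, r_Y, r_Z
      close(10)
      end program async_sketch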
As for multithreaded libraries: Microsoft is advocating shared, multithreaded libraries going forward. This is, of course, because of the whole multicore processor ecosystem.
I'm glad you got to the root of this.
ron
I'm afraid I can't agree that the cause of the problem is primarily the multithreaded nature of the executable, for two main reasons:
1) My machine is a dual-core machine.
2) Performing the same file I/O operation on the same file in a multithreaded C++ program (compiled with MS C++ V9) is 5 to 6 times quicker.
I really don't want this thread to "go cold" - it seems that there *is* a performance issue with the Intel Fortran compiler's implementation of sequential unformatted reads, and this is likely to be a major headache for us.
