Intel® Fortran Compiler

Optimizing I/O Performance

emc-nyc
Beginner
I've got an app which must write a very wide, formatted file. (It must be formatted for a downstream app.) A typical output file will have about 20,000 records, each with about 35,000 six-character-wide columns. (Yes, that is about 4 GB.) It takes about 20,000 seconds of wall time but only about 760 seconds of CPU time; clearly this task is I/O bound and a faster processor won't help!

Other than buying some solid-state discs, is there a prospect for a significant (at least 30%) reduction in wall-clock time by mucking around with buffer count and buffer sizes?
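
(To be concrete, these are the knobs I mean: the BLOCKSIZE and BUFFERCOUNT extensions to OPEN. A trivial sketch; the file name, format, and values are arbitrary placeholders:)

    program tryit
      implicit none
      real :: row(35000)
      row = 0.0
      ! BLOCKSIZE sets the size of each internal I/O buffer and
      ! BUFFERCOUNT the number of buffers for the unit; 65536 and 4
      ! are just guesses to experiment with.
      open (20, file='wide.txt', form='formatted', recl=220000, &
            blocksize=65536, buffercount=4)
      write (20, '(35000F6.2)') row
      close (20)
    end program tryit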

Steven_L_Intel1
Employee
I think you need to buffer this yourself and use binary I/O (FORM='BINARY'), or even better, bypass the Fortran I/O system and use CreateFile/WriteFile directly. Write larger chunks, inserting the CR-LF delimiters yourself where needed; try writing 4KB at a time. You can use an internal WRITE if you really need formatted conversion, but if it's just numeric-to-text, there are faster ways of doing it.
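
Something along these lines is what I mean (just a sketch, untested; the array shape, format, and file name are placeholders for whatever your program actually uses):

    program buffered_write
      implicit none
      integer, parameter :: NCOLS = 35000           ! fields per record
      integer, parameter :: LINELEN = 6*NCOLS + 2   ! 6 chars/field + CR-LF
      real :: row(NCOLS)
      character(LINELEN) :: line
      integer :: irec

      open (10, file='wide.txt', form='binary')     ! raw bytes, no record structure
      do irec = 1, 20000
        row = 0.0                                   ! ... populate the row here ...
        ! One internal WRITE formats the whole row (repeat count = NCOLS):
        write (line(1:6*NCOLS), '(35000F6.2)') row
        line(6*NCOLS+1:) = char(13)//char(10)       ! insert the terminator ourselves
        write (10) line                             ! one large write per record
      end do
      close (10)
    end program buffered_write

The file that comes out is byte-for-byte the same formatted text file, but you issue one big write per record instead of many small ones. (Here each write is one whole line; you could equally flush a smaller buffer every 4KB or so.)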

Steve
TimP
Honored Contributor III
A RAID disk array should increase performance if you truly don't want to alter the file format. It will increase your need for backups as well.
Steven_L_Intel1
Employee
I think the key here is to reduce the number of write requests. This can be done by buffering many small records into one larger record, as I suggested above. If you insert line terminators yourself, the file will still look the same to your applications once written.

Writing zillions of 6-byte records means paying the per-record overhead of the I/O library zillions of times.

Steve
emc-nyc
Beginner
Steve:
I'm writing the entire record with one statement, by writing a single row from a 2-D array. My only performance issue is the time to write the output file; for space reasons the program makes multiple passes through the input file. On the first, it determines the number of data elements per record and writes the few of concern to an (unformatted) scratch file, to avoid the cost of re-parsing each record; on the second and following passes it populates the output array. Performance only becomes really bad when the data require the really large output record lengths: the test data I'm using produce output records about 200,000 bytes wide. It takes about 330 milliseconds to write each output record after it's been populated, which is about 600,000 bytes/second. Am I being unrealistic in hoping to reduce this to (say) 100 milliseconds per record?
Steven_L_Intel1
Employee
I'm not sure I understand exactly what you're doing, as you initially said the records were six bytes.

Let me suggest, as an experiment, trying it with the Win32 APIs CreateFile and WriteFile; they're pretty simple to use. See if that helps at all.
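
Roughly like this (a sketch only, not tested; with CVF you USE DFWIN, with today's Intel Fortran USE IFWIN, and you should check the exact spelling of the null-argument constants against the module):

    program winapi_write
      use dfwin                          ! Win32 declarations (IFWIN for Intel Fortran)
      implicit none
      integer(HANDLE) :: hfile
      integer(DWORD)  :: nwritten
      integer(BOOL)   :: ok
      character(210002) :: line          ! one preformatted record, CR-LF included

      ! 'wide.txt'C is the null-terminated string extension.
      hfile = CreateFile ('wide.txt'C, GENERIC_WRITE, 0, NULL, &
                          CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL)
      line = ' '                         ! ... fill with an internal WRITE, append CR-LF ...
      ok = WriteFile (hfile, loc(line), len(line), loc(nwritten), NULL)
      ok = CloseHandle (hfile)
    end program winapi_write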

It may be that the run-time library is making a copy of the data, which could slow things down.

Steve
emc-nyc
Beginner
Steve,

My expository writing skills departed sometime before I started wearing bifocals. I believe I said there were about 35,000 six-byte fields per record, and about 20,000 records.


If the RTL is making a copy of the data, is there any way to purge the copy or manipulate it directly? (I don't fear assembler; it's just that the last time I used assembly language, I was writing an expression parser on a Univac 1108.)

Using CreateFile/WriteFile does seem to be a bit quicker than the RTL.

I do know that running this program makes the disk drive very busy, almost as bad as a virus scanner.
Steven_L_Intel1
Employee
Ideally, if you did a write of a single array, the RTL would not copy the data before issuing the WriteFile call; it would write directly from your array. However, I don't think CVF does that right now, though it might under some circumstances. This is an area we will look at for future improvements.

If WriteFile works for you, then I suggest you use it.

Steve
kdkeefer
Beginner
Hi,
Sorry for butting in, but I find this problem intriguing. I used to do a lot of programming for real-time applications. The behavior you describe (i.e., a huge discrepancy between wall-clock time and reported CPU time) used to occur frequently on systems in which one wrote to the disk directly (i.e., there was no such thing as files; one wrote to the disk block by block). The "operating system" (this was a Xerox/SDS Sigma 5) was foreground/background, and when a program relinquished control for a DMA transfer, the CPU time was not counted. If one's timing were unfortunate, the hardware could end up waiting a full disk revolution to write each block (sector). Transfers that should have taken a few seconds (and were not recorded as CPU time) could take several minutes.

I know little about modern hardware, but is there a chance that the rate at which the main processor supplies data to the receiving hardware is mismatched? (I know that modern disks have hardware buffers and all that jazz, and I also have no idea what _exactly_ is reported as user CPU time.) But the symptoms are so similar to this ancient problem, and the size and nature of the transfers you describe are so unusual, that I can't keep from speculating.
Regards,
Keith