Intel® Fortran Compiler

The most efficient way in Fortran to read large files backward from the end to the beginning

DataScientist
Valued Contributor I
3,713 Views

I wonder if anyone has any suggestions on how to efficiently read a relatively large (formatted or unformatted) file (1 GB in size or more), which may not even fit in RAM, backward from the end, within the Fortran language standard? Thanks in advance!

0 Kudos
1 Solution

16 Replies
Steve_Lionel
Honored Contributor III
3,705 Views

Reading is one thing, but what are you doing with what you read? Do you need the whole thing read in, or are you operating on a piece at a time? Was this file written by Fortran? Are all the records the same length? 

Without knowing anything about what you're doing, I would probably go for STREAM access and use the POS= specifier on a READ to position within the file. This assumes you know how to find the previous record. You can get the size of the file in bytes with the SIZE= specifier in INQUIRE.
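
For illustration, a minimal sketch of that approach might look like the following. It assumes the file is named data.txt (purely illustrative) and, for simplicity, that every record is a fixed 80 characters plus a one-byte line terminator; variable-length records need the indexing idea discussed further down the thread.

program read_last_record
   implicit none
   integer :: u, ios
   integer(8) :: fsize, pos
   integer, parameter :: reclen = 81      ! assumed: 80 data characters + 1 LF terminator
   character(len=512) :: line

   inquire(file='data.txt', size=fsize)   ! file size in bytes (SIZE= specifier)

   open(newunit=u, file='data.txt', access='stream', form='formatted', &
        action='read', status='old')

   pos = fsize - reclen + 1               ! byte position where the last record starts
   read(u, '(a)', pos=pos, iostat=ios) line
   if (ios == 0) print '(a)', trim(line)

   close(u)
end program read_last_record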

DataScientist
Valued Contributor I
3,691 Views

Very helpful hint, Steve, thank you. The problem is not overly complex: a large heterogeneous table of data consisting of on the order of 10^6 million lines is written to an output file. However, at the end of the work, the file must be read from the end, line by line backward, for some further post-processing. For small data tables, a set of arrays can hold the data in memory for post-processing at the end, without the need to reread the data from the file. However, on systems with limited RAM this will likely be problematic.

What I need is something similar to sequential I/O, but with the ability to jump to a particular line in the file and read that line, which may contain a collection of formatted real/integer/character values. I can think of BACKSPACE serving that purpose, but it does not seem to be an efficient choice.

0 Kudos
mecej4
Honored Contributor III
3,670 Views

A__King wrote: A large heterogeneous table of data consisting of the order of 10^6 million lines is written to an output file.

Indeed? At an average of 50 characters per line, that is a 50-terabyte file, and no HDD of that size is available to the average PC user.

Which of the design decisions are set in stone? Why not write the file in "reverse" order in the first place, or do a one-time reverse copy to a new file, which is then read in forward order as often as needed?

GVautier
New Contributor II
3,660 Views

Opening the file in append mode sets the file pointer after the last line.

With one BACKSPACE, you can read the last line. Then with two BACKSPACEs you can read the previous line, and so on.

As I/O is buffered, I think BACKSPACE implies limited disk access, i.e., only when a line straddles a buffer boundary. Try to find the optimum buffer size.
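
A minimal sketch of that loop, assuming the number of records (nlines) is already known (for example, because the program wrote the file) and that data.txt is just an illustrative name. Note Steve's caveat below about possibly needing an extra BACKSPACE over the virtual endfile record right after the OPEN.

program read_backward_backspace
   implicit none
   integer, parameter :: nlines = 1000000   ! assumed number of records in the file
   integer :: u, ios, i
   character(len=512) :: line

   open(newunit=u, file='data.txt', status='old', action='read', position='append')

   do i = 1, nlines
      backspace(u)                 ! step back so the next READ gets the previous record
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      ! ... process line (this is record nlines - i + 1) ...
      backspace(u)                 ! step back over the record that was just read
   end do

   close(u)
end program read_backward_backspace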

 

andrew_4619
Honored Contributor II
3,648 Views

The implementation of BACKSPACE is, I think, very dependent on the compiler. On some compilers in the past I have found it really slow. I would be interested in any comments on that, as I have not used it for a long time.

Steve_Lionel
Honored Contributor III
3,631 Views

When I worked on the VAX FORTRAN RTL, I implemented an optimized form of BACKSPACE that reduced the number of seeks, but it would not cope well with what you are doing. I am not aware that the Intel implementation does anything other than rewind and forward space, which would be horrible for your case.

If you open for APPEND access, you may need two BACKSPACEs when starting out, one to back over the "ENDFILE record" (which is virtual).

If the file is written with formatted I/O, and the record lengths vary, there isn't an efficient way to do it with straightforward standard statements. 

@andrew_4619 's suggestion of keeping track of locations is a good one, and is similar to the optimization I implemented on VAX, but it requires reading forward through the file, periodically (every 1000 records?) recording the current position (INQUIRE will tell you) so that you can position back to it, then reading every record in that block and recording its position; only then can you start to work backwards.

I share @mecej4 's skepticism about your file size upper bound.
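
A sketch of that forward-indexing pass with formatted stream access, simplified to record the start position of every record rather than every 1000th; the file name and the capacity of the position array are illustrative assumptions.

program index_then_read_backward
   implicit none
   integer :: u, ios, n, i
   integer(8), allocatable :: recpos(:)
   character(len=512) :: line

   allocate(recpos(10000000))            ! assumed upper bound on the record count

   open(newunit=u, file='data.txt', access='stream', form='formatted', &
        action='read', status='old')

   ! Forward pass: remember where every record starts
   n = 0
   do
      if (n == size(recpos)) exit
      inquire(unit=u, pos=recpos(n+1))   ! POS of the next record to be read
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      n = n + 1
   end do

   ! Backward pass: jump directly to each recorded start position
   do i = n, 1, -1
      read(u, '(a)', pos=recpos(i)) line
      ! ... process line ...
   end do

   close(u)
end program index_then_read_backward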

DataScientist
Valued Contributor I
3,622 Views

Thanks, Steve and everyone. The file size I mentioned in the original post is inaccurate (it has a duplicate "millions"): it is on the order of a million lines, not trillions.

0 Kudos
mecej4
Honored Contributor III
3,598 Views

A__King wrote:

The file size I mentioned in the original post is inaccurate (it has a duplicate "millions"): it is on the order of a million lines, not trillions.

Now that we have reduced the size of the file by six orders of magnitude with one sentence, my recommendation would be: "Read the entire file into memory (array of lines) in sequential order, and then process the in-memory data in any order that you want."
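
A minimal sketch of that in-memory approach; the file name and the assumed 512-character maximum line length are illustrative.

program slurp_then_reverse
   implicit none
   integer :: u, ios, n, i
   character(len=512) :: buf
   character(len=:), allocatable :: lines(:)

   open(newunit=u, file='data.txt', status='old', action='read')

   ! First pass: count the lines
   n = 0
   do
      read(u, '(a)', iostat=ios) buf
      if (ios /= 0) exit
      n = n + 1
   end do

   allocate(character(len=512) :: lines(n))

   ! Second pass: read them all into memory
   rewind(u)
   do i = 1, n
      read(u, '(a)') lines(i)
   end do
   close(u)

   ! Process in reverse order, entirely in memory
   do i = n, 1, -1
      ! ... process trim(lines(i)) ...
   end do
end program slurp_then_reverse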

0 Kudos
DataScientist
Valued Contributor I
3,559 Views

@mecej4 For 1-10 million lines that works very well. Indeed, that's what I have been doing for testing. But for any demanding real problem where the number of lines goes beyond tens or hundreds of millions (which can happen frequently), the memory (re)allocation can fail (and has failed in my tests). (The program basically estimates the new allocation required when it runs out of memory and reallocates the arrays.)

I had the same idea as @andrew_4619 with sequential I/O, and I have noticed in tests that block I/O is much faster ( @FortranFan ), which is likely not surprising to many here. The second major insight came from @Steve_Lionel and @jimdempseyatthecove: use stream access instead of sequential (which appears to be a very efficient solution for reading backward) and keep a record of the line starts. That is something I have not used in the past, and I am not quite sure I fully understand how it works, in particular when the I/O is formatted. For example, if I write a stream with

"(*(g0.13,:,','))"

(or perhaps with some more complex delimiter), then how should it be read back into memory from the file via stream access? For now, I am going to avoid formatted stream I/O, but if anyone has any experience with formatted stream access, I would appreciate their sharing it here ( @FortranFan ). If I succeed in implementing block stream I/O, I will share the results here.
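
To make the question concrete, here is a hedged sketch of one record written to a formatted stream with the format quoted above, its start position noted with INQUIRE, and the same record later read back as a plain string via POS=; the file name and values are purely illustrative.

program formatted_stream_roundtrip
   implicit none
   integer :: u
   integer(8) :: recstart
   real :: x(3) = [1.0, 2.5, 3.75]
   character(len=256) :: line

   open(newunit=u, file='table.csv', access='stream', form='formatted', status='replace')
   inquire(unit=u, pos=recstart)             ! remember where the record begins
   write(u, "(*(g0.13,:,','))") x
   close(u)

   open(newunit=u, file='table.csv', access='stream', form='formatted', &
        status='old', action='read')
   read(u, '(a)', pos=recstart) line         ! the record comes back as one string
   print '(a)', trim(line)
   close(u)
end program formatted_stream_roundtrip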
0 Kudos
JohnNichols
Valued Contributor III
3,539 Views

There is a computer program called VEDIT. I have used it since 1988; you can actually run IFORT from within the program as a macro, which is what I did in the late 1980s.

But it can read any size file I have ever thrown at it, and really quickly. We use it for super-large DXF files for creating AutoCAD drawings and for finding errors in the files; AutoCAD just stops without telling you the line or the error. It beats Notepad++ etc. hands down.

The author is a really nice guy; I would ask him how he did it in 1988 on 8086 computers. I talked to him last week.

I still use the program, as sometimes you have text files you cannot open any other way.

It really is the best text editor money can buy.

Steve_Lionel
Honored Contributor III
3,543 Views

Formatted stream I/O isn't much different from formatted record I/O on Windows/Linux/Mac. It writes record markers on output and recognizes them on input. On platforms where formatted records may use out-of-band record markers (VMS, for example), they could be different.
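
A tiny illustration of those record markers: each formatted stream WRITE below ends its record with a marker containing new_line('a'), which a raw byte scan then finds again; the file name is an illustrative assumption.

program stream_record_marker
   implicit none
   integer :: u, i, nrec
   integer(8) :: fsize
   character(len=:), allocatable :: raw

   open(newunit=u, file='demo.txt', access='stream', form='formatted', status='replace')
   write(u, '(a)') 'first record'       ! a record marker is appended after each record
   write(u, '(a)') 'second record'
   close(u)

   ! Reopen as raw bytes and count the newline markers
   inquire(file='demo.txt', size=fsize)
   open(newunit=u, file='demo.txt', access='stream', form='unformatted', &
        status='old', action='read')
   allocate(character(len=int(fsize)) :: raw)
   read(u) raw
   close(u)

   nrec = 0
   do i = 1, len(raw)
      if (raw(i:i) == new_line('a')) nrec = nrec + 1
   end do
   print *, 'record markers found:', nrec   ! 2, for the two records written above
end program stream_record_marker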

DataScientist
Valued Contributor I
3,682 Views

Since each row could contain a list of heterogeneous (delimited) values, the easiest way that comes to my mind at the moment is to keep a record of the beginnings of all lines while the lines are being written to the file. Then, once the work is done, read each line as a string, split the string by the delimiter, and read the value of each part into the corresponding variable. I am curious to know what you think of such a stream-reading style.

On a side note, I understand that stream access is likely faster than sequential for the particular problem discussed in this question; however, should one generally expect any fundamental performance difference between stream and sequential I/O?
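
For simple comma-delimited numeric fields, the splitting can even be left to an internal list-directed read, since commas act as value separators there; a hedged sketch follows, with the record layout (an integer ID and two reals) as an illustrative assumption. Character fields are trickier and generally need quoting in the file or manual splitting on the delimiter.

subroutine parse_record(line, id, x, y, ios)
   implicit none
   character(len=*), intent(in)  :: line   ! one record, already read as a string
   integer,          intent(out) :: id
   real,             intent(out) :: x, y
   integer,          intent(out) :: ios

   ! Commas in 'line' act as value separators for list-directed input
   read(line, *, iostat=ios) id, x, y
end subroutine parse_record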

 

0 Kudos
andrew_4619
Honored Contributor II
3,679 Views

Data format allowing, I would consider a "block" method with STREAM/POS. I would define a large size that works within memory limits, say 100 MB, and then read a 100 MB block of the file. Then work on that in memory, and when it is exhausted, ditch it and get the next/previous chunk. Fewer file operations will always win in a speed race.
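
A hedged sketch of pulling the last chunk of the file into memory with a single stream read; the chunk size and file name are illustrative, and scanning the buffer for line breaks is left as a comment.

program read_last_block
   implicit none
   integer(8), parameter :: chunk = 100_8 * 1024 * 1024   ! ~100 MB block
   integer :: u, ios
   integer(8) :: fsize, pos
   character(len=:), allocatable :: buf

   inquire(file='data.txt', size=fsize)
   open(newunit=u, file='data.txt', access='stream', form='unformatted', &
        action='read', status='old')

   pos = max(1_8, fsize - chunk + 1)                 ! start of the last chunk
   allocate(character(len=int(fsize - pos + 1)) :: buf)
   read(u, pos=pos, iostat=ios) buf                  ! one big read

   ! ... scan buf backwards for new_line('a') to split it into lines,
   !     then fetch the previous chunk and repeat ...
   close(u)
end program read_last_block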

jimdempseyatthecove
Honored Contributor III
3,604 Views

A minor modification to @andrew_4619's suggestion:

For your data file, "yourDataFile.dat", have an index file, "yourDataFile.idx".

When you open "yourDataFile.dat", determine whether there is a "yourDataFile.idx".
If not, create one.
If found, perform an INQUIRE to get the last-modified dates of both files; if the index file is older than the data file, recreate it.

The contents of the index file are the POS of the start of every n-th line (you determine n). If you use a fixed field width (or binary), you can read the contents of the index file by computing the entry's position directly. Once you have the index entry for the desired n-th interval, read the lesser of n records or up to end of file into an array of n records (noting the record index at which EOF occurred, if it did). Once in memory, you can read the array backwards. Repeat for the prior n records of the file.
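
A hedged sketch of the write side of this scheme: while the data file is written with formatted stream access, the POS of every n-th record start goes into a fixed-width, direct-access index file so that the k-th entry can later be fetched directly; the stride, record layout, and record count are illustrative assumptions.

program write_with_index
   implicit none
   integer, parameter :: n = 1000           ! index every n-th record (you determine n)
   integer, parameter :: nrec = 100000      ! assumed number of data records
   integer :: udat, uidx, i
   integer(8) :: pos

   open(newunit=udat, file='yourDataFile.dat', access='stream', form='formatted', &
        status='replace', action='write')
   open(newunit=uidx, file='yourDataFile.idx', access='direct', form='formatted', &
        recl=20, status='replace', action='write')

   do i = 1, nrec
      if (mod(i-1, n) == 0) then
         inquire(unit=udat, pos=pos)                 ! start of this record
         write(uidx, '(i20)', rec=(i-1)/n + 1) pos   ! fixed-width index entry
      end if
      write(udat, '(i0,",",g0.13)') i, 0.5*real(i)   ! the data record itself
   end do

   close(udat)
   close(uidx)
   ! Later: read(uidx, '(i20)', rec=k) pos  gives the POS of data record (k-1)*n + 1
end program write_with_index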

Jim Dempsey

FortranFan
Honored Contributor II
3,590 Views

@DataScientist ,

Now that you have received some suggestions, it would be helpful if you could try them out in a small example and share your results and a summary online (here or, better yet, on GitHub) of what you found to be "the most efficient way."

0 Kudos