Intel® Fortran Compiler

The most efficient way in Fortran to read large files backward from the end to the beginning

DataScientist
Valued Contributor I
3,713 Views

I wonder if anyone has any suggestions on how to efficiently read a relatively large (formatted or unformatted) file (1 GB in size or more), which may not even fit in RAM, backward from the end, within the Fortran language standard? Thanks in advance!

0 Kudos
1 Solution

16 Replies
Steve_Lionel
Honored Contributor III
3,705 Views

Reading is one thing, but what are you doing with what you read? Do you need the whole thing read in, or are you operating on a piece at a time? Was this file written by Fortran? Are all the records the same length? 

Without knowing anything about what you're doing, I would probably go for STREAM access and use the POS= specifier on a READ to position within the file. This assumes you know how to find the previous record. You can get the size of the file in bytes with the SIZE= specifier in INQUIRE.
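
For illustration, a minimal sketch of that approach might look like the following. It assumes the file is named data.txt (purely illustrative) and, for simplicity, that every record is a fixed 80 characters plus a one-byte line terminator; variable-length records need the indexing idea discussed further down the thread.

program read_last_record
   implicit none
   integer :: u, ios
   integer(8) :: fsize, pos
   integer, parameter :: reclen = 81      ! assumed: 80 data characters + 1 LF terminator
   character(len=512) :: line

   inquire(file='data.txt', size=fsize)   ! file size in bytes (SIZE= specifier)

   open(newunit=u, file='data.txt', access='stream', form='formatted', &
        action='read', status='old')

   pos = fsize - reclen + 1               ! byte position where the last record starts
   read(u, '(a)', pos=pos, iostat=ios) line
   if (ios == 0) print '(a)', trim(line)

   close(u)
end program read_last_record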

DataScientist
Valued Contributor I
3,691 Views

Very helpful hint, Steve, thank you. The problem is not overly complex: a large heterogeneous table of data consisting of on the order of 10^6 million lines is written to an output file. However, at the end of the work, the file must be read from the end, line by line backward, for some further post-processing. For small data tables, a set of arrays can hold the data in memory for post-processing at the end, without the need to reread the data from the file. However, on systems with limited RAM this will likely be problematic.

What I need is something similar to sequential I/O, but with the ability to jump to a particular line in the file and read that line, which may contain a collection of formatted real/integer/character values. I can think of BACKSPACE serving that purpose, but it does not seem to be an efficient choice.

0 Kudos
mecej4
Honored Contributor III
3,670 Views

A__King wrote: A large heterogeneous table of data consisting of the order of 10^6 million lines is written to an output file.

Indeed? At an average of 50 characters per line, that is a 50-terabyte file, and no HDD of that size is available to the average PC user.

Which of the design decisions are set in stone? Why not write the file in "reverse" order in the first place, or do a one-time reverse copy to a new file, which is then read in forward order as often as needed?

GVautier
New Contributor II
3,660 Views

Opening the file in append mode sets the file pointer after the last line.

With one BACKSPACE, you can read the last line. Then with two BACKSPACEs you can read the previous line, and so on.

As I/O is buffered, I think BACKSPACE implies limited disk access, i.e., only when a line straddles a buffer boundary. Try to find the optimum buffer size.
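
A minimal sketch of that loop, assuming the number of records (nlines) is already known (for example, because the program wrote the file) and that data.txt is just an illustrative name. Note Steve's caveat below about possibly needing an extra BACKSPACE over the virtual endfile record right after the OPEN.

program read_backward_backspace
   implicit none
   integer, parameter :: nlines = 1000000   ! assumed number of records in the file
   integer :: u, ios, i
   character(len=512) :: line

   open(newunit=u, file='data.txt', status='old', action='read', position='append')

   do i = 1, nlines
      backspace(u)                 ! step back so the next READ gets the previous record
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      ! ... process line (this is record nlines - i + 1) ...
      backspace(u)                 ! step back over the record that was just read
   end do

   close(u)
end program read_backward_backspace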

 

andrew_4619
Honored Contributor II
3,648 Views

The implementation of BACKSPACE is, I think, very dependent on the compiler. On some compilers in the past I have found it really slow. I would be interested in any comments on that, as I have not used it for a long time.

Steve_Lionel
Honored Contributor III
3,631 Views

When I worked on the VAX FORTRAN RTL, I implemented an optimized form of BACKSPACE that reduced the number of seeks, but it would not cope well with what you are doing. I am not aware that the Intel implementation does anything other than rewind and forward space, which would be horrible for your case.

If you open for APPEND access, you may need two BACKSPACEs when starting out, one to back over the "ENDFILE record" (which is virtual).

If the file is written with formatted I/O, and the record lengths vary, there isn't an efficient way to do it with straightforward standard statements. 

@andrew_4619 's suggestion of keeping track of locations is a good one, and is similar to the optimization I implemented on VAX, but it requires reading forward through the file, periodically (every 1000 records?) recording the current position (INQUIRE will tell you) so that you can position back to it, then reading every record in that block and recording its position; only then can you start to work backwards.

I share @mecej4 's skepticism about your file size upper bound.
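
A sketch of that forward-indexing pass with formatted stream access, simplified to record the start position of every record rather than every 1000th; the file name and the capacity of the position array are illustrative assumptions.

program index_then_read_backward
   implicit none
   integer :: u, ios, n, i
   integer(8), allocatable :: recpos(:)
   character(len=512) :: line

   allocate(recpos(10000000))            ! assumed upper bound on the record count

   open(newunit=u, file='data.txt', access='stream', form='formatted', &
        action='read', status='old')

   ! Forward pass: remember where every record starts
   n = 0
   do
      if (n == size(recpos)) exit
      inquire(unit=u, pos=recpos(n+1))   ! POS of the next record to be read
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit
      n = n + 1
   end do

   ! Backward pass: jump directly to each recorded start position
   do i = n, 1, -1
      read(u, '(a)', pos=recpos(i)) line
      ! ... process line ...
   end do

   close(u)
end program index_then_read_backward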

DataScientist
Valued Contributor I
3,622 Views

Thanks, Steve and everyone. The file size I mentioned in the original post is inaccurate (it has a duplicate "millions"): it is on the order of a million lines, not trillions.

0 Kudos
mecej4
Honored Contributor III
3,598 Views

A__King wrote:

The file size I mentioned in the original post is inaccurate (it has a duplicate "millions"): it is on the order of a million lines, not trillions.

Now that we have reduced the size of the file by six orders of magnitude with one sentence, my recommendation would be: "Read the entire file into memory (array of lines) in sequential order, and then process the in-memory data in any order that you want."
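
A minimal sketch of that in-memory approach; the file name and the assumed 512-character maximum line length are illustrative.

program slurp_then_reverse
   implicit none
   integer :: u, ios, n, i
   character(len=512) :: buf
   character(len=:), allocatable :: lines(:)

   open(newunit=u, file='data.txt', status='old', action='read')

   ! First pass: count the lines
   n = 0
   do
      read(u, '(a)', iostat=ios) buf
      if (ios /= 0) exit
      n = n + 1
   end do

   allocate(character(len=512) :: lines(n))

   ! Second pass: read them all into memory
   rewind(u)
   do i = 1, n
      read(u, '(a)') lines(i)
   end do
   close(u)

   ! Process in reverse order, entirely in memory
   do i = n, 1, -1
      ! ... process trim(lines(i)) ...
   end do
end program slurp_then_reverse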

0 Kudos
DataScientist
Valued Contributor I
3,559 Views

@mecej4 For 1-10 million lines that works very well. Indeed, that's what I have been doing for testing. But for any demanding real problem where the number of lines goes beyond tens or hundreds of millions (which can happen frequently), the memory (re)allocation can fail (and has failed in my tests). (The program basically estimates the new allocation required when it runs out of memory and reallocates the arrays.)

I had the same idea as @andrew_4619 with sequential I/O, and I have noticed in tests that block I/O is much faster ( @FortranFan ), which is likely not surprising to many here. The second major insight came from @Steve_Lionel and @jimdempseyatthecove: use stream access instead of sequential (which appears to be a very efficient solution for reading backward) and keep a record of the line starts. That is something I have not used in the past, and I am not quite sure I fully understand how it works, in particular when the I/O is formatted. For example, if I write a stream with

"(*(g0.13,:,','))"

(or perhaps with some more complex delimiter), then how should it be read back into memory from the file via stream access? For now, I am going to avoid formatted stream I/O, but if anyone has any experience with formatted stream access, I would appreciate their sharing it here ( @FortranFan ). If I succeed in implementing block stream I/O, I will share the results here.
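
To make the question concrete, here is a hedged sketch of one record written to a formatted stream with the format quoted above, its start position noted with INQUIRE, and the same record later read back as a plain string via POS=; the file name and values are purely illustrative.

program formatted_stream_roundtrip
   implicit none
   integer :: u
   integer(8) :: recstart
   real :: x(3) = [1.0, 2.5, 3.75]
   character(len=256) :: line

   open(newunit=u, file='table.csv', access='stream', form='formatted', status='replace')
   inquire(unit=u, pos=recstart)             ! remember where the record begins
   write(u, "(*(g0.13,:,','))") x
   close(u)

   open(newunit=u, file='table.csv', access='stream', form='formatted', &
        status='old', action='read')
   read(u, '(a)', pos=recstart) line         ! the record comes back as one string
   print '(a)', trim(line)
   close(u)
end program formatted_stream_roundtrip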
0 Kudos
JohnNichols
Valued Contributor III
3,539 Views

There is a computer program called VEDIT. I have used it since 1988; you can actually run IFORT from within the program as a macro, which is what I did in the late 1980s.

But it can read any size file I have ever thrown at it, and really quickly. We use it for super-large DXF files for creating AutoCAD drawings and for finding errors in the files; AutoCAD just stops without telling you the line or the error. It beats Notepad++ etc. hands down.

The author is a really nice guy; I would ask him how he did it in 1988 on 8086 computers. I talked to him last week.

I still use the program, as sometimes you have text files you cannot open any other way.

It really is the best text editor money can buy.

Steve_Lionel
Honored Contributor III
3,543 Views

Formatted stream I/O isn't much different from formatted record I/O on Windows/Linux/Mac. It writes record markers on output and recognizes them on input. On platforms where formatted records may use out-of-band record markers (VMS, for example), they could be different.
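
A tiny illustration of those record markers: each formatted stream WRITE below ends its record with a marker containing new_line('a'), which a raw byte scan then finds again; the file name is an illustrative assumption.

program stream_record_marker
   implicit none
   integer :: u, i, nrec
   integer(8) :: fsize
   character(len=:), allocatable :: raw

   open(newunit=u, file='demo.txt', access='stream', form='formatted', status='replace')
   write(u, '(a)') 'first record'       ! a record marker is appended after each record
   write(u, '(a)') 'second record'
   close(u)

   ! Reopen as raw bytes and count the newline markers
   inquire(file='demo.txt', size=fsize)
   open(newunit=u, file='demo.txt', access='stream', form='unformatted', &
        status='old', action='read')
   allocate(character(len=int(fsize)) :: raw)
   read(u) raw
   close(u)

   nrec = 0
   do i = 1, len(raw)
      if (raw(i:i) == new_line('a')) nrec = nrec + 1
   end do
   print *, 'record markers found:', nrec   ! 2, for the two records written above
end program stream_record_marker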

DataScientist
Valued Contributor I
3,682 Views

Since each row could contain a list of heterogeneous (delimited) values, the easiest way that comes to my mind at the moment is to keep a record of the beginnings of all lines while the lines are being written to the file. Then, once the work is done, read each line as a string, split the string by the delimiter, and read the value of each part into the corresponding variable. I am curious to know what you think of such a stream-reading style.

On a side note, I understand that stream access is likely faster than sequential for the particular problem discussed in this question; however, should one generally expect any fundamental performance difference between stream and sequential I/O?
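
For simple comma-delimited numeric fields, the splitting can even be left to an internal list-directed read, since commas act as value separators there; a hedged sketch follows, with the record layout (an integer ID and two reals) as an illustrative assumption. Character fields are trickier and generally need quoting in the file or manual splitting on the delimiter.

subroutine parse_record(line, id, x, y, ios)
   implicit none
   character(len=*), intent(in)  :: line   ! one record, already read as a string
   integer,          intent(out) :: id
   real,             intent(out) :: x, y
   integer,          intent(out) :: ios

   ! Commas in 'line' act as value separators for list-directed input
   read(line, *, iostat=ios) id, x, y
end subroutine parse_record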

 

0 Kudos
andrew_4619
Honored Contributor II
3,679 Views

Data format allowing, I would consider a "block" method with STREAM/POS. I would define a large size that works within memory limits, say 100 MB, and then read a 100 MB block of the file. Then work on that in memory, and when it is exhausted, ditch it and get the next/previous chunk. Fewer file operations will always win in a speed race.
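
A hedged sketch of pulling the last chunk of the file into memory with a single stream read; the chunk size and file name are illustrative, and scanning the buffer for line breaks is left as a comment.

program read_last_block
   implicit none
   integer(8), parameter :: chunk = 100_8 * 1024 * 1024   ! ~100 MB block
   integer :: u, ios
   integer(8) :: fsize, pos
   character(len=:), allocatable :: buf

   inquire(file='data.txt', size=fsize)
   open(newunit=u, file='data.txt', access='stream', form='unformatted', &
        action='read', status='old')

   pos = max(1_8, fsize - chunk + 1)                 ! start of the last chunk
   allocate(character(len=int(fsize - pos + 1)) :: buf)
   read(u, pos=pos, iostat=ios) buf                  ! one big read

   ! ... scan buf backwards for new_line('a') to split it into lines,
   !     then fetch the previous chunk and repeat ...
   close(u)
end program read_last_block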

jimdempseyatthecove
Honored Contributor III
3,604 Views

A minor modification to @andrew_4619's suggestion:

For your data file, "yourDataFile.dat", have an index file, "yourDataFile.idx".

When you open "yourDataFile.dat", determine whether there is a "yourDataFile.idx".
If not, create one.
If found, perform an INQUIRE to get the last-modified dates of both files; if the index file is older than the data file, recreate it.

The contents of the index file are the POS of the start of every n-th line (you determine n). If you use a fixed field width (or binary), you can read the contents of the index file by computing the entry's position directly. Once you have the index entry for the desired n-th interval, read the lesser of n records or up to end of file into an array of n records (noting the record index at which EOF occurred, if it did). Once in memory, you can read the array backwards. Repeat for the prior n records of the file.
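
A hedged sketch of the write side of this scheme: while the data file is written with formatted stream access, the POS of every n-th record start goes into a fixed-width, direct-access index file so that the k-th entry can later be fetched directly; the stride, record layout, and record count are illustrative assumptions.

program write_with_index
   implicit none
   integer, parameter :: n = 1000           ! index every n-th record (you determine n)
   integer, parameter :: nrec = 100000      ! assumed number of data records
   integer :: udat, uidx, i
   integer(8) :: pos

   open(newunit=udat, file='yourDataFile.dat', access='stream', form='formatted', &
        status='replace', action='write')
   open(newunit=uidx, file='yourDataFile.idx', access='direct', form='formatted', &
        recl=20, status='replace', action='write')

   do i = 1, nrec
      if (mod(i-1, n) == 0) then
         inquire(unit=udat, pos=pos)                 ! start of this record
         write(uidx, '(i20)', rec=(i-1)/n + 1) pos   ! fixed-width index entry
      end if
      write(udat, '(i0,",",g0.13)') i, 0.5*real(i)   ! the data record itself
   end do

   close(udat)
   close(uidx)
   ! Later: read(uidx, '(i20)', rec=k) pos  gives the POS of data record (k-1)*n + 1
end program write_with_index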

Jim Dempsey

FortranFan
Honored Contributor II
3,590 Views

@DataScientist ,

Now that you have received some suggestions, it would be helpful if you could try them out in a small example and share your results and a summary online (here or, better yet, on GitHub) of what you found to be "the most efficient way."

0 Kudos