Fortran gurus:
I'm looking for the fastest, safest, and most portable way to read the entire contents of a text file into a Fortran allocatable string. Here's what I've come up with:
    subroutine read_file(filename, str)

    implicit none

    character(len=*),intent(in) :: filename
    character(len=:),allocatable,intent(out) :: str

    !parameters:
    integer,parameter :: n_chunk = 256 !chunk size for reading file [arbitrary]
    character(len=*),parameter :: nfmt = '(A256)' !corresponding format statement
    character(len=1),parameter :: newline = new_line(' ')

    integer :: iunit, istat, isize
    character(len=n_chunk) :: chunk
    integer :: filesize,ipos
    character(len=:),allocatable :: tmp

    !how many characters are in the file:
    inquire(file=filename, size=filesize) !is this portable?

    !initialize:
    ipos = 1 !where to put the next chunk

    !preallocate the str array to speed up the process for large files:
    !str = ''
    allocate( character(len=filesize) :: str )

    !open the file:
    open(newunit=iunit, file=trim(filename), status='OLD', iostat=istat)

    if (istat==0) then

        !read all the characters from the file:
        do
            read(iunit,fmt=nfmt,advance='NO',size=isize,iostat=istat) chunk
            if (istat==0) then
                !str = str//chunk
                str(ipos:ipos+isize-1) = chunk
                ipos = ipos+isize
            elseif (IS_IOSTAT_EOR(istat)) then
                if (isize>0) then
                    !str = str//chunk(1:isize)//newline
                    str(ipos:ipos+isize) = chunk(1:isize)//newline
                    ipos = ipos+isize+1
                else
                    !str = str//newline
                    str(ipos:ipos) = newline
                    ipos = ipos + 1
                end if
            elseif (IS_IOSTAT_END(istat)) then
                if (isize>0) then
                    !str = str//chunk(1:isize)
                    str(ipos:ipos+isize) = chunk(1:isize)//newline
                    ipos = ipos+isize+1
                end if
                exit
            else
                stop 'Error'
            end if
        end do

        !resize the string
        if (ipos<filesize+1) str = str(1:ipos-1)

        close(iunit, iostat=istat)

    else
        write(*,*) 'Error opening file: '//trim(filename)
    end if

    end subroutine read_file
Some notes/questions about this:
1. This routine will read the 100 MB file at https://github.com/seductiveapps/largeJSON in about 1 sec on my PC.
2. Is it really portable to use the SIZE argument of INQUIRE to get the number of characters? I notice that the string I end up with is somewhat smaller than this value, but that could be due to #3. What is the portable way to get the file size in number of characters? (I'd like it to also work on other non-ifort compilers, as well as on other platforms.)
3. I don't think this way preserves the Windows line breaks (if present), since it essentially reads the file line by line and then inserts the newline character. The string I end up with is smaller (which is why I'm trimming it at the end). Is there a way to read it that includes the line breaks as is?
4. The original (naive) version of the routine (see the commented-out bits, e.g., str = str//chunk) is extremely slow and also causes stack overflows for very large strings. The slowness makes sense to me due to all the reallocations, but I didn't expect it to cause stack overflows. Is that to be expected?
5. Any other improvements that anyone can see?
My first thought is to use unformatted, stream reads rather than formatted sequential. This will preserve the terminators. You can read chunks directly. The only trick is that if the file size isn't an exact multiple of the chunk, you'll get a premature EOF. Then you can read one character at a time to finish.
SIZE= on INQUIRE is standard.
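The chunked approach described in this reply could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `chunked_read_mod`, `read_file_chunked`, `n_chunk`, and the `ok` flag are made up for this example, the chunk size is arbitrary, and it assumes the processor lets a stream unit be repositioned with `pos=` after an end-of-file condition. The buffer grows by doubling through `move_alloc`, so no `str = str//chunk` temporaries are created.

```fortran
module chunked_read_mod
   implicit none
   private
   public :: read_file_chunked
contains
   ! Hypothetical helper: read a whole file as raw bytes via unformatted stream I/O.
   subroutine read_file_chunked(filename, str, ok)
      character(len=*), intent(in) :: filename
      character(len=:), allocatable, intent(out) :: str
      logical, intent(out) :: ok

      integer, parameter :: n_chunk = 4096   ! arbitrary chunk size (assumption)
      character(len=n_chunk) :: chunk
      character(len=1) :: c
      character(len=:), allocatable :: tmp
      integer :: iunit, istat, ipos, cap

      ok = .false.
      open(newunit=iunit, file=filename, status='old', action='read', &
           form='unformatted', access='stream', iostat=istat)
      if (istat /= 0) return          ! str stays unallocated on open failure

      cap = n_chunk
      allocate(character(len=cap) :: str)
      ipos = 1   ! next file position to read, and next free index in str

      ! Read whole chunks until a read fails (normally a premature EOF).
      do
         read(iunit, pos=ipos, iostat=istat) chunk
         if (istat /= 0) exit
         call ensure(ipos + n_chunk - 1)
         str(ipos:ipos+n_chunk-1) = chunk
         ipos = ipos + n_chunk
      end do

      ! Finish the partial last chunk one character at a time.
      if (is_iostat_end(istat)) then
         do
            read(iunit, pos=ipos, iostat=istat) c
            if (istat /= 0) exit
            call ensure(ipos)
            str(ipos:ipos) = c
            ipos = ipos + 1
         end do
      end if

      ok = is_iostat_end(istat)       ! anything other than EOF is an error
      close(iunit)

      ! Trim the buffer to the number of characters actually read.
      allocate(character(len=ipos-1) :: tmp)
      tmp = str(1:ipos-1)
      call move_alloc(tmp, str)
   contains
      ! Grow str (by doubling) so that it holds at least n characters.
      subroutine ensure(n)
         integer, intent(in) :: n
         if (n <= cap) return
         do while (cap < n)
            cap = 2*cap
         end do
         allocate(character(len=cap) :: tmp)
         tmp(1:ipos-1) = str(1:ipos-1)
         call move_alloc(tmp, str)
      end subroutine ensure
   end subroutine read_file_chunked
end module chunked_read_mod
```

A caller would `use chunked_read_mod`, call `read_file_chunked('data.json', str, ok)`, and check `ok` before using `str`. Because positions are held in default integers, this sketch is limited to files below about 2 GB.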
- You should strongly consider defending against a filesize of -1 (unknown length) in case the Fortran processor is lazy or the "file" is actually a pipe of some sort. Even if the filesize is positive, you may want to defend against there being more data in the file than the filesize indicates (unlikely, but the standard doesn't guarantee that filesize > datasize).
- There's no benefit to nominating the length of the character buffer in the format string - that just opens up the possibility of getting the buffer length and the format string out of sync. Just use "(A)".
- There's no point trimming a filename for OPEN - OPEN does that for you (as a result, you cannot open a file with trailing blanks in its name in standard Fortran).
- With formatted sequential access (as you have now), I don't think you can "legally" get IOSTAT_END and a non-zero size. With formatted stream access you might, but note that if the file doesn't actually have a terminating newline (which is the only way you can have IOSTAT_END and a positive size), then I think you are adding one.
- If you simply want to transfer the "bytes" as they are on disk (and not do things like newline translations), then use unformatted stream. It will be faster. You can also then make your chunk size much, much bigger, because you (and the Fortran processor) no longer care about newlines in the file.
- When self assigning, in the case of str = str // whatever, the compiler has to build a temporary for the right hand side of the assignment. It may (depending on compiler options) put that temporary on the stack. If it does that and that temporary is big, then your stack will overflow.
- Using an indent setting of four spaces is bad for your health.
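As a small illustration of the first point in this list, the size query can be wrapped so that every "unknown" outcome collapses to a single sentinel. This is a sketch only: `file_size_mod` and `known_file_size` are hypothetical names, and the 64-bit result kind is an added assumption so that files larger than 2 GB do not overflow a default integer. SIZE= reports file storage units, which are usually (but not necessarily) bytes.

```fortran
module file_size_mod
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   private
   public :: known_file_size
contains
   ! Hypothetical helper: return the file size in file storage units,
   ! or -1 if the file is missing or the processor cannot determine it.
   function known_file_size(filename) result(nunits)
      character(len=*), intent(in) :: filename
      integer(int64) :: nunits
      integer :: istat
      logical :: found

      nunits = -1_int64
      found = .false.
      inquire(file=filename, exist=found, size=nunits, iostat=istat)
      if (istat /= 0 .or. .not. found) nunits = -1_int64
      if (nunits < 0) nunits = -1_int64   ! normalize any negative value
   end function known_file_size
end module file_size_mod
```

A caller can then preallocate and read in one go only when the result is non-negative, and fall back to a chunked read otherwise. Even with a non-negative size it is still worth checking afterwards that no data remains in the file.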
Thanks for all the suggestions! I'll try the unformatted stream read. I've never actually used that feature before.
Here's my updated version that reads the whole file at once using form='unformatted', access='stream'. It definitely makes a difference. This one reads the same 100 MB file in only 0.1 seconds!

    subroutine read_file(filename, str)

    implicit none

    character(len=*),intent(in) :: filename
    character(len=:),allocatable,intent(out) :: str

    !local variables:
    integer :: iunit,istat,filesize
    character(len=1) :: c

    open(newunit=iunit,file=filename,status='OLD',&
         form='UNFORMATTED',access='STREAM',iostat=istat)

    if (istat==0) then

        !how many characters are in the file:
        inquire(file=filename, size=filesize)
        if (filesize>0) then

            !read the file all at once:
            allocate( character(len=filesize) :: str )
            read(iunit,pos=1,iostat=istat) str

            if (istat==0) then
                !make sure it was all read by trying to read more:
                read(iunit,pos=filesize+1,iostat=istat) c
                if (.not. IS_IOSTAT_END(istat)) &
                    write(*,*) 'Error: file was not completely read.'
            else
                write(*,*) 'Error reading file.'
            end if

            close(iunit, iostat=istat)

        else
            write(*,*) 'Error getting file size.'
        end if

    else
        write(*,*) 'Error opening file.'
    end if

    end subroutine read_file
What about UTF-8 encoded files?
What if they contain non-ASCII characters (assuming the compiler supports ISO 10646/UCS-4)? (I know that, currently, in the case of ifort, this is hypothetical…)
Looking at the Wikipedia entry on UTF-8 encoding, the leading bits of the first byte determine the number of bytes used to encode each character (1-6 in the original design, 1-4 since RFC 3629). In this case, I suspect (but please confirm) that one cannot simply read the file using unformatted stream I/O into a character variable of either default OR ISO_10646 kind.
Presumably, if one understood UTF-8 really well, the entire file could be read into a default character string and then reprocessed in memory to convert it to UCS-4/ISO 10646 using the bit manipulation intrinsics and transfer statements. But the conversion process looks like it would be quite involved, and it might be quite a bit slower than just reading UTF-8 encoded files with formatted stream I/O… What do you think?
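For what it's worth, the in-memory conversion asked about here is fairly compact if only the basic decoding rules are applied. The sketch below is an illustration, not a validated decoder: `utf8_decode_mod` and `utf8_to_codepoints` are hypothetical names, it returns integer code points rather than an ISO_10646 string (so it compiles even where that kind is unsupported), it assumes `ichar` on a default character returns the byte value 0-255, and it does not reject overlong encodings or surrogates.

```fortran
module utf8_decode_mod
   implicit none
   private
   public :: utf8_to_codepoints
contains
   ! Hypothetical helper: decode raw UTF-8 bytes (held in a default character
   ! string) into an array of Unicode code points.
   subroutine utf8_to_codepoints(bytes, cp, ok)
      character(len=*), intent(in) :: bytes
      integer, allocatable, intent(out) :: cp(:)
      logical, intent(out) :: ok
      integer, allocatable :: tmp(:)
      integer :: i, k, n, b, nextra, code

      allocate(cp(len(bytes)))   ! upper bound: one code point per byte
      n = 0
      i = 1
      ok = .true.
      do while (i <= len(bytes))
         b = ichar(bytes(i:i))                  ! assumed to be 0..255
         if (b < 128) then                      ! 0xxxxxxx: 1 byte
            code = b
            nextra = 0
         else if (iand(b, 224) == 192) then     ! 110xxxxx: 2 bytes
            code = iand(b, 31)
            nextra = 1
         else if (iand(b, 240) == 224) then     ! 1110xxxx: 3 bytes
            code = iand(b, 15)
            nextra = 2
         else if (iand(b, 248) == 240) then     ! 11110xxx: 4 bytes
            code = iand(b, 7)
            nextra = 3
         else                                   ! stray continuation or invalid lead
            ok = .false.
            exit
         end if
         if (i + nextra > len(bytes)) then      ! truncated sequence
            ok = .false.
            exit
         end if
         do k = 1, nextra
            b = ichar(bytes(i+k:i+k))
            if (iand(b, 192) /= 128) then       ! must look like 10xxxxxx
               ok = .false.
               exit
            end if
            code = ior(ishft(code, 6), iand(b, 63))
         end do
         if (.not. ok) exit
         n = n + 1
         cp(n) = code
         i = i + nextra + 1
      end do

      ! Shrink the result to the number of code points decoded.
      allocate(tmp(n))
      tmp = cp(1:n)
      call move_alloc(tmp, cp)
   end subroutine utf8_to_codepoints
end module utf8_decode_mod
```

Where the compiler does support the ISO_10646 kind, each code point could then be turned into a character with `char(cp(j), kind=selected_char_kind('ISO_10646'))`. Whether this beats formatted stream I/O with `encoding='UTF-8'` on speed would need to be measured.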