I'm looking for the fastest, safest, and most portable way to read the entire contents of a text file into a Fortran allocatable string. Here's what I've come up with:
subroutine read_file(filename, str)

implicit none

character(len=*),intent(in) :: filename
character(len=:),allocatable,intent(out) :: str

!parameters:
integer,parameter :: n_chunk = 256            !chunk size for reading file [arbitrary]
character(len=*),parameter :: nfmt = '(A256)' !corresponding format statement
character(len=1),parameter :: newline = new_line(' ')

integer :: iunit, istat, isize
character(len=n_chunk) :: chunk
integer :: filesize, ipos

!how many characters are in the file:
inquire(file=filename, size=filesize)  !is this portable?

!initialize:
ipos = 1  !where to put the next chunk

!preallocate the str array to speed up the process for large files:
!str = ''
allocate( character(len=filesize) :: str )

!open the file:
open(newunit=iunit, file=trim(filename), status='OLD', iostat=istat)
if (istat==0) then

    !read all the characters from the file:
    do
        read(iunit,fmt=nfmt,advance='NO',size=isize,iostat=istat) chunk
        if (istat==0) then
            !str = str//chunk
            str(ipos:ipos+isize-1) = chunk
            ipos = ipos + isize
        elseif (IS_IOSTAT_EOR(istat)) then
            if (isize>0) then
                !str = str//chunk(1:isize)//newline
                str(ipos:ipos+isize) = chunk(1:isize)//newline
                ipos = ipos + isize + 1
            else
                !str = str//newline
                str(ipos:ipos) = newline
                ipos = ipos + 1
            end if
        elseif (IS_IOSTAT_END(istat)) then
            if (isize>0) then
                !partial record with no terminator at the end of the file,
                !so don't append a newline here:
                !str = str//chunk(1:isize)
                str(ipos:ipos+isize-1) = chunk(1:isize)
                ipos = ipos + isize
            end if
            exit
        else
            stop 'Error'
        end if
    end do

    !resize the string to the number of characters actually read:
    if (ipos<filesize+1) str = str(1:ipos-1)

    close(iunit, iostat=istat)

else
    write(*,*) 'Error opening file: '//trim(filename)
end if

end subroutine read_file
Some notes/questions about this:
My first thought is to use unformatted, stream reads rather than formatted sequential. This will preserve the terminators. You can read chunks directly. The only trick is that if the file size isn't an exact multiple of the chunk, you'll get a premature EOF. Then you can read one character at a time to finish.
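As a sketch of what I mean (the subroutine and variable names here are mine, and this is untested): read full chunks while they fit, then mop up the remainder one byte at a time, which also sidesteps the premature-EOF problem when the file size isn't a multiple of the chunk.

```fortran
subroutine read_file_chunked(filename, str)

implicit none

character(len=*),intent(in) :: filename
character(len=:),allocatable,intent(out) :: str

integer,parameter :: n_chunk = 4096  !chunk size [arbitrary]
character(len=n_chunk) :: chunk
character(len=1) :: c
integer :: iunit, istat, filesize, ipos

open(newunit=iunit, file=filename, status='OLD', &
     form='UNFORMATTED', access='STREAM', iostat=istat)
if (istat/=0) return

inquire(unit=iunit, size=filesize)
allocate( character(len=filesize) :: str )

!read full chunks while they fit:
ipos = 1
do while (ipos+n_chunk-1 <= filesize)
    read(iunit, iostat=istat) chunk
    str(ipos:ipos+n_chunk-1) = chunk
    ipos = ipos + n_chunk
end do

!finish the tail one byte at a time:
do while (ipos <= filesize)
    read(iunit, iostat=istat) c
    str(ipos:ipos) = c
    ipos = ipos + 1
end do

close(iunit)

end subroutine read_file_chunked
```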
SIZE= on INQUIRE is standard (Fortran 2003). It returns the file size in file storage units, which are bytes on every compiler I'm aware of (see FILE_STORAGE_SIZE in ISO_FORTRAN_ENV).
Here's my updated version that reads the whole file at once using form='UNFORMATTED', access='STREAM'. It definitely makes a difference. This one reads the same 100 MB file in only 0.1 seconds!
subroutine read_file(filename, str)

implicit none

character(len=*),intent(in) :: filename
character(len=:),allocatable,intent(out) :: str

!local variables:
integer :: iunit, istat, filesize
character(len=1) :: c

open(newunit=iunit, file=filename, status='OLD', &
     form='UNFORMATTED', access='STREAM', iostat=istat)
if (istat==0) then

    !how many characters are in the file:
    inquire(file=filename, size=filesize)
    if (filesize>0) then

        !read the file all at once:
        allocate( character(len=filesize) :: str )
        read(iunit, pos=1, iostat=istat) str

        if (istat==0) then
            !make sure it was all read by trying to read more:
            read(iunit, pos=filesize+1, iostat=istat) c
            if (.not. IS_IOSTAT_END(istat)) &
                write(*,*) 'Error: file was not completely read.'
        else
            write(*,*) 'Error reading file.'
        end if

        close(iunit, iostat=istat)

    else
        write(*,*) 'Error getting file size.'
    end if

else
    write(*,*) 'Error opening file.'
end if

end subroutine read_file
What about UTF-8 encoded files?
What if they contain non-ASCII characters (assuming the compiler supports ISO 10646/UCS-4)? (I know that, currently, in the case of ifort, this is hypothetical…)
Looking at the Wikipedia entry on UTF-8 encoding, the leading bits of the first byte determine the number of bytes used to encode each character (1 to 4 in the current standard; the original design allowed up to 6). In this case, I suspect (but please confirm) that one cannot simply read the file using unformatted stream I/O into a character variable of either default OR ISO_10646 kind.
Presumably, if one understood UTF-8 really well, the entire file could be read into a default-kind character string and then reprocessed in memory to convert it to UCS-4/ISO 10646 using the bit manipulation intrinsics and TRANSFER. But the conversion process looks like it would be quite involved, and it might be quite a bit slower than just reading in UTF-8 encoded files with formatted stream I/O… What do you think?
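For what it's worth, here's a rough sketch of what that in-memory conversion might look like (the name utf8_to_ucs4 is mine, it handles only 1- to 4-byte sequences, does no validation of continuation bytes, and assumes the compiler supports SELECTED_CHAR_KIND('ISO_10646') — gfortran does; ifort, as noted, does not):

```fortran
subroutine utf8_to_ucs4(bytes, str)

implicit none

integer,parameter :: ucs4 = selected_char_kind('ISO_10646')

character(len=*),intent(in) :: bytes  !raw UTF-8 bytes from a stream read
character(kind=ucs4,len=:),allocatable,intent(out) :: str

integer :: i, j, n, nbytes, cp

allocate( character(kind=ucs4,len=len(bytes)) :: str )  !upper bound

i = 1
n = 0
do while (i <= len(bytes))
    cp = ichar(bytes(i:i))
    if (cp >= 240) then       !11110xxx : 4-byte sequence
        nbytes = 4; cp = iand(cp, 7)
    elseif (cp >= 224) then   !1110xxxx : 3-byte sequence
        nbytes = 3; cp = iand(cp, 15)
    elseif (cp >= 192) then   !110xxxxx : 2-byte sequence
        nbytes = 2; cp = iand(cp, 31)
    else                      !0xxxxxxx : plain ASCII
        nbytes = 1
    end if
    !fold in the 10xxxxxx continuation bytes:
    do j = i+1, i+nbytes-1
        cp = ishft(cp, 6) + iand(ichar(bytes(j:j)), 63)
    end do
    n = n + 1
    str(n:n) = char(cp, kind=ucs4)
    i = i + nbytes
end do

!trim to the number of code points actually decoded:
str = str(1:n)

end subroutine utf8_to_ucs4
```

It's just shifts and masks per byte, so I'd guess it's competitive with (or faster than) formatted stream reads, but that would need measuring.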