
Jacob_Williams · ‎03-03-2015

Fortran gurus:

I'm looking for the fastest, safest, and most portable way to read the entire contents of a text file into a Fortran allocatable string. Here's what I've come up with:

    subroutine read_file(filename, str)

    implicit none
    
    character(len=*),intent(in)  :: filename
    character(len=:),allocatable,intent(out) :: str
        
    !parameters:
    integer,parameter  :: n_chunk = 256      !chunk size for reading file [arbitrary]
    character(len=*),parameter :: nfmt = '(A256)'    !corresponding format statement
    character(len=1),parameter :: newline = new_line(' ')
    
    integer :: iunit, istat, isize
    character(len=n_chunk) :: chunk
    integer :: filesize,ipos
    character(len=:),allocatable :: tmp
    
    !how many characters are in the file:
    inquire(file=filename, size=filesize)  !is this portable?
    
    !initialize:
    ipos = 1    !where to put the next chunk
    
    !preallocate the str array to speed up the process for large files:
    !str = ''
    allocate( character(len=filesize) :: str )
    
    !open the file:
    open(newunit=iunit, file=trim(filename), status='OLD', iostat=istat)
    
    if (istat==0) then
            
        !read all the characters from the file:
        do
            
            read(iunit,fmt=nfmt,advance='NO',size=isize,iostat=istat) chunk
            
            if (istat==0) then
            
                !str = str//chunk
                str(ipos:ipos+isize-1) = chunk
                ipos = ipos+isize
            
            elseif (IS_IOSTAT_EOR(istat)) then
                            
                if (isize>0) then
                    !str = str//chunk(1:isize)//newline
                    str(ipos:ipos+isize) = chunk(1:isize)//newline
                    ipos = ipos+isize+1
                else
                    !str = str//newline                
                    str(ipos:ipos) = newline
                    ipos = ipos + 1
                end if
                
            elseif (IS_IOSTAT_END(istat)) then        
            
                if (isize>0) then
                    !str = str//chunk(1:isize)
                    str(ipos:ipos+isize) = chunk(1:isize)//newline
                    ipos = ipos+isize+1
                end if
                
                exit
            
            else
                stop 'Error'
            end if
        
        end do
                
        !resize the string
        if (ipos<filesize+1) str = str(1:ipos-1)
		
        close(iunit, iostat=istat)
       
    else
        write(*,*) 'Error opening file: '//trim(filename)
    end if
    
    end subroutine read_file

Some notes/questions about this:

1. This routine will read the 100 MB file at https://github.com/seductiveapps/largeJSON in about 1 sec on my PC.
2. Is it really portable to use the SIZE argument of INQUIRE to get the number of characters? I notice that the string I end up with is somewhat smaller than this value, but that could be due to #3. What is the portable way to get the file size in number of characters (I'd like it to also work on other non-ifort compilers, as well as on other platforms).
3. I don't think this way preserves the Windows line breaks (if present), since it essentially reads it line by line and then inserts the newline character. The string I end up with is smaller (which is why I'm trimming it at the end). Is there a way to read it in a way that includes the line breaks as is?
4. The original (naive) version of the routine (see the commented-out bits, e.g., str = str//chunk) is extremely slow and also causes stack overflows for very large strings. The slowness makes sense to me due to all the reallocations, but I didn't expect it to cause stack overflows. Is that to be expected?
5. Any other improvements that anyone can see?

Steven_L_Intel1 · ‎03-03-2015

My first thought is to use unformatted, stream reads rather than formatted sequential. This will preserve the terminators. You can read chunks directly. The only trick is that if the file size isn't an exact multiple of the chunk, you'll get a premature EOF. Then you can read one character at a time to finish.
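A minimal sketch of the chunk-then-tail idea described above, for illustration only. The module name `stream_chunks_sketch`, the routine name `read_file_chunked`, the chunk size, and the doubling buffer are assumptions, not something from this thread. It does not rely on SIZE= at all: it reads full chunks until a read hits end-of-file, then repositions with POS= and finishes one character at a time.

```fortran
module stream_chunks_sketch
    implicit none
    private
    public :: read_file_chunked

contains

    ! Read a whole file as raw bytes using unformatted stream access.
    ! istat is 0 on success, otherwise the IOSTAT value of the failing statement.
    ! Positions are default integers, so this sketch assumes a file under 2 GB.
    subroutine read_file_chunked(filename, str, istat)
        character(len=*), intent(in) :: filename
        character(len=:), allocatable, intent(out) :: str
        integer, intent(out) :: istat

        integer, parameter :: n_chunk = 4096    ! chunk size [arbitrary]
        character(len=n_chunk) :: chunk
        character(len=1) :: c
        character(len=:), allocatable :: tmp
        integer :: iunit
        integer :: n                            ! characters stored so far

        n = 0
        open(newunit=iunit, file=filename, status='old', action='read', &
             form='unformatted', access='stream', iostat=istat)
        if (istat /= 0) then
            allocate(character(len=0) :: str)
            return
        end if
        allocate(character(len=n_chunk) :: str)

        ! Full chunks, until a read runs past the end of the file.
        do
            read(iunit, pos=n+1, iostat=istat) chunk
            if (istat /= 0) exit
            call append(chunk)
        end do

        ! Tail: the last partial chunk, one character at a time.
        if (is_iostat_end(istat)) then
            do
                read(iunit, pos=n+1, iostat=istat) c
                if (istat /= 0) exit
                call append(c)
            end do
        end if

        ! Reaching end-of-file is the normal way out of the loops.
        if (is_iostat_end(istat)) istat = 0
        close(iunit)

        ! Shrink to the number of characters actually read.
        allocate(character(len=n) :: tmp)
        tmp = str(1:n)
        call move_alloc(tmp, str)

    contains

        ! Store piece at the end of str, doubling the buffer when it is full.
        subroutine append(piece)
            character(len=*), intent(in) :: piece
            if (n + len(piece) > len(str)) then
                allocate(character(len=2*(n + len(piece))) :: tmp)
                tmp(1:n) = str(1:n)
                call move_alloc(tmp, str)
            end if
            str(n+1:n+len(piece)) = piece
            n = n + len(piece)
        end subroutine append

    end subroutine read_file_chunked

end module stream_chunks_sketch
```

Growing the buffer with `move_alloc` avoids the `str = str//chunk` self-assignment, so no large temporary is built for the right-hand side. If the file size is known and trusted, a single read into a preallocated string is simpler and likely faster.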

SIZE= on INQUIRE is standard.

IanH · ‎03-03-2015

You should strongly consider defending against a filesize of -1 (unknown length) in case the Fortran processor is lazy or the "file" is actually a pipe of some sort. Even if the filesize is positive, you may want to defend against there being more data in the file than the filesize indicates (unlikely, but the standard doesn't guarantee that filesize > datasize).
There's no benefit to nominating the length of the character buffer in the format string - that just opens up the possibility of getting the buffer length and the format string out of sync. Just use "(A)".
There's no point trimming a filename for OPEN - OPEN does that for you (as a result, you cannot open a file with trailing blanks in its name in standard Fortran).
With formatted sequential access (as you have now), I don't think you can "legally" get IOSTAT_END and a non-zero size. With formatted stream access you might, but note that if the file doesn't actually have a terminating newline (which is the only way you can have IOSTAT_END and a positive size), then I think you are adding one.
If you simply want to transfer the "bytes" as they are on disk (and not do things like newline translations), then use unformatted stream. It will be faster. You can also then make your chunk size much, much bigger, because you (and the Fortran processor) no longer care about newlines in the file.
When self assigning, in the case of str = str // whatever, the compiler has to build a temporary for the right hand side of the assignment. It may (depending on compiler options) put that temporary on the stack. If it does that and that temporary is big, then your stack will overflow.
Using an indent setting of four spaces is bad for your health.
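To illustrate the "(A)" point for the formatted case, here is a hedged sketch of a line reader. The module name `read_line_sketch`, the routine name `read_line`, and the buffer length are hypothetical. With a bare "(A)" edit descriptor the field width is taken from the length of the variable, so the format and the buffer cannot drift out of sync.

```fortran
module read_line_sketch
    implicit none
    private
    public :: read_line

contains

    ! Read the rest of the current record from a formatted unit into line.
    ! istat is 0 when a record was read, an end-of-file value (test it with
    ! IS_IOSTAT_END) when there are no more records, and positive on error.
    subroutine read_line(iunit, line, istat)
        integer, intent(in) :: iunit
        character(len=:), allocatable, intent(out) :: line
        integer, intent(out) :: istat

        character(len=256) :: buffer    ! buffer length [arbitrary]
        integer :: isize

        allocate(character(len=0) :: line)
        do
            ! "(A)" with no width: the field width is len(buffer).
            read(iunit, fmt='(A)', advance='no', size=isize, iostat=istat) buffer
            if (istat == 0) then
                ! Buffer filled and the record continues.
                ! The concatenation is acceptable here because lines are short.
                line = line // buffer
            else if (is_iostat_eor(istat)) then
                ! End of record: keep only the characters actually transferred.
                line = line // buffer(1:isize)
                istat = 0
                exit
            else
                ! End of file or a genuine error.
                exit
            end if
        end do
    end subroutine read_line

end module read_line_sketch
```

This works record by record, so it does not preserve the original line terminators. For a byte-exact copy of the file, unformatted stream access is still the better fit.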

Jacob_Williams · ‎03-03-2015

Thanks for all the suggestions! I'll try the unformatted stream read. I've never actually used that feature before.

Jacob_Williams · ‎03-04-2015

Here's my updated version that reads the whole file at once using form='unformatted', access='stream'. It definitely makes a difference. This one reads the same 100 MB file in only 0.1 seconds!

    subroutine read_file(filename, str)

    implicit none
    
    character(len=*),intent(in) :: filename
    character(len=:),allocatable,intent(out) :: str
    
    !local variables:
    integer :: iunit,istat,filesize
    character(len=1) :: c
  
    open(newunit=iunit,file=filename,status='OLD',&
            form='UNFORMATTED',access='STREAM',iostat=istat)
    
    if (istat==0) then

        !how many characters are in the file:
        inquire(file=filename, size=filesize)
        if (filesize>0) then
            
            !read the file all at once:
            allocate( character(len=filesize) :: str )
            read(iunit,pos=1,iostat=istat) str
        
            if (istat==0) then
                !make sure it was all read by trying to read more:
                read(iunit,pos=filesize+1,iostat=istat) c
                if (.not. IS_IOSTAT_END(istat)) &
                    write(*,*) 'Error: file was not completely read.'
            else
                write(*,*) 'Error reading file.'
            end if
        
        else
            write(*,*) 'Error getting file size.'
        end if

        !close the file on every path where the open succeeded:
        close(iunit, iostat=istat)
    else
        write(*,*) 'Error opening file.'
    end if
    
    end subroutine read_file

Izaak_Beekman · ‎03-10-2015

What about UTF-8 encoded files?

What if they contain non-ASCII characters (assuming the compiler supports ISO10646/UCS4)? (I know that, currently, in the case of ifort, this is hypothetical…)

Looking at the Wikipedia entry on UTF-8 encoding, the leading bits will determine the number of bytes (1-6) used to encode each character. In this case, I suspect (but please confirm) that one cannot simply read the file using unformatted stream I/O into a character variable of either default OR ISO_10646 kind.

Presumably if one understood UTF-8 really well, the entire file could be read into a 'DEFAULT' character string, and then reprocessed in memory to convert it to UCS4/ISO10646 using the bit manipulation intrinsics and the TRANSFER intrinsic. But the conversion process looks like it would be quite involved, and it might be quite a bit slower than just reading in UTF-8 encoded files with formatted stream I/O… What do you think?
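As a rough sketch of the in-memory conversion, the bytes of a default-kind string can be decoded with the bit intrinsics alone. The module name `utf8_decode_sketch` and the routine name `utf8_to_codepoints` are hypothetical. It returns plain integer code points rather than an ISO_10646 string, so it does not need compiler support for that kind. It assumes one byte per default character, handles sequences of 1 to 4 bytes (the limit set by RFC 3629), and does not reject overlong encodings or surrogates.

```fortran
module utf8_decode_sketch
    implicit none
    private
    public :: utf8_to_codepoints

contains

    ! Decode UTF-8 bytes held in a default-kind string into code points.
    ! ok is .false. if a malformed or truncated sequence is met; cp then
    ! holds the code points decoded before the problem.
    subroutine utf8_to_codepoints(bytes, cp, ok)
        character(len=*), intent(in) :: bytes
        integer, allocatable, intent(out) :: cp(:)
        logical, intent(out) :: ok

        integer, allocatable :: work(:)
        integer :: i, k, n, b, nextra, code

        allocate(work(len(bytes)))      ! upper bound: one code point per byte
        n = 0
        i = 1
        ok = .true.

        do while (i <= len(bytes))
            b = iand(ichar(bytes(i:i)), 255)
            if (b < 128) then                       ! 0xxxxxxx
                nextra = 0
                code = b
            else if (iand(b, 224) == 192) then      ! 110xxxxx
                nextra = 1
                code = iand(b, 31)
            else if (iand(b, 240) == 224) then      ! 1110xxxx
                nextra = 2
                code = iand(b, 15)
            else if (iand(b, 248) == 240) then      ! 11110xxx
                nextra = 3
                code = iand(b, 7)
            else                                    ! stray continuation or invalid lead byte
                ok = .false.
                exit
            end if

            if (i + nextra > len(bytes)) then       ! sequence cut off at end of input
                ok = .false.
                exit
            end if

            do k = 1, nextra
                b = iand(ichar(bytes(i+k:i+k)), 255)
                if (iand(b, 192) /= 128) then       ! continuation must be 10xxxxxx
                    ok = .false.
                    exit
                end if
                code = ior(ishft(code, 6), iand(b, 63))
            end do
            if (.not. ok) exit

            n = n + 1
            work(n) = code
            i = i + nextra + 1
        end do

        allocate(cp(n))
        cp = work(1:n)
    end subroutine utf8_to_codepoints

end module utf8_decode_sketch
```

This is a single linear pass over the bytes, so it adds one extra sweep on top of the unformatted stream read. Where the compiler does support the ISO_10646 kind, opening the file for formatted I/O with ENCODING='UTF-8' and reading into a character variable of that kind lets the run-time library do the conversion instead.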

Text file to allocatable string