Solved: speed of reading character lines depends on size of storage variable

may_ka · ‎05-01-2020

Hi all,

I noticed with the program below that the speed for reading lines of a file into a character variable depends on the length of the variable.

program test
  implicit none
  integer :: i,j
  character(1000000) :: c0i
  real :: r01=0.0,r02=0.0
  character(500) :: c0msg
  open(unit=55,file="Data.csv",action="read",form="formatted")
  call cpu_time(r01)
  do i=1,2069675
    read(55,"(A)",iostat=j,iomsg=c0msg) c0i
  end do
  call cpu_time(r02)
  write(*,*) r02-r01
end program test

The number of lines in the file is known, but the length of each line in the file is unknown. Therefore the length of the "c0i" is oversized to a best guess to account for many possible situations. However, when the length was set to 10,000 the file read in less than 1(3) seconds, when set to 100,000 the file read in about 6(6) seconds, and when set to 1,000,000 the file read in 94(40) seconds, where the number before the brackets is from ifort, inside bracket from gfortran.

I tried to work out what the program is actually doing to justify the speed difference and it appears as if that part of "c0i" which is not used is set to "blank".

Any suggestions which explain the observations and who to avoid it even.

Thanks

Ulrich_M_ · ‎05-02-2020

I use an allocatable character variable for the purpose. Here is my homebrew function to load csv files:

function loadcsv(filename, headerflag) result(mat)
	character(len=*):: filename
	logical, optional	:: headerflag
	real, allocatable  :: mat(:,:)
	integer         :: ioerr,m,n,i,j,myunit,l1,l2
	character(:), allocatable  :: line
	real	:: nanx

	nanx=0.0/0.0

	open (file=filename,status='old',iostat=ioerr,newunit=myunit)
	if(ioerr/=0) then
		print *,"loadcsv: file not found: ",filename
		stop
	endif
	m=0
	do
		read (myunit,*,iostat=ioerr)
		if (ioerr/=0) exit     
		m=m+1
	enddo
	rewind(myunit)
	call getline
	n=countitems(line)
	rewind(myunit)
	if(present(headerflag)) then
		if(headerflag) then
			m=m-1
			call getline
		endif
	endif
	allocate(mat(m,n))
	do i=1,m
		call getline
		l1=1
		do j=1,n
			l2=merge(index(line(l1:),","),len(line)-l1+2,j<n)
			if(l2==0) then 
				print *,"loadcsv: not enough fields in row",i
				stop
			endif
			if(l2==1) then 
				mat(i,j)=nanx
			else
				read(line(l1:l1+l2-2),*,iostat=ioerr) mat(i,j)
				if(ioerr>0) mat(i,j)=nanx
			endif
			l1=l1+l2
		enddo
	enddo
	close(myunit)
	
	contains

		function countitems(line) result(val)
			character(len=*)    :: line
			integer				:: val
			integer				:: i,n
			val=1
			n=len(line)
			do i=1,n
				if(line(i:i)==",") val=val+1
			enddo
		end function
		
		subroutine getline
			character(len=8192) :: buffer
			integer				:: lsize
			line=''
			do
				read(myunit,"(a)",advance="no",iostat=ioerr,size=lsize) buffer
				if(ioerr>0) then
					print *,"loadmat can't read first line: ",filename
					stop
				endif
				line=line//buffer(:lsize)
				if(ioerr<0) exit
			enddo
		end subroutine
end function

View solution in original post

GVautier · ‎05-02-2020

In Fortran, all character variables are padded up to their declared length with blanks. So the time to assign a value to a string grows with the string length.

If you want to optimize your code you must use a method compatible with that fact.

mecej4 · ‎05-02-2020

It may be useful to replace the READ statement in the program by

read(55,*)

and to record the time consumed. The times consumed in reading the file contents to a buffer, locating the EOLs and checking for error conditions will contribute. The additional time spent in the original program in copying the file buffer to the character variable will have been removed.

may_ka · ‎05-02-2020

Hi

thanks for the responses.

@mecej4

this does not work as the file is a csv file and which can contain commas and spaces.

From my understanding with "*" the reading process will be interrupted when the program encounters the first field delimiter.

cheers

mecej4 · ‎05-02-2020

For a file opened for formatted record input, a list-directed READ with no io-list will cause the file position to be advanced by one line, regardless of the contents of the line. It is quite common for Fortran programs to skip header lines in input text files using READ(nn,*) statements.

Please try the suggestion. The following program will count the number of lines in a CSV or other text file.

Program LineCount
Implicit none
Integer nLines
!
nLines=0
open(11,file='May02.csv',status='old')
Do
   read(11,*,end=100)
   nLines=nLines+1
End Do
100 print *,'File has ',nLines,' Lines'
End Program

may_ka · ‎05-02-2020

Hi mecej4

sorry but there must be a missunderstanding. I am NOT after the number of lines in the file. I'll know them already from your approach which I had used as well.

The problem is that the length of each line is unknown. Therefore, the variable used to store the line content when the lines is read for further processing is over-allocated by best guess (e.g. 10000, 100000 etc). The problem is that the degree of overallocation affects the read speed as shown in the op.

hope this clarifies the problem.

Arjen_Markus · ‎05-02-2020

I have not tried this myself (at least not in the context of performance), but what you could do is read the line in pieces of say 100 characters and add them to the buffer until you have encountered the end-of-line. Use "ADVANCE='NO'" and detect the end of line via "IOSTAT=errcode" or "EOR=label".

may_ka · ‎05-02-2020

Hi Arjen,

thanks for the suggestion.

Can you make a small example as I don't know how I would do this.

BTW, I am trying to develop an equivalent to "getline" of the string class in c++.

thanks

mecej4 · ‎05-02-2020

may.ka: I didn't say that you want to count lines.

You wrote "From my understanding with "*" the reading process will be interrupted when the program encounters the first field delimiter." The code that I provided disproves that statement, and serves to enable you to measure the time it would take for reading the file with a zero-length I/O list. That time can be expected to depend on the lengths of the records, but this code will not incur the additional overhead of padding with blanks when the record length is less than the size of the character variable in your original post.

may_ka · ‎05-02-2020

Hi,

what seems to avoid the padding is this

program test
  implicit none
  character(12) :: a,b
  integer :: i,j
  do i=1,len(a)
    a(i:i)=achar(0)
  end do
  write(*,"(*(g0:"",""))") (ichar(a(i:i)),i=1,len(a))
  open(55,action="read",form="formatted")
  read(55,"(12A)",iostat=j,advance="no") (a(i:i),i=1,len(a))
  write(*,*) j,i
  write(*,*) a
  do i=1,len(a)
    write(*,*) ichar(a(i:i))
  end do
end program test

the file "fort.55" contains only "xxx". The string is initialized with NULL, and after the reading process most of the NULLs are retained which indicates that the padding didn't occur. However, performancewise it is not competitive. Most likely because the implicit loop does not stop at "end of line".

cheers

Ulrich_M_ · ‎05-02-2020

I use an allocatable character variable for the purpose. Here is my homebrew function to load csv files:

function loadcsv(filename, headerflag) result(mat)
	character(len=*):: filename
	logical, optional	:: headerflag
	real, allocatable  :: mat(:,:)
	integer         :: ioerr,m,n,i,j,myunit,l1,l2
	character(:), allocatable  :: line
	real	:: nanx

	nanx=0.0/0.0

	open (file=filename,status='old',iostat=ioerr,newunit=myunit)
	if(ioerr/=0) then
		print *,"loadcsv: file not found: ",filename
		stop
	endif
	m=0
	do
		read (myunit,*,iostat=ioerr)
		if (ioerr/=0) exit     
		m=m+1
	enddo
	rewind(myunit)
	call getline
	n=countitems(line)
	rewind(myunit)
	if(present(headerflag)) then
		if(headerflag) then
			m=m-1
			call getline
		endif
	endif
	allocate(mat(m,n))
	do i=1,m
		call getline
		l1=1
		do j=1,n
			l2=merge(index(line(l1:),","),len(line)-l1+2,j<n)
			if(l2==0) then 
				print *,"loadcsv: not enough fields in row",i
				stop
			endif
			if(l2==1) then 
				mat(i,j)=nanx
			else
				read(line(l1:l1+l2-2),*,iostat=ioerr) mat(i,j)
				if(ioerr>0) mat(i,j)=nanx
			endif
			l1=l1+l2
		enddo
	enddo
	close(myunit)
	
	contains

		function countitems(line) result(val)
			character(len=*)    :: line
			integer				:: val
			integer				:: i,n
			val=1
			n=len(line)
			do i=1,n
				if(line(i:i)==",") val=val+1
			enddo
		end function
		
		subroutine getline
			character(len=8192) :: buffer
			integer				:: lsize
			line=''
			do
				read(myunit,"(a)",advance="no",iostat=ioerr,size=lsize) buffer
				if(ioerr>0) then
					print *,"loadmat can't read first line: ",filename
					stop
				endif
				line=line//buffer(:lsize)
				if(ioerr<0) exit
			enddo
		end subroutine
end function

may_ka · ‎05-03-2020

Hi

@Ulrich M .............. exactly what I was looking for. I replaced the dynamic memory component by a static section of 8192*n and it works like a charm performance-wise.

thanks a lot

cheers

John_Campbell · ‎05-07-2020

Stream I/O

Why don't you use an approach where c01 is never fully initialised. This could be achieved by using stream I/O and only fill c01 to the length of the line recovered.
The following get_line (string_length, line_string) retrieves each line, assuming a text string (ichar = 32:127), terminated by a <LF>.
all other control characters are ignored (including <CR> and <HT>).
If the file contains more complex characters then get_line would need to be modified. This approach never initialises c01 beyond string_length.

     implicit none
     character(80):: file_name
     character(1000000) :: c01
     integer iostat, line_length, nl, mc

     file_name = 'data.csv'
     open (unit=55,file=file_name,access='stream', action='read',iostat=iostat)
     write (*,*) 'opening ',trim(file_name),' : iostat =',iostat

     mc = 0
     do nl = 1, huge(nl)
       call get_line (line_length, c01)
       if ( line_length < 0) exit
         mc = max (mc,line_length)
     end do
     write (*,*) nl-1,' lines recovered, max length = ',mc

   end
   
   subroutine get_line (line_length, line_string)
     implicit none
     integer,   intent(out) :: line_length
     character, intent(out) ::  line_string*(*)

     integer l, ic, iostat
     character ch
!
     line_length = 0
     do
       read (55,iostat=iostat) ch
       if ( iostat/=0) write (*,*) 'iostat=',iostat, ' line_length=',line_length
       if ( iostat < 0 .and. line_length==0) then
         line_length = iostat              ! mark end of file as line_length = -1
         return
       end if
       if ( iostat /= 0 ) then
         ic = 0                   ! mark any other error as end of text
       else
         ic = ichar (ch)
       end if
       if ( ic >= 32 ) then
         line_length = line_length+1
         line_string(line_length:line_length) = ch
       else
         if ( ic == 13 ) cycle    ! windows CR, wait for LF
         if ( ic == 10 ) exit     ! unix and windows LF
         write (*,*) 'ic =',ic    ! any control character 0:31
         if ( ic ==  0 ) exit     ! C end of text 
       end if
     end do
!
     write (*,11) line_length, line_string(1:line_length)
  11 format (i3,' : ',a)
   end subroutine get_line

Once you adopt a stream I/O approach, it is very easy to adapt to reviewing a data file to identify the types of characters that may be used and also the types of field delimiters in the data, eg spaces, commas, tabs, date/time or others.

I have attached an expanded example, including get_file_size, but many other options are possible.
??? "An AJAX HTTP request terminated abnormally."