- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I noticed with the program below that the speed for reading lines of a file into a character variable depends on the length of the variable.
program test implicit none integer :: i,j character(1000000) :: c0i real :: r01=0.0,r02=0.0 character(500) :: c0msg open(unit=55,file="Data.csv",action="read",form="formatted") call cpu_time(r01) do i=1,2069675 read(55,"(A)",iostat=j,iomsg=c0msg) c0i end do call cpu_time(r02) write(*,*) r02-r01 end program test
The number of lines in the file is known, but the length of each line in the file is unknown. Therefore the length of the "c0i" is oversized to a best guess to account for many possible situations. However, when the length was set to 10,000 the file read in less than 1(3) seconds, when set to 100,000 the file read in about 6(6) seconds, and when set to 1,000,000 the file read in 94(40) seconds, where the number before the brackets is from ifort, inside bracket from gfortran.
I tried to work out what the program is actually doing to justify the speed difference and it appears as if that part of "c0i" which is not used is set to "blank".
Any suggestions which explain the observations and who to avoid it even.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I use an allocatable character variable for the purpose. Here is my homebrew function to load csv files:
function loadcsv(filename, headerflag) result(mat) character(len=*):: filename logical, optional :: headerflag real, allocatable :: mat(:,:) integer :: ioerr,m,n,i,j,myunit,l1,l2 character(:), allocatable :: line real :: nanx nanx=0.0/0.0 open (file=filename,status='old',iostat=ioerr,newunit=myunit) if(ioerr/=0) then print *,"loadcsv: file not found: ",filename stop endif m=0 do read (myunit,*,iostat=ioerr) if (ioerr/=0) exit m=m+1 enddo rewind(myunit) call getline n=countitems(line) rewind(myunit) if(present(headerflag)) then if(headerflag) then m=m-1 call getline endif endif allocate(mat(m,n)) do i=1,m call getline l1=1 do j=1,n l2=merge(index(line(l1:),","),len(line)-l1+2,j<n) if(l2==0) then print *,"loadcsv: not enough fields in row",i stop endif if(l2==1) then mat(i,j)=nanx else read(line(l1:l1+l2-2),*,iostat=ioerr) mat(i,j) if(ioerr>0) mat(i,j)=nanx endif l1=l1+l2 enddo enddo close(myunit) contains function countitems(line) result(val) character(len=*) :: line integer :: val integer :: i,n val=1 n=len(line) do i=1,n if(line(i:i)==",") val=val+1 enddo end function subroutine getline character(len=8192) :: buffer integer :: lsize line='' do read(myunit,"(a)",advance="no",iostat=ioerr,size=lsize) buffer if(ioerr>0) then print *,"loadmat can't read first line: ",filename stop endif line=line//buffer(:lsize) if(ioerr<0) exit enddo end subroutine end function
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In Fortran, all character variables are padded up to their declared length with blanks. So the time to assign a value to a string grows with the string length.
If you want to optimize your code you must use a method compatible with that fact.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It may be useful to replace the READ statement in the program by
read(55,*)
and to record the time consumed. The times consumed in reading the file contents to a buffer, locating the EOLs and checking for error conditions will contribute. The additional time spent in the original program in copying the file buffer to the character variable will have been removed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi
thanks for the responses.
@mecej4
this does not work as the file is a csv file and which can contain commas and spaces.
From my understanding with "*" the reading process will be interrupted when the program encounters the first field delimiter.
cheers
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For a file opened for formatted record input, a list-directed READ with no io-list will cause the file position to be advanced by one line, regardless of the contents of the line. It is quite common for Fortran programs to skip header lines in input text files using READ(nn,*) statements.
Please try the suggestion. The following program will count the number of lines in a CSV or other text file.
Program LineCount Implicit none Integer nLines ! nLines=0 open(11,file='May02.csv',status='old') Do read(11,*,end=100) nLines=nLines+1 End Do 100 print *,'File has ',nLines,' Lines' End Program
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi mecej4
sorry but there must be a missunderstanding. I am NOT after the number of lines in the file. I'll know them already from your approach which I had used as well.
The problem is that the length of each line is unknown. Therefore, the variable used to store the line content when the lines is read for further processing is over-allocated by best guess (e.g. 10000, 100000 etc). The problem is that the degree of overallocation affects the read speed as shown in the op.
hope this clarifies the problem.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have not tried this myself (at least not in the context of performance), but what you could do is read the line in pieces of say 100 characters and add them to the buffer until you have encountered the end-of-line. Use "ADVANCE='NO'" and detect the end of line via "IOSTAT=errcode" or "EOR=label".
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Arjen,
thanks for the suggestion.
Can you make a small example as I don't know how I would do this.
BTW, I am trying to develop an equivalent to "getline" of the string class in c++.
thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
may.ka: I didn't say that you want to count lines.
You wrote "From my understanding with "*" the reading process will be interrupted when the program encounters the first field delimiter." The code that I provided disproves that statement, and serves to enable you to measure the time it would take for reading the file with a zero-length I/O list. That time can be expected to depend on the lengths of the records, but this code will not incur the additional overhead of padding with blanks when the record length is less than the size of the character variable in your original post.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
what seems to avoid the padding is this
program test implicit none character(12) :: a,b integer :: i,j do i=1,len(a) a(i:i)=achar(0) end do write(*,"(*(g0:"",""))") (ichar(a(i:i)),i=1,len(a)) open(55,action="read",form="formatted") read(55,"(12A)",iostat=j,advance="no") (a(i:i),i=1,len(a)) write(*,*) j,i write(*,*) a do i=1,len(a) write(*,*) ichar(a(i:i)) end do end program test
the file "fort.55" contains only "xxx". The string is initialized with NULL, and after the reading process most of the NULLs are retained which indicates that the padding didn't occur. However, performancewise it is not competitive. Most likely because the implicit loop does not stop at "end of line".
cheers
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I use an allocatable character variable for the purpose. Here is my homebrew function to load csv files:
function loadcsv(filename, headerflag) result(mat) character(len=*):: filename logical, optional :: headerflag real, allocatable :: mat(:,:) integer :: ioerr,m,n,i,j,myunit,l1,l2 character(:), allocatable :: line real :: nanx nanx=0.0/0.0 open (file=filename,status='old',iostat=ioerr,newunit=myunit) if(ioerr/=0) then print *,"loadcsv: file not found: ",filename stop endif m=0 do read (myunit,*,iostat=ioerr) if (ioerr/=0) exit m=m+1 enddo rewind(myunit) call getline n=countitems(line) rewind(myunit) if(present(headerflag)) then if(headerflag) then m=m-1 call getline endif endif allocate(mat(m,n)) do i=1,m call getline l1=1 do j=1,n l2=merge(index(line(l1:),","),len(line)-l1+2,j<n) if(l2==0) then print *,"loadcsv: not enough fields in row",i stop endif if(l2==1) then mat(i,j)=nanx else read(line(l1:l1+l2-2),*,iostat=ioerr) mat(i,j) if(ioerr>0) mat(i,j)=nanx endif l1=l1+l2 enddo enddo close(myunit) contains function countitems(line) result(val) character(len=*) :: line integer :: val integer :: i,n val=1 n=len(line) do i=1,n if(line(i:i)==",") val=val+1 enddo end function subroutine getline character(len=8192) :: buffer integer :: lsize line='' do read(myunit,"(a)",advance="no",iostat=ioerr,size=lsize) buffer if(ioerr>0) then print *,"loadmat can't read first line: ",filename stop endif line=line//buffer(:lsize) if(ioerr<0) exit enddo end subroutine end function
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi
@Ulrich M .............. exactly what I was looking for. I replaced the dynamic memory component by a static section of 8192*n and it works like a charm performance-wise.
thanks a lot
cheers
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Stream I/O
Why don't you use an approach where c01 is never fully initialised. This could be achieved by using stream I/O and only fill c01 to the length of the line recovered.
The following get_line (string_length, line_string) retrieves each line, assuming a text string (ichar = 32:127), terminated by a <LF>.
all other control characters are ignored (including <CR> and <HT>).
If the file contains more complex characters then get_line would need to be modified. This approach never initialises c01 beyond string_length.
implicit none character(80):: file_name character(1000000) :: c01 integer iostat, line_length, nl, mc file_name = 'data.csv' open (unit=55,file=file_name,access='stream', action='read',iostat=iostat) write (*,*) 'opening ',trim(file_name),' : iostat =',iostat mc = 0 do nl = 1, huge(nl) call get_line (line_length, c01) if ( line_length < 0) exit mc = max (mc,line_length) end do write (*,*) nl-1,' lines recovered, max length = ',mc end subroutine get_line (line_length, line_string) implicit none integer, intent(out) :: line_length character, intent(out) :: line_string*(*) integer l, ic, iostat character ch ! line_length = 0 do read (55,iostat=iostat) ch if ( iostat/=0) write (*,*) 'iostat=',iostat, ' line_length=',line_length if ( iostat < 0 .and. line_length==0) then line_length = iostat ! mark end of file as line_length = -1 return end if if ( iostat /= 0 ) then ic = 0 ! mark any other error as end of text else ic = ichar (ch) end if if ( ic >= 32 ) then line_length = line_length+1 line_string(line_length:line_length) = ch else if ( ic == 13 ) cycle ! windows CR, wait for LF if ( ic == 10 ) exit ! unix and windows LF write (*,*) 'ic =',ic ! any control character 0:31 if ( ic == 0 ) exit ! C end of text end if end do ! write (*,11) line_length, line_string(1:line_length) 11 format (i3,' : ',a) end subroutine get_line
Once you adopt a stream I/O approach, it is very easy to adapt to reviewing a data file to identify the types of characters that may be used and also the types of field delimiters in the data, eg spaces, commas, tabs, date/time or others.
I have attached an expanded example, including get_file_size, but many other options are possible.
??? "An AJAX HTTP request terminated abnormally."
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page