Reading selective data from a sequential access file

dboggs · ‎11-21-2014

My program needs to read data in from blocks deep inside of a sequential access file. It does not know a priori how long each block is, but the blocks are terminated by a line beginning with a -1 flag. The first block may have data to be read into array A1, the second block to be read into array A2, etc. The arrays are to be dynamically allocated to contain however much data is present in the file.

How can I do this?

If there were only one set of data records, my approach would be to read only the first number from each record, counting the no. of records until -1 is found. Then the array can be allocated, I then have a count of the number of records. Then I would allocate A1, REWIND the file, and begin reading again, this time storing all of the numbers on each line into the appropriate location in A1.

But when there are multiple blocks this method does not work so well. After the first read of the second block of records I cannot simply rewind the file; I need to back up the position by the no. of records in the second block, say NREC2 to start the second read. I could do this by DO I=1,NREC2; BACKSPACE; END DO. I have read that this is a waste of resources and not recommended.

I could also REWIND the file after the first readthrough of any block, then advance to the beginning of the desired block to start the 2nd read. This could be done using DO I=1,N; READ (*); END DO. Not much better.

I could do the first read through the ENTIRE data file, noting the record no. of each one of the -1 block-end flags and thus the no. of blocks and the amount of data in each, then allocate all of the arrays, then REWIND the file, and then begin reading the data. But this sequence is not always possible--sometimes I need to read the data from only a particular block in the middle of the file.

If I were processing a binary file I could make great use of the intrinsic (Portability) function FSEEK, which can move the position a specified no. of bytes from the current position. I wish there were a similar function that worked on records instead of bytes.

Ideas?

mecej4 · ‎11-21-2014

How big is the file? Can its entire contents be loaded into a character array?

As a preliminary step, you can use text utilities to count the number of lines in the file, and the number of lines containing '-1' at the beginning. If the entire file contents can be stored in RAM, you can read the file and note the line numbers containing the -1s, and store the indices into an array. No rewinding, backspacing and seeking necessary.

jimdempseyatthecove · ‎11-21-2014

Here is a suggestion:

Assuming the program that writes the file does not know in advance where the record breaks will occur and possibly assume it doesn't know the number of records.

As this program writes the file it creates an index of where the records are located in the file. After all the records are written, the current size of the file (or number of bytes written is remembered, then the record index is written to the file, followed by the index of the index (last record). Then close the file.

Some time later, open the file, inquire as to the size, seek to 8 bytes prior to the end (assuming integer*8 indexes), read the index of the index record, seek to the index record, read the index. Now you know where all the records are located. If the index record is too large, you know where it is located and how large it is (number of records) therefore you can read chunks of indexes to suit your needs.

Jim Dempsey

John_Campbell · ‎11-21-2014

There are a number of characteristics of your file you have not explained.

How many blocks do you expect and how many do you want to retrieve.
How many times you expect to use the file in separate runs of the project (once or multiple)
How many files are to be used.
Does each block have a similar structure or vary.
How big is the file. It could be a single analysis or national census data!

My preference would be to read it twice, and counting the size of each record, so that you could allocate space for each array you intend to use. Depending on the file size ( say < 2gb) I'd expect good file buffering on the second read. For a relatively simple project this would be easiest. I read survey points files of up to 1gb twice to first scan for range and check for consistency.
If the file was to be used many times or many files, I'd write a separate utility to restructure the data as a fixed length record direct addressing binary file, with an index of the location and size of each block. Who generated the data, maybe they should?
If a more extensive data system, the utility could possibly generate a directory structure, with separate structured file names for each block, then archive the original data. If this is required, it is not too difficult to generate this as the data is being created, assuming you have control of this stage of the project.

Treat the data acquisition as a separate project. Then again the file may only be a few megabytes and so store the lot and have the generated index structure refer to memory. Use an allocatable array bigger than the file size and there will be no impact once the data has been read, something like attached might work.

John

JVanB · ‎11-21-2014

Formatted stream input looks good here. Input data file ('file.txt'):

          -1
           1 stuffing ( 0.93247222940435581     , 0.36124166618715292     )
           2 mashed_potatoes ( 0.73900891722065909     , 0.67369564364655721     )
           3 gravy ( 0.44573835577653831     , 0.89516329135506234     )
           4 turkey (  9.2268359463302016E-002, 0.99573417629503447     )
          -1
           5 cranberry_sauce (-0.27366299007208289     , 0.96182564317281904     )
           6 pumpkin_pie (-0.60263463637925629     , 0.79801722728023960     )
           7 ham (-0.85021713572961399     , 0.52643216287735606     )
          -1
           8 caviar (-0.98297309968390179     , 0.18374951781657037     )
           9 wine (-0.98297309968390179     ,-0.18374951781657012     )
          10 sweet_potatoes (-0.85021713572961410     ,-0.52643216287735584     )
          11 pecan_pie (-0.60263463637925718     ,-0.79801722728023894     )

Program:

module M
   implicit none
   integer, parameter :: dp = kind([double precision::])
   type T
      integer I
      character(len=:), allocatable :: S
      complex(dp) Z
   end type T
   contains
      subroutine sub(iunit,pos,N)
         integer, intent(in) :: iunit
         integer, intent(in) :: N
         integer, intent(in) :: pos
         type(T), allocatable :: V(:)
         integer i
         character(len=20) C

         allocate(V(N))
         i = 1
         if(N > 0) then
            read(iunit,*,pos=pos) V(i)%I, C, V(i)%Z
            V(i)%S = C
         end if
         do i = 2, N
            read(iunit,*) V(i)%I, C, V(i)%Z
            V(i)%S = C
         end do
         write(*,'(a,i0)') 'Number of records in block = ',N
         do i = 1,N
            write(*,*) V(i)%I,V(i)%S,V(i)%Z
         end do
      end subroutine sub
end module M

program P
   use M
   implicit none
   integer N
   integer iunit
   integer sentinel
   integer pos1,pos2

   open(newunit=iunit,file='file.txt',access='stream',form='formatted',status='old')
   pos1 = -1
   do
      sentinel = 0
      do N = 0, huge(N)
         read(iunit,*,end = 10) sentinel
         if(sentinel == -1) exit
      end do
      10 continue
      if(pos1 < 0) then
         inquire(iunit,pos=pos1)
         if(sentinel >= 0) exit
         cycle
      end if
      if(sentinel < 0) then
         inquire(iunit,pos=pos2)
      end if
      call sub(iunit,pos1,N)
      pos1 = pos2
      if(sentinel >= 0) exit
      read(iunit,*) sentinel
   end do
end program P

Output:

Number of records in block = 4
           1 stuffing             (0.932472229404356,0.361241666187153)
           2 mashed_potatoes      (0.739008917220659,0.673695643646557)
           3 gravy                (0.445738355776538,0.895163291355062)
           4 turkey               (9.226835946330202E-002,0.995734176295034)
Number of records in block = 3
           5 cranberry_sauce      (-0.273662990072083,0.961825643172819)
           6 pumpkin_pie          (-0.602634636379256,0.798017227280240)
           7 ham                  (-0.850217135729614,0.526432162877356)
Number of records in block = 4
           8 caviar               (-0.982973099683902,0.183749517816570)
           9 wine                 (-0.982973099683902,-0.183749517816570)
          10 sweet_potatoes       (-0.850217135729614,-0.526432162877356)
          11 pecan_pie            (-0.602634636379257,-0.798017227280239)

JVanB · ‎11-21-2014

Why is my avatar so pixellated any more?

I liked my solution so much that I tried to tidy it up via a nonadvancing read:

      subroutine sub(iunit,pos,N)
         integer, intent(in) :: iunit
         integer, intent(in) :: N
         integer, intent(in) :: pos
         type(T), allocatable :: V(:)
         integer i
         character(len=20) C

         allocate(V(N))
!DEC$ IF(.FALSE.)
         i = 1
         if(N > 0) then
            read(iunit,*,pos=pos) V(i)%I, C, V(i)%Z
            V(i)%S = trim(C)
         end if
         do i = 2, N
!DEC$ ELSE
         read(iunit,'()',pos=pos,advance='no')
         do i = 1, N
!DEC$ ENDIF
            read(iunit,*) V(i)%I, C, V(i)%Z
            V(i)%S = trim(C)
         end do
         write(*,'(a,i0)') 'Number of records in block = ',N
         do i = 1,N
            write(*,*) V(i)%I,V(i)%S,V(i)%Z
         end do
      end subroutine sub

With !DEC$ IF(.TRUE.) the output is as before, except that the strings are trimmed as originally intended. With !DEC$ IF(.FALSE.) as above, I thought the output should be the same, the empty READ simply serving to reposition the file pointer. However, what I got was:

Number of records in block = 4
           1 stuffing (0.932472229404356,0.361241666187153)
           1 stuffing (0.932472229404356,0.361241666187153)
           2 mashed_potatoes (0.739008917220659,0.673695643646557)
           3 gravy (0.445738355776538,0.895163291355062)
Number of records in block = 0
Number of records in block = 3
           5 cranberry_sauce (-0.273662990072083,0.961825643172819)
           5 cranberry_sauce (-0.273662990072083,0.961825643172819)
           6 pumpkin_pie (-0.602634636379256,0.798017227280240)
Number of records in block = 0
Number of records in block = 4
           8 caviar (-0.982973099683902,0.183749517816570)
           8 caviar (-0.982973099683902,0.183749517816570)
           9 wine (-0.982973099683902,-0.183749517816570)
          10 sweet_potatoes (-0.850217135729614,-0.526432162877356)

Is the bug in my program or the compiler?

JVanB · ‎11-22-2014

OK, I have composed a simpler example that more clearly illustrates the bug in Quote #6 and that gfortran can handle (gfortran had a failure reading the complex numbers in Quote #5). Here is the Fortran program:

program P
   implicit none
   integer i
   integer pos
   integer sentinel
   integer, parameter :: N = 4
   integer posmat(N)
   integer V(3,N)

   open(10, file = 'rf1.txt', access = 'stream', form = 'formatted', status = 'old')
   read(10, *) sentinel
   inquire(10, pos = pos)
   do i = 1, N
      read(10, *) sentinel
   end do
   read(10, '()', pos = pos, advance = 'no')
   do i = 1, N
      inquire(10, pos = posmat(i))
      read(10, *) V(:,i)
   end do
   do i = 1, N
      write(*, *) posmat(i)
      write(*, *) V(:,i)
   end do
end program P

Here is the input file, rf1.txt:

Output with gfortran:

           6
           1           2           3
          17
           4           5           6
          28
           7           8           9
          39
          10          11          12

Output with ifort:

           6
           1           2           3
           6
           1           2           3
          17
           4           5           6
          28
           7           8           9

That empty READ seems to reposition the file pointer as desired but in ifort it also results in repositioning after the first record read as well. As seen in Quote #5, ifort experienced no difficulties if there was no empty read, but in this case it required duplication of code to avoid it.

JVanB · ‎11-22-2014

Oh, I guess this is already # DPD200362669.

JVanB · ‎11-22-2014

I noticed in https://software.intel.com/en-us/forums/topic/534662#comment-1803178 it is claimed that a non-empty READ works around the bug. However the following program still triggers the problem:

program P
   implicit none
   integer i
   integer pos
   integer sentinel
   integer, parameter :: N = 4
   integer posmat(N)
   integer V(3,N)
   character c

   open(10, file = 'rf1.txt', access = 'stream', form = 'formatted', status = 'old')
   read(10, *) sentinel
   inquire(10, pos = pos)
   do i = 1, N
      read(10, *) sentinel
   end do
   read(10, '(a)', pos = pos, advance = 'no') c
   do i = 1, N
      inquire(10, pos = posmat(i))
      read(10, *) V(:,i)
   end do
   do i = 1, N
      write(*, *) posmat(i)
      write(*, *) V(:,i)
   end do
end program P

This was one of my attempted workarounds, but it still didn't work. Perhaps the fix for # DPD200362669 should be tested against this example as well. Indeed, if the input file in https://software.intel.com/en-us/forums/topic/534662#comment-1803178 had a single-digit number in the first column, then the nonadvancing READ would have read it and the subsequent READ(*,*) would have read the record to the end, backed up to its beginning as the bug demands, then read it again, so it wouldn't have avoided the bug, just made the double READ of the first record end up doing what was intended for a single READ.

dboggs · ‎11-23-2014

Thanks for everyone for their quick response and intriguing ideas. The mere act of reading through them causes me to realize some things I would not have otherwise.

For those that wondered: Each block of records consists currently of 100 or so lines but could eventually expand to 100,000. Each line contains, on average, 6 values. There are 4 to 10 blocks, and during execution the operator may add additional blocks in real time for as many as he wants (usually 3 to 10) The structure is constant for all lines within a block, but different blocks have different structures: different no. of values or different mix of reals and integers. This significantly complicates the problem. In addition all records may be sprinkled with "comment lines" that start with a predefined character (usually '*'); these must be read and counted as records for file positioning but not processed for data values.

My takeaway after this is that file positioning using repeated READ or BACKSPACE, even thought the simplest and most straightforward, is probably not the best way to go and certainly not very clever. For the time being--before my project expands to the scale that it may or may not eventually achieve.

For the time being I am making a first pass to read the first character of the entire file, during which I can assemble info about the no. of blocks and no. of lines belonging to each block, and the no. of data lines belonging to each block in order to allocate storage. Then it is fairly easy to make a second pass through the entire file (which as pointed out should enjoy a speed benefit through buffering), reading all of the data. Comment lines are easily skipped because attempting to read them using list-directed formatting produces a trappable error condition. This is not particularly robust, and is not the not the pseudo-direct-access concept I originally envisioned, but it works for now and is simple.

dboggs · ‎11-23-2014

It's a pity that there is no intrinsic function to move the current file position to a given record, inquire the current record no., or move the position a given relative no. of records from the current position.

Call this a request for the future.

mecej4 · ‎11-23-2014

dboggs wrote:

It's a pity that there is no intrinsic function to move the current file position to a given record, inquire the current record no., or move the position a given relative no. of records from the current position.

Call this a request for the future.

You want to do all this with a sequential access file, as you stated in the first post? Don't you see the contradiction?

Can you not open and process the file as a direct access file? Accessing direct access file records sequentially, if needed, is not difficult. The converse, i.e., accessing sequential access file records in non-sequential order, which is what you seem to be asking for, should probably not be allowed.

dboggs · ‎11-23-2014

The file cannot be opened as direct access because the records are of different length.

There is no contradiction in processing a sequential access file in a "pseudo-random-access" manor. I simply want to move the current position around the file, up or down, by a given number of records, and to inquire what the current position is. This chain just demonstrated much discussion of how it could be accomplished.

I could easily write such a function myself, but it would involve inefficient combinations of REWIND, READ, and BACKSPACE. It seems the compiler writers could come up with a better way to do it--or at least a more convenient way for the user.

IanH · ‎11-23-2014

My guideline is not to use rewind and backspace on general text input - mainly because if the text input is coming from the console (or a pipe) it really wears out my fingers to have to re-enter manually all the data when the program executes REWIND.

Instead I tend to read once into a geometrically growing buffer. This means that there is overhead associated with reallocation and copying of the buffer when it gets full, but that overhead needs to be kept in perspective when it is a serial activity with reading a file that may be coming from a spinning hard disk or across a network. An example attached.

!*******************************************************************************
!!
!> Read an arbitrary number of real numbers terminated by a sentinel value 
!! from a sequential formatted file.

PROGRAM ReadSomeReals
  
  IMPLICIT NONE
  
  !-----------------------------------------------------------------------------
  ! Program global constants.
  
  !> Name of the test file.
  CHARACTER(*), PARAMETER :: filename = 'ReadSomeReals.txt'
  
  !> Format specification for reading a single record.  Only the width is 
  !! particularly significant.
  CHARACTER(*), PARAMETER :: fmt = "(F10.0)"
  
  ! Sentinel value that terminates the block of data.
  REAL, PARAMETER :: sentinel = -1.0
  
  !*****************************************************************************
  
  CALL execute
  
CONTAINS
  
  !*****************************************************************************
  !!
  !> Run the test.
  
  SUBROUTINE execute
    
    INTEGER :: iostat         ! IOSTAT result.
    INTEGER :: unit           ! Logical unit number.
    
    ! The array that results from the read.
    REAL, ALLOCATABLE :: array(:)
    
    !*****************************************************************************
    
    OPEN(NEWUNIT=unit, FILE=filename, STATUS='OLD', ACTION='READ')
    
    CALL read_some_data(unit, array, iostat)
    IF (iostat /= 0) STOP 'It didn''t work.'
    
    PRINT "('Got ',I0,' numbers that total ',F0.1)", SIZE(array), SUM(array)
    
    CLOSE (unit)
    
  END SUBROUTINE execute
  
  
  !*****************************************************************************
  !!
  !> Read sentinel terminated data.
  
  SUBROUTINE read_some_data(unit, array, iostat)
    
    !---------------------------------------------------------------------------
    ! Local constants.
    
    ! Initial size of the buffer.  Ideally a goodly proportion of data to 
    ! be read would fit in this size, but it doesn't matter if data sizes that 
    ! exceed this are encountered.
    INTEGER, PARAMETER :: initial_size = 10
    
    !---------------------------------------------------------------------------
    ! Arguments
    
    !> Number of the unit to read from.
    INTEGER, INTENT(IN) :: unit
    
    !> The array of data.
    REAL, INTENT(OUT), ALLOCATABLE :: array(:)
    
    !> IOSTAT code - non-zero on error.
    INTEGER, INTENT(OUT) :: iostat
    
    !---------------------------------------------------------------------------
    ! Other locals.
    
    INTEGER :: i              ! Data element index.
    REAL :: d                 ! Temporary value.
    
    ! Temporary used to grow @a array.
    REAL, ALLOCATABLE :: tmp(:)
    
    !***************************************************************************
    
    i = 0
    ALLOCATE(array(initial_size))
    
    ! Loop until we run out of data or file or smoke.
    DO
      ! Read a value from a record.
      READ (unit, fmt, IOSTAT=iostat) d
      
      ! Is end of file an error or does it just indicate end of data?
      IF (IS_IOSTAT_END(iostat)) THEN
        RETURN                ! Choose this for "it is an error".
        ! EXIT                ! Choose this for "it is end of data".
      END IF
      
      ! Bail on error.
      IF (iostat /= 0) RETURN
      
      ! Have we hit the end?
      IF (d == sentinel) EXIT
      
      ! Do we have room tp store the value?
      i = i + 1
      IF (i > SIZE(array)) THEN
        ! Our array is not big enough.  Grow it.
        ALLOCATE(tmp(SIZE(array) * 2))
        tmp(:SIZE(array)) = array
        CALL MOVE_ALLOC(tmp, array)
      END IF
      ! Store the value.
      array(i) = d
    END DO
    
    ! Chop the result array back to the right size.  A zero size 
    ! array is possible and is handled ok.
    array = array(:i)
    
  END SUBROUTINE read_some_data
  
END PROGRAM ReadSomeReals

Others advocate using a linked list of data blocks to cut down on the number of times that data has to be copied around in memory. I tried to put an example of this together that also used length type parameters for extra spicy sauce, but it blew ifort's little mind.

John_Campbell · ‎11-23-2014

>>> "but different blocks have different structures: different no. of values or different mix of reals and integers. "

You should devise some header record for each block, which indicates the type of data format to follow. It would be good if you could have a data structure, such that each block could be read independently, much like most free format finite element data sets.

The block could start with a keyword, say $JOINTS or $SILLY_DATA and end with a blank line.

There is a lot to be said for simple text file data structures, as they can be reviewed or edited with notepad or similar, especially when "users" are generating the data and want to correct it. I would have only a few values per line. Only 1 value per line costs little in file size and allows for easy import into excel if there is a problem with the data and needs to be checked independently. Those "users" make lots of mistakes so a simple layout is best for checking and support. This could save a lot of time in the future.

John

dboggs · ‎11-24-2014

John, all of your points are valid and well advised, but they already apply to my situation. Indeed, this is the reason my file was set up like it is. The only point I was making about the "different structures" is that the file cannot be opened as random access (at least as I understand it).

JVanB · ‎11-24-2014

I don't get why you are ignoring the method the standard provides of reading the files as formatted stream. Given that the file wasn't created in a format that contained pointers to its internal structure in the first place, it's going to be stored as a stream of bytes on disk and your program will have to read through it at least to find the EOR marks (ACHAR(13)//ACHAR(10) on Windows) anyway. Then reading the first byte of each record costs you nothing, and as noted in Quote #5 as you discover the critical points in the file you can use the INQUIRE statement with the POS= inquiry specifier to record the file pointers, and then a READ statement with the POS= io control spec to seek to the file pointer recorded earlier. This does the same thing as FSEEK does, but in a file opened in FORMATTED form.

mecej4 · ‎11-24-2014

#13 wrote:
There is no contradiction in processing a sequential access file in a "pseudo-random-access" manor. I simply want to move the current position around the file, up or down, by a given number of records, and to inquire what the current position is.

Sorry, "sequential" means only one sequence: (i) records 1,2,3,...,n, but you can stop at some j < n. If you want any other sequences, you have to do dummy reads, rewind and backspace as needed to reach those records. You cannot do 1,3,5,7 or 1,1, 1, 2, 3, 5, 8, 13, 21, 34 or other such sequences without wasteful read-forwards or backspace-s.

How about the following approach? Find out the largest record size in your file, and create a direct access file with that record size, and copy all the records from the original file, padded up to the maximal record size, to the new file. Then, do your processing on the new file.