Advice please on a better way to read an enormous text file.

michael_green · ‎05-04-2010

Hi All,

I have been processing pairs of enormous text files, reading values as reals, manipulating those values, then outputting results to a third enormous text file. The data looks like this, with header info on the first 6 lines, then serious data to follow:

ncols 12095

nrows 17716

xllcorner 114.97464999692

yllcorner -35.064354662821

cellsize 0.000226

NODATA_value -9999

-9999 -9999 -9999 -9999 -9999 -9999 27.532 -26.49 -9999 -9999 10.6 -9999 .... etc for 12095 columns and 17716 rows.

I open the file with the following statement:
open(1,file=grid1,status='old',form='formatted',recordtype='stream_lf',recl=100000,iostat=ios,err=1000)

and after trivial reads of the header information read each record with:

read(1,'(A)',iostat=ios)line !Where line is character(100000)

Then I pick my way along the line looking for the space delimiters (these are not regularly placed) and read the values into an array of reals. The output method is an approximate inverse of the above.

It works, but it's very slow for files of this size. Is there a better way?

With many thanks in advance.

Mike

anthonyrichards · ‎05-04-2010

It appears your priority is to trade fixed format for more compact data, saving on extra blank spaces? Why?

Import your records into Excel, use a macro to convert text to cells using the selected delimiter, then do your sums by calling a Fortran DLL, then export your data to a file with tab or comma delimiters?

jimdempseyatthecove · ‎05-04-2010

Have you considered multi-threading the scan for blanks and conversion from text to real?

Jim

anthonyrichards · ‎05-04-2010

If you prepare the data set with comma delimiters in NAMELIST style, then a NAMELIST read will read the data in OK. For example,

real(4) mynumbers(24)
namelist /myrecord/ mynumbers

open(1,file="datafile.txt",form="formatted",status="unknown")
read(1,NML=myrecord)

will read the following file OK and convert apparent integers to real(4)

&myrecord
mynumbers=
-9999,-9999,-9999,-9999,-9999,-9999,27.532,-26.49,-9999,-9999,10.6,-9999,
-9999,-9999,-9999,-9999,-9999,-9999,27.532,-26.49,-9999,-9999,10.6,-9999
&end

NAMELIST will accept 6*-9999 to represent 6 consecutive -9999 values, so you can save even more storage space if you write your output data in this form ready for NAMELIST input, if that is one of your requirements (however, NAMELIST output will impose a fixed length for each output value, so it will not produce similar compact data).
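As a rough illustration of the repeat-count form, here is a minimal sketch. The file name `nml_demo.txt`, the array size of 12 and the program name are made up for the example; the program writes its own small namelist file and reads it back, so it does not depend on any real data set.

```fortran
program namelist_repeat_demo
  implicit none
  real    :: mynumbers(12)
  integer :: u, ios
  namelist /myrecord/ mynumbers

  ! Write a small namelist file that uses the r*c repeat form
  ! (hypothetical file name, used only for this demonstration).
  open(newunit=u, file='nml_demo.txt', status='replace', action='write')
  write(u,'(A)') '&myrecord'
  write(u,'(A)') 'mynumbers = 6*-9999, 27.532, -26.49, 2*-9999, 10.6, -9999'
  write(u,'(A)') '/'
  close(u)

  ! Read it back; the integer-looking values are converted to real.
  open(newunit=u, file='nml_demo.txt', status='old', action='read')
  read(u, nml=myrecord, iostat=ios)
  close(u, status='delete')
  if (ios /= 0) error stop 'namelist read failed'

  print '(12F10.3)', mynumbers
end program namelist_repeat_demo
```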

bmchenry · ‎05-04-2010

I might suggest looking at revising how the output is generated.
What generates a 12095-column output?
Why not simply write the output and mark the end of each 'record' with a special character or similar?
Then the input would not require the long read and breakdown.

If modifying the input stream is not possible, then I'd suggest looking at modifying HOW you 'pick along the line looking for space delimiters'.
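One possible way to walk along the line, sketched under the assumption that values are separated by one or more blanks: use the `verify` intrinsic to skip blanks and `index` to find the end of each token, then convert the token with an internal read. The sample line, the array size and the program name are invented for the example, and whether this is faster than the existing code would have to be measured.

```fortran
program split_demo
  implicit none
  ! Sample line with irregular spacing (made up for this sketch).
  character(len=*), parameter :: line = '-9999 -9999  27.532 -26.49 10.6 -9999'
  real    :: vals(16)
  integer :: n, pos, first, last, ios

  n   = 0
  pos = 1
  do
    ! Skip blanks: offset of the next non-blank character, 0 if none remain.
    first = verify(line(pos:), ' ')
    if (first == 0) exit
    first = pos + first - 1

    ! Find the blank that ends this token (0 means the token runs to the end).
    last = index(line(first:), ' ')
    if (last == 0) then
      last = len(line)
    else
      last = first + last - 2
    end if

    n = n + 1
    if (n > size(vals)) error stop 'too many values'
    read(line(first:last), *, iostat=ios) vals(n)
    if (ios /= 0) error stop 'bad number'

    pos = last + 1
  end do

  if (n /= 6) error stop 'unexpected token count'
  print *, n, vals(1:n)
end program split_demo
```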

brian

michael_green · ‎05-04-2010

Hi Guys,

Thanks for the replies. I should have been a little clearer about the source data - it comes from a third party package (ArcGIS) so I am unable to change it.

I would like to know a little more about Jim Dempsey's suggestion on multi-threading - how could I apply that to this problem?

Many thanks
Mike

John4 · ‎05-05-2010

Since it's a text file, you could simply use the list-directed feature:

open(1,file=grid1,status='old',form='formatted',recordtype='stream_lf',recl=100000,iostat=ios,err=1000)

...

!read the first six lines of the file here

...

allocate (some_data(ncols, nrows))

read (1, *, iostat = ios) some_data

if (ios /= 0) ...

Try it, if only to see how fast it is compared to what you're using now.
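To make that concrete, here is a small self-contained sketch. It writes a tiny 3 x 2 test file in the same layout as the grid described at the top of the thread and reads it back with list-directed input. The file name `tiny_grid.asc` and the variable names are assumptions for the example only.

```fortran
program read_grid_demo
  implicit none
  character(len=32) :: key
  integer :: u, ios, ncols, nrows
  real(8) :: xll, yll, cellsize
  real    :: nodata
  real, allocatable :: grid(:,:)

  ! Write a tiny test file (hypothetical name) with 6 header lines and 2 data rows.
  open(newunit=u, file='tiny_grid.asc', status='replace', action='write')
  write(u,'(A)') 'ncols 3', 'nrows 2', 'xllcorner 114.97464999692', &
                 'yllcorner -35.064354662821', 'cellsize 0.000226', &
                 'NODATA_value -9999', '-9999 27.532 -26.49', '10.6 -9999 -9999'
  close(u)

  open(newunit=u, file='tiny_grid.asc', status='old', action='read')
  ! Each header line is a keyword followed by a value.
  read(u, *) key, ncols
  read(u, *) key, nrows
  read(u, *) key, xll
  read(u, *) key, yll
  read(u, *) key, cellsize
  read(u, *) key, nodata

  ! Column-major storage: grid(i,j) holds column i of file row j,
  ! so one list-directed read fills the array in file order.
  allocate(grid(ncols, nrows))
  read(u, *, iostat=ios) grid
  close(u, status='delete')
  if (ios /= 0) error stop 'grid read failed'

  print *, 'first row: ', grid(:,1)
  print *, 'second row:', grid(:,2)
end program read_grid_demo
```

Note that the array is indexed (column, row), so a whole file row is contiguous in memory.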

jimdempseyatthecove · ‎05-05-2010

Untested, non-optimized sketch

[bash]! Simple method (first attempt)
! Needs "use omp_lib"; line, ios, ncols, nrows and bigArray(ncols,nrows) are declared elsewhere.

type myType
  integer :: iCellCount
  real, allocatable :: cellData(:)
end type myType

type(myType), allocatable :: threadData(:)

iRow = 0
iMaxThreads = omp_get_max_threads()
allocate(threadData(iMaxThreads)) ! add error test
iMaxThreadCells = (ncols / iMaxThreads) * 2 ! larger than worst case/thread
do i=1,iMaxThreads
   allocate(threadData(i)%cellData(iMaxThreadCells)) ! error test
end do

! file read loop
do
  read(1,'(A)',iostat=ios)line      !Where line is character(100000)
  if(ios /= 0) exit
  iRow = iRow + 1
  if(iRow > nrows) call PrintErrorAndAbort()
  iLastChar = len_trim(line) ! or a faster way to find the last character of the line
!$omp parallel private(iThread, iFill)
  iThread = omp_get_thread_num() + 1 ! use 1-based thread number
  threadData(iThread)%iCellCount = 0
  if(iThread == 1) then
    ! special case for first number on line (assumes no leading blank)
    threadData(iThread)%iCellCount = 1
    read(line(1:iLastChar), *) threadData(iThread)%cellData(1)
  endif
  ! static schedule: each thread gets one contiguous chunk, in thread order
!$omp do schedule(static)
  do i=1,iLastChar
    if(line(i:i) == ' ') then ! assumes a single blank between values
      threadData(iThread)%iCellCount = threadData(iThread)%iCellCount + 1
      if(threadData(iThread)%iCellCount > iMaxThreadCells) call PrintErrorAndAbort()
      read(line(i+1:iLastChar), *) threadData(iThread)%cellData(threadData(iThread)%iCellCount)
    endif
  end do
!$omp end do
  ! implicit barrier above: every thread's cell count is final here
  iFill = 0
  do i=1,iThread-1
    iFill = iFill + threadData(i)%iCellCount
  end do
  do i=1,threadData(iThread)%iCellCount
     bigArray(iFill + i, iRow) = threadData(iThread)%cellData(i)
  end do
!$omp end parallel
end do
[/bash]

Jim

michael_green · ‎05-06-2010

Thanks Jim, this is brilliant stuff - I have never seen this sort of thing before and I'm going to learn heaps.

I have come across a problem immediately though, I've got ...

use omp_lib
.
.
i = omp_get_max_threads()

This compiles but won't link - "unresolved external symbol". What do I need to do?

Many thanks
Mike

jimdempseyatthecove · ‎05-06-2010

You must enable OpenMP as a compiler option.

In VS Solution pane

Right-Click on your project
| Properties
| Fortran
| Language
| Process OpenMP directives
| (select) Generate Parallel Code

Or from command line add /Qopenmp
Then on link line add the appropriate OpenMP library
VS automatically adjusts the link line
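A quick way to confirm that OpenMP is really enabled is a small test program like the sketch below (the program name is arbitrary). If it links and reports more than one thread on a multi-core machine, the compiler option has taken effect.

```fortran
program check_omp
  use omp_lib
  implicit none
  integer :: nthreads

  nthreads = 0
  ! One thread of the team records how many threads are actually running.
  !$omp parallel
  !$omp single
  nthreads = omp_get_num_threads()
  !$omp end single
  !$omp end parallel

  print *, 'max threads:               ', omp_get_max_threads()
  print *, 'threads in parallel region:', nthreads
end program check_omp
```

If the unresolved external error is still there, the OpenMP option is most likely not being applied to this project or configuration.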

The sketch code has typographical errors. The intent is to provide you with an overview of a simple parallel process.

After you get this working, you can decide if you want to spend additional time on improving this section of your code. Effort spent on parallelizing other code in your application might be a better choice. The code sketch I provided does not overlap the reading of the line with the conversion from text to REAL. An improvement can be attained by overlapping I/O with conversion.

Jim