Intel® Fortran Compiler

Advice please on a better way to read an enormous text file.

michael_green
Beginner
Hi All,

I have been processing pairs of enormous text files, reading values as reals, manipulating those values, then outputting results to a third enormous text file. The data looks like this, with header info on the first 6 lines, then serious data to follow:

ncols 12095
nrows 17716
xllcorner 114.97464999692
yllcorner -35.064354662821
cellsize 0.000226
NODATA_value -9999
-9999 -9999 -9999 -9999 -9999 -9999 27.532 -26.49 -9999 -9999 10.6 -9999 .... etc for 12095 columns and 17716 rows.

I open the file with the following statement:
open(1,file=grid1,status='old',form='formatted',recordtype='stream_lf',recl=100000,iostat=ios,err=1000)

and after trivial reads of the header information read each record with:

read(1,'(A)',iostat=ios)line !Where line is character(100000)

Then I pick my way along the line looking for the space delimiters (these are not regularly placed) and read the values into an array of reals. The output method is an approximate inverse of the above.

It works, but it's very slow for files of this size. Is there a better way?

With many thanks in advance.

Mike

9 Replies
anthonyrichards
New Contributor III
It appears your priority is to trade fixed format for more compact data, saving on extra blank spaces? Why?

Import your records into Excel, use a macro to convert text to cells using the selected delimiter, then do your sums by calling a Fortran DLL, then export your data to a file with tab or comma delimiters?
jimdempseyatthecove
Honored Contributor III

Have you considered multi-threading the scan for blanks and conversion from text to real?

Jim

anthonyrichards
New Contributor III
If you prepare the data set with comma delimiters in NAMELIST style, then a NAMELIST read will read the data in correctly. For example,

real(4) mynumbers(24)
namelist /myrecord/ mynumbers

open(1,file="datafile.txt",form="formatted",status="unknown")
read(1,NML=myrecord)

will read the following file OK and convert apparent integers to real(4)

&myrecord
mynumbers=
-9999,-9999,-9999,-9999,-9999,-9999,27.532,-26.49,-9999,-9999,10.6,-9999,
-9999,-9999,-9999,-9999,-9999,-9999,27.532,-26.49,-9999,-9999,10.6,-9999
&end

NAMELIST will accept 6*-9999 to represent 6 consecutive -9999 values, so you can save even more storage space if you write your output data in this form ready for NAMELIST input, if that is one of your requirements (however, NAMELIST output will impose a fixed length for each output value, so it will not produce similar compact data).
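
To illustrate the repeat-count form, here is a minimal sketch. It assumes a scratch file standing in for the real data file, a 12-element array chosen only for the example, and it closes the group with `/`, the standard NAMELIST terminator, instead of the `&end` form shown above.

```fortran
program namelist_repeat
  implicit none
  real(4) :: mynumbers(12)          ! size chosen only for this example
  namelist /myrecord/ mynumbers
  integer :: u, ios

  ! Write a small NAMELIST group to a scratch file (stand-in for datafile.txt).
  open(newunit=u, status='scratch', form='formatted')
  write(u,'(A)') '&myrecord'
  write(u,'(A)') 'mynumbers = 6*-9999, 27.532, -26.49, 2*-9999, 10.6, -9999'
  write(u,'(A)') '/'
  rewind(u)

  ! 6*-9999 expands to six consecutive -9999 values on input.
  read(u, nml=myrecord, iostat=ios)
  if (ios /= 0) error stop 'namelist read failed'
  close(u)

  ! Self-check: elements 1-6 and 9-10 come from repeat counts.
  if (any(mynumbers(1:6) /= -9999.0) .or. any(mynumbers(9:10) /= -9999.0)) &
    error stop 'unexpected values'
  print '(12F10.3)', mynumbers
end program namelist_repeat
```

The repeat count only helps where the same value occurs in runs, which is likely for NODATA cells along the edges of a grid like this one.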

bmchenry
New Contributor II
I might suggest looking at revising how the output is generated.
What generates a 12095-column output?
Why not simply output the values and, at the end of each 'record', write a special character or similar as an 'end of record' marker?
Then the input does not require the long read and breakdown.

If modifying the input stream is not possible, then I'd suggest looking at modifying HOW you 'pick along the line looking for space delimiters'.


brian
michael_green
Beginner
Hi Guys,

Thanks for the replies. I should have been a little clearer about the source data - it comes from a third party package (ArcGIS) so I am unable to change it.

I would like to know a little more about Jim Dempsey's suggestion on multi-threading - how could I apply that to this problem?

Many thanks
Mike
John4
Valued Contributor I

Since it's a text file, you could simply use the list-directed feature:

open(1,file=grid1,status='old',form='formatted',recordtype='stream_lf',recl=100000,iostat=ios,err=1000)

...

!read the first six lines of the file here

...

allocate (some_data(ncols, nrows))

read (1, *, iostat = ios) some_data

if (ios /= 0) ...

Try it, if only to see how fast it is compared to what you're using now.
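
A self-contained sketch of this approach, assuming the header is six keyword/value lines as in the original post. The file name `tiny_grid.asc`, the variable names, and the 4 x 2 test grid are made up for illustration; the program writes its own small file so it can be run as-is.

```fortran
program read_ascii_grid
  implicit none
  character(len=*), parameter :: fname = 'tiny_grid.asc'   ! hypothetical file name
  character(len=32) :: key
  integer :: ncols, nrows, u, ios
  double precision :: xll, yll, cellsize, nodata
  real, allocatable :: grid(:,:)

  ! Write a tiny grid in the same layout, with irregular spacing between values.
  open(newunit=u, file=fname, status='replace', action='write')
  write(u,'(A)') 'ncols 4'
  write(u,'(A)') 'nrows 2'
  write(u,'(A)') 'xllcorner 114.97464999692'
  write(u,'(A)') 'yllcorner -35.064354662821'
  write(u,'(A)') 'cellsize 0.000226'
  write(u,'(A)') 'NODATA_value -9999'
  write(u,'(A)') '-9999 27.532  -26.49 -9999'
  write(u,'(A)') '10.6 -9999 -9999   1.5'
  close(u)

  open(newunit=u, file=fname, status='old', action='read', iostat=ios)
  if (ios /= 0) error stop 'cannot open grid file'

  ! Header: each line is a keyword followed by one value.
  read(u,*) key, ncols
  read(u,*) key, nrows
  read(u,*) key, xll
  read(u,*) key, yll
  read(u,*) key, cellsize
  read(u,*) key, nodata

  ! The first index varies fastest, so grid(:, j) receives file row j.
  allocate(grid(ncols, nrows))
  read(u,*,iostat=ios) grid
  if (ios /= 0) error stop 'error while reading grid values'
  close(u)

  print '(A,I0,A,I0)', 'ncols=', ncols, ' nrows=', nrows
  print '(A,F10.3)', 'grid(2,1)=', grid(2,1)
  print '(A,F10.3)', 'grid(4,2)=', grid(4,2)
end program read_ascii_grid
```

List-directed input treats any run of blanks as one separator and ignores line boundaries, so irregular spacing needs no special handling. Whether it is faster than a hand-written scan on a file of the size described has to be measured.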

jimdempseyatthecove
Honored Contributor III
Untested, non-optimized sketch

[bash]! Simple method (first attempt)
! Assumes "use omp_lib", line declared as character(100000), and
! bigArray(ncols, nrows) already allocated.

type myType
  integer :: iCellCount
  real, allocatable :: cellData(:)
end type myType

type(myType), allocatable :: threadData(:)

iRow = 0
iMaxThreads = omp_get_max_threads()
allocate(threadData(iMaxThreads)) ! add error test
iMaxThreadCells = ncols ! safe bound: no thread can find more values than one row holds
do i=1,iMaxThreads
   allocate(threadData(i)%cellData(iMaxThreadCells)) ! error test
end do

! file read loop
do
  read(1,'(A)',iostat=ios)line      !Where line is character(100000)
  if(ios /= 0) exit
  iRow = iRow + 1
  if(iRow > nrows) call PrintErrorAndAbort()
  iLastChar = len_trim(line)
!$omp parallel private(iThread, iCell, iFill, i)
  iThread = omp_get_thread_num() + 1 ! use 1-based thread number
  threadData(iThread)%iCellCount = 0
  ! static schedule: each thread is expected to scan one contiguous piece of the line
!$omp do schedule(static)
  do i=1,iLastChar
    ! a value starts at a non-blank that is first on the line or follows a blank
    ! (this also covers the first number on the line and runs of blanks)
    if(line(i:i) == ' ') cycle
    if(i > 1) then
      if(line(i-1:i-1) /= ' ') cycle
    endif
    iCell = threadData(iThread)%iCellCount + 1
    if(iCell > iMaxThreadCells) call PrintErrorAndAbort()
    threadData(iThread)%iCellCount = iCell
    read(line(i:iLastChar), *) threadData(iThread)%cellData(iCell)
  end do
!$omp end do
  ! implicit barrier at end do: every thread's iCellCount is now final
  iFill = 0
  do i=1,iThread-1
    iFill = iFill + threadData(i)%iCellCount
  end do
  do i=1,threadData(iThread)%iCellCount
     bigArray(iFill + i, iRow) = threadData(iThread)%cellData(i)
  end do
!$omp end parallel
end do
[/bash]

Jim
michael_green
Beginner
Thanks Jim, this is brilliant stuff - I have never seen this sort of thing before and I'm going to learn heaps.

I have come across a problem immediately though, I've got ...

use omp_lib
.
.
i = omp_get_max_threads()

This compiles but won't link - "unresolved external symbol". What do I need to do?

Many thanks
Mike
jimdempseyatthecove
Honored Contributor III
You must enable OpenMP as a compiler option.

In VS Solution pane

Right-Click on your project
| Properties
| Fortran
| Language
| Process OpenMP directives
| (select) Generate Parallel Code

Or, from the command line, add /Qopenmp.
Then add the appropriate OpenMP library to the link line.
VS adjusts the link line automatically.
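
A small program like the following can confirm that OpenMP is enabled before going back to the larger code. It is only a sketch: the program name and the file name in the comment are made up, and the build line assumes the Intel compiler driver is invoked as `ifort`.

```fortran
! Hypothetical build line on Windows:  ifort /Qopenmp check_openmp.f90
program check_openmp
  use omp_lib
  implicit none
  integer :: nthreads

  ! This is the call that failed to link when OpenMP was not enabled.
  nthreads = omp_get_max_threads()
  print '(A,I0)', 'max threads: ', nthreads

  ! One thread reports the size of the team that actually started.
  !$omp parallel
  !$omp single
  print '(A,I0)', 'team size:   ', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program check_openmp
```

If this links and reports a team size greater than 1, the option is in effect and the OpenMP runtime library was found.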

The sketch code is untested and may contain typographical errors. The intent is to provide you with an overview of a simple parallel process.

After you get this working, you can decide if you want to spend additional time on improving this section of your code. Efforts on parallelization of other code in your application might be a better choice. The code sketch I provided does not overlap the reading of the line with the conversion from text to REAL. An improvement can be attained by overlapping I/O with conversion.
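
A coarser-grained variant of the same idea, offered as a sketch and not as the overlap described above: read a block of lines serially, then let each thread convert whole lines with a list-directed internal read. The block size, the line length, and the tiny 4 x 5 grid are assumptions for illustration; a scratch file stands in for the real input.

```fortran
program batch_convert
  implicit none
  integer, parameter :: ncols = 4, nrows = 5   ! tiny stand-in sizes
  integer, parameter :: blockRows = 2          ! hypothetical batch size
  integer, parameter :: maxLen = 200           ! long enough for one row here
  character(len=maxLen) :: lines(blockRows)
  real :: grid(ncols, nrows)
  integer :: u, ios, iRow, nRead, k, rowErr

  ! Build a scratch file of space-delimited rows with irregular spacing.
  open(newunit=u, status='scratch', form='formatted')
  do k = 1, nrows
    write(u,'(I0,A)') k, '   -9999 27.532  -26.49'
  end do
  rewind(u)

  iRow = 0
  rowErr = 0
  do
    ! Serial part: read up to blockRows lines as text.
    nRead = 0
    do while (nRead < blockRows .and. iRow + nRead < nrows)
      read(u,'(A)',iostat=ios) lines(nRead+1)
      if (ios /= 0) exit
      nRead = nRead + 1
    end do
    if (nRead == 0) exit

    ! Parallel part: each iteration converts one whole line into one column
    ! of grid, so no two threads write to the same elements.
    !$omp parallel do private(ios) reduction(+:rowErr)
    do k = 1, nRead
      read(lines(k), *, iostat=ios) grid(:, iRow + k)
      if (ios /= 0) rowErr = rowErr + 1
    end do
    !$omp end parallel do
    iRow = iRow + nRead
  end do
  close(u)

  if (rowErr /= 0 .or. iRow /= nrows) error stop 'conversion failed'
  print '(4F10.3)', grid
end program batch_convert
```

Working on whole lines avoids the per-thread bookkeeping needed when a single line is split between threads, at the cost of holding a block of lines in memory. Without the OpenMP option the directives are treated as comments and the program runs serially.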

Jim