Thanks Jim for the nice

OP1 · ‎03-08-2016

My code needs fairly large chunks of memory for some large arrays (GB size). The calculations performed on these arrays are extremely quick, but I am surprised to see that most of the time spent in the code is actually due to Windows very slowly granting the memory requested. I can see in the task manager the memory amount increasing by chunks of about 15MB, iteratively. It's all the more puzzling that I have plenty of free memory on that 32GB machine.

Is there a parameter that controls this behavior somewhere?

Steven_L_Intel1 · ‎03-08-2016

Interesting - I have not observed that sort of behavior before. I suspect that what you're actually seeing is the memory being added to your working set as you access the range. Just allocating takes almost no time at all, but really what you're doing is "reserving" the address space. Not much else really happens until you touch the pages. You could try playing with the Heap Commit Size in the Linker > System properties. I don't think I'd set it to "GB size" initially.

OP1 · ‎03-08-2016

Thanks Lionel - this is probably what happens. I am allocating the memory first; then reading a large file from the disk. What I see increasing is the "Memory (Private Working Set)" value in the task manager. I suppose it is a bit misleading at first glance, if the memory is already committed and available.

jimdempseyatthecove · ‎03-08-2016

OP,

Here is what you can do to speed things up a tad for the first time you access the virtual memory of your process. Note, this does not mean access an address formerly acquired, touched, and returned to the heap. This it the very first access of that page of memory (this requires the overhead to acquire page in page file, map page to RAM, and possibly wipe the RAM on each new page).

ALLOCATE(YourBuffer(YourBufferSize))
COUNT = 0
!$OMP PARALLEL SECTIONS
CALL ReadFile()
!$OMP SECTION
DO I=1,YourBufferSize,PageSizeInUnits / sizeof(YourBuffer(1))
IF(ISNAN(YourBuffer(I))) COUNT = COUNT+1
END DO
IF(COUNT < 0) PRINT *,"Shouldn't happen"
!$OMP END PARALLEL SECTIONS

Note, you are not interested in the count of NaN's since you will be reading memory _prior_ to the data being read from the file. What you are doing here is "touching" that memory (hopefully) in advance of the read trying to access the same memory.

Jim Dempsey

OP1 · ‎03-08-2016

Thanks Jim for the nice example. I will look into this as well.

Johannes_A_ · ‎03-09-2016

Jim,

I was having the similar observations like OP had. Before I can try myself some question: what number to use for PageSizeInUnits ?

Regards, Johannes

jimdempseyatthecove · ‎03-09-2016

On Windows you can use GetSystemInfo to obtain a SYSTEM_INFO structure (or GetNativeSystemInfo)

Page sizes can be 4KB or 4MB (Large), though with PAE you can also have 2MB. If your processor support IA-32e paging support 1GB.

Note, the O/S may determine your "first touch" is a read, and may decide to return a NULL or other nonsense without mapping the page (until write). And it would be somewhat dangerous to assume your current "first touch" writes would outpace the file read. In the case the O/S thwarts you, then run the "first touch" pass writing every 4KB byte/word/dword/whatever, then perform the file read. While this is not happening concurrently, it may permit the file read to occur in larger gulps.

Jim Dempsey

John_Campbell · ‎03-11-2016

A simple program that times the 3 main stages of array allocation, initialisation and usage could be helpful to demonstrate what is happening.

    real*8    :: elapse_sec, t0,t1,t2,t3
    integer*4 :: n = 100000000
    integer*4, allocatable :: k(:)
!
       t0 = elapse_sec ()
!
     allocate ( k(n) )
       t1 = elapse_sec ()
!
     k = 1
       t2 = elapse_sec ()
!
     call read_buffer (k,n)     
       t3 = elapse_sec ()
!
     write (*,*) t1-t0, t2-t1, t3-t2
  end

  real*8 function elapse_sec ()
    integer*8 tick, rate
    call system_clock (tick, rate)
    elapse_sec = dble(tick)/dble(rate)
  end function elapse_sec

  subroutine read_buffer (k,n)     
    integer*4 n, k(n), i
    do i = 1,n
      k(i) = mod(i,4)+1
      k(i) = exp ( log(dble(k(i))) )
    end do
  end subroutine read_buffer

jimdempseyatthecove · ‎03-12-2016

John,

A minor re-write of your example code:

!  sim_read_buffer.f90 
!
!  FUNCTIONS:
!  sim_read_buffer - Entry point of console application.
!

!****************************************************************************
!
!  PROGRAM: sim_read_buffer
!
!  PURPOSE:  Entry point for the console application.
!
!****************************************************************************

    program sim_read_buffer
    use omp_lib
    use kernel32
    
    implicit none
    type(T_SYSTEM_INFO) :: SystemInfo
    
    real*8    :: elapse_sec, t0,t1,t2,t3,t4,t5
    integer*4 :: n = 500000000 ! 500M elements
    integer :: i, iFrom, iTo, nThreads, iPageSize
    integer*4, allocatable :: w(:),k_noTouch(:),k_touch(:)
    integer, volatile :: iTouched
    
    write(*,*) "Initializing OpenMP thread pool"
    ! dummy parallel region used to establish OpenMP thread team
    ! and keeps this overhead out of the equation
    !$omp parallel
    nThreads = omp_get_num_threads()
    !$omp end parallel
    
    write(*,*) "Allocating arrays"
    allocate ( w(n),k_noTouch(n),k_touch(n))
    write(*,*) "Array size in bytes", sizeof(w)
    
    write(*,*) "Creating temp file"
    open(10,status='scratch',form='binary',access='stream',action='readwrite',dispose='delete')
    w(n) = 1
    write(10) w
    write(*,*) "Temp file created"
!
    rewind(10)
    write(*,*) "Untouched array read test"
    t0 = elapse_sec ()
    read(10) k_noTouch
    t1 = elapse_sec ()
    write(*,*) 'noTouch read time', t1-t0
    
    rewind(10)
    write(*,*) "Touched array read test (re-read of noTouch buffer)"
    t0 = elapse_sec ()
    read(10) k_noTouch
    t1 = elapse_sec ()
    write(*,*) 'noTouch after touch read time', t1-t0
    
    rewind(10)
    call GetSystemInfo(SystemInfo)
    write(*,*) "Parallel sections, touch ahead test, page size=", SystemInfo%dwPageSize
    t0 = elapse_sec ()
    iTouched = 0
    
    !$omp parallel sections
    do i=1,n,SystemInfo%dwPageSize/sizeof(w(1)) ! assuming 4KB page size
        k_touch(i) = 0
        iTouched = i
    end do
    iTouched = n
    !$omp section
    iFrom = 1
    do
        iTo = iTouched
        if(iTo .gt. iFrom) then
            read(10) k_touch(iFrom:iTo)
            if(iTo == n) exit
            iFrom = iTo + 1
        endif
    end do
    !$omp end parallel sections
    t1 = elapse_sec ()
    write(*,*) 'With parallel touch ahead read time', t1-t0
    close(10)  

end program sim_read_buffer

  real*8 function elapse_sec ()
    integer*8 tick, rate
    call system_clock (tick, rate)
    elapse_sec = dble(tick)/dble(rate)
  end function elapse_sec

Results:

 Initializing OpenMP thread pool
 Allocating arrays
 Array size in bytes            2000000000
 Creating temp file
 Temp file created
 Untouched array read test
 noTouch read time  0.990000009536743
 Touched array read test (re-read of noTouch buffer)
 noTouch after touch read time  0.637999773025513
 Parallel sections, touch ahead test, page size=        4096
 With parallel touch ahead read time  0.815000057220459
Press any key to continue . . .

Note, the modified program uses 3 buffers for the test. I set n to 500M due to my system having only 16GB. 3*500M*4 = ~6GB and I did not want to get into a swapping situation. Using n=1G, though 12GB wouldn't fit without swapping (other stuff running on system).

Jim Dempsey

Is there a way to speed up memory allocation?