My code needs fairly large chunks of memory for some large arrays (GB size). The calculations performed on these arrays are extremely quick, but I am surprised to see that most of the time spent in the code is actually due to Windows very slowly granting the memory requested. I can see in the task manager the memory amount increasing by chunks of about 15MB, iteratively. It's all the more puzzling that I have plenty of free memory on that 32GB machine.
Is there a parameter that controls this behavior somewhere?
Interesting - I have not observed that sort of behavior before. I suspect that what you're actually seeing is the memory being added to your working set as you access the range. Just allocating takes almost no time at all, but really what you're doing is "reserving" the address space. Not much else really happens until you touch the pages. You could try playing with the Heap Commit Size in the Linker > System properties. I don't think I'd set it to "GB size" initially.
Thanks Lionel - this is probably what happens. I am allocating the memory first; then reading a large file from the disk. What I see increasing is the "Memory (Private Working Set)" value in the task manager. I suppose it is a bit misleading at first glance, if the memory is already committed and available.
Here is what you can do to speed things up a tad for the first time you access the virtual memory of your process. Note, this does not mean access an address formerly acquired, touched, and returned to the heap. This it the very first access of that page of memory (this requires the overhead to acquire page in page file, map page to RAM, and possibly wipe the RAM on each new page).
COUNT = 0
!$OMP PARALLEL SECTIONS
DO I=1,YourBufferSize,PageSizeInUnits / sizeof(YourBuffer(1))
IF(ISNAN(YourBuffer(I))) COUNT = COUNT+1
IF(COUNT < 0) PRINT *,"Shouldn't happen"
!$OMP END PARALLEL SECTIONS
Note, you are not interested in the count of NaN's since you will be reading memory _prior_ to the data being read from the file. What you are doing here is "touching" that memory (hopefully) in advance of the read trying to access the same memory.
Page sizes can be 4KB or 4MB (Large), though with PAE you can also have 2MB. If your processor support IA-32e paging support 1GB.
Note, the O/S may determine your "first touch" is a read, and may decide to return a NULL or other nonsense without mapping the page (until write). And it would be somewhat dangerous to assume your current "first touch" writes would outpace the file read. In the case the O/S thwarts you, then run the "first touch" pass writing every 4KB byte/word/dword/whatever, then perform the file read. While this is not happening concurrently, it may permit the file read to occur in larger gulps.
A simple program that times the 3 main stages of array allocation, initialisation and usage could be helpful to demonstrate what is happening.
real*8 :: elapse_sec, t0,t1,t2,t3 integer*4 :: n = 100000000 integer*4, allocatable :: k(:) ! t0 = elapse_sec () ! allocate ( k(n) ) t1 = elapse_sec () ! k = 1 t2 = elapse_sec () ! call read_buffer (k,n) t3 = elapse_sec () ! write (*,*) t1-t0, t2-t1, t3-t2 end real*8 function elapse_sec () integer*8 tick, rate call system_clock (tick, rate) elapse_sec = dble(tick)/dble(rate) end function elapse_sec subroutine read_buffer (k,n) integer*4 n, k(n), i do i = 1,n k(i) = mod(i,4)+1 k(i) = exp ( log(dble(k(i))) ) end do end subroutine read_buffer
A minor re-write of your example code:
! sim_read_buffer.f90 ! ! FUNCTIONS: ! sim_read_buffer - Entry point of console application. ! !**************************************************************************** ! ! PROGRAM: sim_read_buffer ! ! PURPOSE: Entry point for the console application. ! !**************************************************************************** program sim_read_buffer use omp_lib use kernel32 implicit none type(T_SYSTEM_INFO) :: SystemInfo real*8 :: elapse_sec, t0,t1,t2,t3,t4,t5 integer*4 :: n = 500000000 ! 500M elements integer :: i, iFrom, iTo, nThreads, iPageSize integer*4, allocatable :: w(:),k_noTouch(:),k_touch(:) integer, volatile :: iTouched write(*,*) "Initializing OpenMP thread pool" ! dummy parallel region used to establish OpenMP thread team ! and keeps this overhead out of the equation !$omp parallel nThreads = omp_get_num_threads() !$omp end parallel write(*,*) "Allocating arrays" allocate ( w(n),k_noTouch(n),k_touch(n)) write(*,*) "Array size in bytes", sizeof(w) write(*,*) "Creating temp file" open(10,status='scratch',form='binary',access='stream',action='readwrite',dispose='delete') w(n) = 1 write(10) w write(*,*) "Temp file created" ! rewind(10) write(*,*) "Untouched array read test" t0 = elapse_sec () read(10) k_noTouch t1 = elapse_sec () write(*,*) 'noTouch read time', t1-t0 rewind(10) write(*,*) "Touched array read test (re-read of noTouch buffer)" t0 = elapse_sec () read(10) k_noTouch t1 = elapse_sec () write(*,*) 'noTouch after touch read time', t1-t0 rewind(10) call GetSystemInfo(SystemInfo) write(*,*) "Parallel sections, touch ahead test, page size=", SystemInfo%dwPageSize t0 = elapse_sec () iTouched = 0 !$omp parallel sections do i=1,n,SystemInfo%dwPageSize/sizeof(w(1)) ! assuming 4KB page size k_touch(i) = 0 iTouched = i end do iTouched = n !$omp section iFrom = 1 do iTo = iTouched if(iTo .gt. iFrom) then read(10) k_touch(iFrom:iTo) if(iTo == n) exit iFrom = iTo + 1 endif end do !$omp end parallel sections t1 = elapse_sec () write(*,*) 'With parallel touch ahead read time', t1-t0 close(10) end program sim_read_buffer real*8 function elapse_sec () integer*8 tick, rate call system_clock (tick, rate) elapse_sec = dble(tick)/dble(rate) end function elapse_sec
Initializing OpenMP thread pool Allocating arrays Array size in bytes 2000000000 Creating temp file Temp file created Untouched array read test noTouch read time 0.990000009536743 Touched array read test (re-read of noTouch buffer) noTouch after touch read time 0.637999773025513 Parallel sections, touch ahead test, page size= 4096 With parallel touch ahead read time 0.815000057220459 Press any key to continue . . .
Note, the modified program uses 3 buffers for the test. I set n to 500M due to my system having only 16GB. 3*500M*4 = ~6GB and I did not want to get into a swapping situation. Using n=1G, though 12GB wouldn't fit without swapping (other stuff running on system).