Intel® Fortran Compiler

How can I free the memory used for thread stacks?

Robin_T_
Novice

Hello.

I have a program consisting of two parts. In the first part, I would like to do a parallel computation with a reduction operation on a large array. To do that, I have to increase the thread stack size. Then, in the second part, I need a lot of memory for other purposes. However, a large part of the memory is still tied up by the thread stacks and is not available. How can I free this memory?

I tried freeing it by setting the stack size back to a lower value, but that does not seem to work. Here is a little example, which crashes on my 256 GB machine with ifort v18:

program stacksize_test

	use omp_lib
	implicit none
	
	real * 8, allocatable :: bigparallelarray (:), bigarray (:)
	integer * 8 :: n, error, i, nthreads
	
	n = 20000000000_8 ! 20 billion elements => 160 GB
	allocate (bigarray (n), stat = error)
	if (error .ne. 0) print *, 'could not allocate bigarray'
	bigarray = 1.d0
	deallocate (bigarray)		
	print *, 'array could be allocated and filled'
	
	nthreads = 10
	call KMP_SET_STACKSIZE_S (20000000000_8) ! set stack size to 20 GB	
	allocate (bigparallelarray (n / 10), stat = error)
	if (error .ne. 0) print *, 'could not allocate bigparallelarray'
	
	!$omp parallel do default (shared) num_threads (nthreads) private (i) &
	!$omp& reduction (+: bigparallelarray)
		do i = 1, 10
			bigparallelarray = 2.d0
		enddo
	!$omp end parallel do
	
	deallocate (bigparallelarray)
	
	print *, 'array could be filled in parallel'
	
	call KMP_SET_STACKSIZE_S (4000000_8) ! set stack size to 4 MB
	
	allocate (bigarray (n), stat = error)
	if (error .ne. 0) print *, 'could not allocate bigarray'
	bigarray = 1.d0
	deallocate (bigarray)		
	print *, 'array could be allocated and filled'
	
end program stacksize_test

compiled using

ifort -m64 -fpp -fopenmp -O3 -g -traceback -C -c stacksize_test.f90	
ifort stacksize_test.o -m64 -fpp -fopenmp -O3 -g -traceback -C -o stacksize_test.x

Thanks for your help in advance

jimdempseyatthecove
Honored Contributor III

I suggest you redesign your code such that it does not require a reduction:

!$omp parallel default (shared) num_threads (nthreads) private (i,j)
! note, all threads executing do i
do i = 1, 10
  !$omp do
  do j=1,size(bigparallelarray)
    bigparallelarray(j) = 2.d0
  enddo
  !$omp end do
end do
!$omp end parallel

Note, the above is kept in line with the structure of your original code (i.e. it does not make sense to reduce an array with static values).

The above assumes that bigparallelarray(j) is manipulated in a different manner.

Note 2: do not be overly enthusiastic about incorporating reductions on arrays when they are not necessary (it is a sloppy coding technique that induces unnecessary copies of arrays and subsequent post-region array operations).

Jim Dempsey

TimP
Honored Contributor III
I haven't seen an application use more than 50 MB of thread stack successfully. If successful, the thread stack space should be recovered automatically after leaving the parallel region and after kmp_blocktime expires. Given that "unlimited" stack size may be as small as 16GB and 30 threads is not excessive for many recent platforms, I would not count on omp_stacksize working even at 500 MB, even on a 64-bit platform.
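
For example, something along these lines (an untested sketch; whether the stack memory actually gets returned to the OS once the workers go idle is up to the runtime):

program blocktime_sketch
	use omp_lib
	implicit none
	integer :: i
	real(8) :: s

	s = 0.d0
	!$omp parallel do reduction (+: s)
	do i = 1, 1000
		s = s + dble(i)
	enddo
	!$omp end parallel do

	! ask the worker threads to go to sleep right away instead of
	! spin-waiting for KMP_BLOCKTIME milliseconds (default 200)
	call kmp_set_blocktime (0)

	! ... the serial, memory-hungry phase would follow here ...
	print *, 's =', s
end program blocktime_sketch

Setting the environment variable KMP_BLOCKTIME=0 should have the same effect.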
TimP
Honored Contributor III
I agree with Jim about avoiding array reduction. In my fairly basic tests, scaling of array reduction drops off beyond 2 threads, while alternatives scale at least to the number of cores.
Robin_T_
Novice

Dear Jim, thanks for your quick reply. You suggest circumventing the problem by avoiding the reduction and therefore the increase of the stack size.

If this is possible, I think it is a good suggestion. However, my actual problem is of course more complicated than reducing a static array.

Specifically, it could look something like this (pseudocode):

!$omp parallel do default (shared) private (i, idx, m) reduction (+: array) &
!$omp& num_threads (nthreads)
	do i = 1, n
		m = some_function_of_many_variables1
		allocate (idx (m))
		do j = 1, m
			idx (m) = some_function_of_many_variables2
		enddo
		array (idx) = array (idx) + some_function_of_many_variables3
 		deallocate (idx)
	enddo
!$omp end parallel do

I am not sure how to deal with that without reduction. I have tried implementing a "manual" reduction, i.e. allocating nthreads arrays and summing them up afterwards (roughly along the lines of the sketch below), but that does not get me the desired speedup from parallelization.
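
For reference, this is roughly how I would picture that heap-based manual reduction, boiled down to a toy example (the index and update computations here are trivial stand-ins for the real functions): each thread accumulates into its own heap-allocated copy and merges it once, inside a critical section, at the end of the parallel region.

program manual_reduction_sketch
	use omp_lib
	implicit none
	integer, parameter :: narr = 1000000, n = 100000
	real(8), allocatable :: array (:)   ! shared result
	real(8), allocatable :: local (:)   ! per-thread partial sum, on the heap
	integer :: i, idx

	allocate (array (narr))
	array = 0.d0

	!$omp parallel default (shared) private (i, idx, local)
		allocate (local (narr))          ! heap allocation, not thread stack
		local = 0.d0
		!$omp do
		do i = 1, n
			idx = mod (i, narr) + 1              ! stand-in for some_function_of_many_variables2
			local (idx) = local (idx) + 1.d0     ! stand-in for some_function_of_many_variables3
		enddo
		!$omp end do
		!$omp critical                   ! one merge per thread instead of per iteration
		array = array + local
		!$omp end critical
		deallocate (local)
	!$omp end parallel

	print *, 'total =', sum (array)
end program manual_reduction_sketch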

Tim, I have implemented this with the reduction operation on very large arrays (definitely larger than 500 MB) and it works very nicely. I haven't done rigorous scaling tests, but for 48 processors I get a speed improvement that is definitely much larger than a factor of 2 compared to sequential.

jimdempseyatthecove
Honored Contributor III

Is it correct to assume that your sketch code in #5, line 7 is in error, and that it should have read idx(j) =...?

I assume that the array idx is an integer array that is used to index the array array at arbitrary positions. IOW the specific elements of array(...) are not known until after idx is populated by the do j= loop within the do i= loop.

If you can pre-determine the worst-case maximum value of m, then consider allocating idx once in the parallel region (prior to do i=), then on line 9 use: array(idx(1:m)) = array(idx(1:m))..., (and place the deallocate of idx outside of the do i loop).

Regarding the statement on line 9: is some_function_of_many_variables3 a scalar? If so, consider placing the result into a temporary (private), then using an !$omp critical to update array. This does mean you will be executing n criticals as opposed to nthreads criticals for the array reduction. ***

*** Note that by reducing the time spent in the critical section (through use of the temporary), the probability of multiple threads attempting to enter the critical section at overlapping times is greatly reduced. IOW the overhead of the critical section may be relatively free. In particular, if the execution time of any of the some_function... routines is non-deterministic, then each thread will arrive at the critical section at distributed time intervals (and thus not interfere with other threads competing for the critical section).

From your sketch, reduction does seem the way to go if the critical section incurs too much overhead. You will have to run tests to confirm what the actual critical section time turns out to be.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I forgot to mention...

if all the some_function_... computation times ARE deterministic...
...then on the first iteration all threads will reach the critical section at approximately the same time and thus interfere...
...but on subsequent iterations each thread will be (or may be) skewed by the first-iteration delays of passing through the critical section.

IOW, they will exhibit less contention for the critical section.

Jim Dempsey

Robin_T_
Novice

Dear Jim,

thanks for your reply.

Is it correct to assume that your sketch code in #5, line 7 is in error, and that it should have read idx(j) =...?

Yes, that is correct.

IOW the specific elements of array(...) are not known until after the array idx is populated by the do j= loop within the do i= loop

That is also correct.

You recommend moving the allocation outside of the loop. Is this more general advice not to use allocation and deallocation within a parallel loop, or do you think that this could be the cause of the specific problem? I am asking because implementing this could be a lot of work in the real problem (the allocation is part of a function which is called in many places in the code, etc.).

I have followed your advice now and implemented a critical section instead of a reduction. Something like this:

!$omp parallel do default (shared) private (i, j, idx, m, array_update) &
!$omp& num_threads (nthreads)
	do i = 1, n
		m = some_function_of_many_variables1 (i, ...)
		do j = 1, m
			idx = some_function_of_many_variables2 (i, j, ...)
			array_update = some_function_of_many_variables3 (i, j, ...)
			!$omp critical
			array (idx) = array (idx) + array_update 
			!$omp end critical
		enddo
	enddo
!$omp end parallel do

I have also tried with an atomic section instead:

!$omp parallel do default (shared) private (i, j, idx, m, array_update) &
!$omp& num_threads (nthreads)
	do i = 1, n
		m = some_function_of_many_variables1 (i, ...)
		do j = 1, m
			idx = some_function_of_many_variables2 (i, j, ...)
			array_update = some_function_of_many_variables3 (i, j, ...)
			!$omp atomic
			array (idx) = array (idx) + array_update 
		enddo
	enddo
!$omp end parallel do

I hope there are no mistakes in my pseudocode this time.

Then I compared the speedup of the different versions against the sequential computation for a small problem in my actual code, with nthreads = 48. I get:

Reduction clause: speedup factor 17

"Manual" reduction: speedup factor 6

atomic clause: speedup factor 3

critical clause: speedup factor << 1 (still running at the time of submission of this post and the sequential walltime was only a minute)

So we see that unfortunately both the critical and the atomic clause are not of much use here, even though the operation within the clause is just an addition. Maybe it is because the array is very large and it is expensive to access element idx?

So I guess for now I will just go back to the manual reduction and have it run three times slower. This at least seems to avoid the memory "leak" (don't know if that's the right term here) from the thread stacks, which is what made me post this in the first place.

But if anyone has suggestions on how to efficiently circumvent the reduction clause or fix the memory issues, those are very welcome of course.

jimdempseyatthecove
Honored Contributor III

>>You recommend some allocation outside of the loop. Is this more of a general advice not to use allocation and deallocation within a parallel loop?

General optimization advice (reduce the number of unnecessary allocate/deallocate calls).

What I meant was

! *** idx and array_update are .NOT. allocated at this point
! *** use firstprivate to copy in to parallel region the not allocated array descriptors
! *** remove "do"
!$omp parallel default (shared) private (i, j, m, array_update) firstprivate(idx) &
!$omp& num_threads (nthreads)
  ! all/each thread allocates private array
  allocate(idx(size(array))) ! once, *** verify that this is large enough for your purposes
  !$omp do
  do i = 1, n
    m = some_function_of_many_variables1 (i, ...)
    do j = 1, m
      idx(1:m) = some_function_of_many_variables2 (i, j, ...)
      array_update = some_function_of_many_variables3 (i, j, ...)
      !$omp critical
      array (idx(1:m)) = array (idx(1:m)) + array_update 
      !$omp end critical
    enddo
  enddo
  !$omp end do
  deallocate(idx)
!$omp end parallel

Jim Dempsey
