Hello.
I have a program consisting of two parts. In the first part, I would like to do a parallel computation with a reduction operation on a large array; to do that, I have to increase the thread stack size. In the second part, I then need a lot of memory for other purposes. However, a large part of the memory is still tied up in the thread stacks and cannot be used. How can I free this memory?
I tried freeing it by setting the stack size back to a lower value, but that does not seem to work. Here is a little example, which crashes on my 256 GB machine with ifort v18:
program stacksize_test
   use omp_lib
   implicit none
   real*8, allocatable :: bigparallelarray (:), bigarray (:)
   integer*8 :: n, error, i, nthreads

   n = 20000000000   ! 20 bil. elements => 160 GB
   allocate (bigarray (n), stat = error)
   if (error .ne. 0) print *, 'could not allocate bigarray'
   bigarray = 1.d0
   deallocate (bigarray)
   print *, 'array could be allocated and filled'

   nthreads = 10
   call KMP_SET_STACKSIZE_S (20000000000)   ! set stacksize to 20 GB
   allocate (bigparallelarray (n / 10), stat = error)
   if (error .ne. 0) print *, 'could not allocate bigparallelarray'
   !$omp parallel do default (shared) num_threads (nthreads) private (i) &
   !$omp& reduction (+: bigparallelarray)
   do i = 1, 10
      bigparallelarray = 2.d0
   enddo
   !$omp end parallel do
   deallocate (bigparallelarray)
   print *, 'array could be filled in parallel'

   call KMP_SET_STACKSIZE_S (4000000)   ! set stacksize to 4 MB
   allocate (bigarray (n), stat = error)
   if (error .ne. 0) print *, 'could not allocate bigarray'
   bigarray = 1.d0
   deallocate (bigarray)
   print *, 'array could be allocated and filled'
end program stacksize_test
compiled using
ifort -m64 -fpp -fopenmp -O3 -g -traceback -C -c stacksize_test.f90
ifort stacksize_test.o -m64 -fpp -fopenmp -O3 -g -traceback -C -o stacksize_test.x
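As a side note, whether each KMP_SET_STACKSIZE_S call actually took effect can be checked by reading the setting back. This is only a minimal sketch; I am assuming the Intel extensions KMP_GET_STACKSIZE_S and the kind constant kmp_size_t_kind are available from omp_lib:

program check_stacksize
   use omp_lib
   implicit none
   integer (kind = kmp_size_t_kind) :: sz

   ! request a 20 GB thread stack, then read the setting back
   call kmp_set_stacksize_s (20000000000_kmp_size_t_kind)
   sz = kmp_get_stacksize_s ()
   print *, 'requested stack size: ', sz

   ! note: as far as I understand, the setting only applies to worker
   ! threads created after this call; threads already in the pool keep
   ! their existing stacks
end program check_stacksize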
Thanks for your help in advance
I suggest you redesign your code such that it does not require a reduction:
!$omp parallel default (shared) num_threads (nthreads) private (i,j)
! note, all threads executing do i
do i = 1, 10
   !$omp do
   do j = 1, size (bigparallelarray)
      bigparallelarray(j) = 2.d0
   enddo
   !$omp end do
end do
!$omp end parallel
Note, the above is kept in line with the structure of your original code (i.e. it does not make sense to reduce an array with static values).
The above assumes that bigparallelarray(j) is manipulated in a different manner.
Note 2: do not be overly enthusiastic about incorporating reductions on arrays when they are not necessary; it is sloppy coding to induce unnecessary copies of arrays and the subsequent post-region array operations.
Jim Dempsey
Dear Jim, thanks for your quick reply. You suggest circumventing the problem by avoiding the reduction, and with it the increase of the stack size.
If this is possible, I think it is a good suggestion. However, my actual problem is of course more complicated than reducing a static array.
Specifically, it could look something like this (pseudocode):
!$omp parallel do default (shared) private (i, idx, m) reduction (+: array) &
!$omp& num_threads (nthreads)
do i = 1, n
   m = some_function_of_many_variables1
   allocate (idx (m))
   do j = 1, m
      idx (m) = some_function_of_many_variables2
   enddo
   array (idx) = array (idx) + some_function_of_many_variables3
   deallocate (idx)
enddo
!$omp end parallel do
I am not sure how to deal with that without a reduction. I have tried implementing a "manual" reduction, i.e. allocating nthreads arrays and summing them up afterwards, but that does not get me the desired parallel speedup.
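For reference, by "manual" reduction I mean something like the following sketch, with a trivial scattered update standing in for the real index and value functions:

! sketch of a "manual" reduction: every thread accumulates into its own
! column of a shared work array, and the columns are summed after the loop;
! the per-thread copies are ordinary heap allocations, so no thread-stack
! increase is needed
program manual_reduction
   use omp_lib
   implicit none
   integer, parameter :: n = 1000
   integer :: nthreads, i, t
   real*8, allocatable :: array (:), partial (:, :)

   nthreads = 4
   allocate (array (n), partial (n, nthreads))
   partial = 0.d0

   !$omp parallel num_threads (nthreads) private (t, i)
   t = omp_get_thread_num () + 1
   !$omp do
   do i = 1, 10 * n
      ! scattered update: different i can hit the same element, which is
      ! why each thread needs its own copy; mod(i, n) + 1 stands in for
      ! the real index function
      partial (mod (i, n) + 1, t) = partial (mod (i, n) + 1, t) + 1.d0
   enddo
   !$omp end do
   !$omp end parallel

   array = sum (partial, dim = 2)   ! the "manual" reduction step
end program manual_reduction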
Tim, I have implemented this with the reduction operation on very large arrays (definitely larger than 500 MB) and it works very nicely. I haven't done rigorous scaling tests, but with 48 processors I get a speed improvement that is definitely much larger than a factor of 2 compared to sequential.
Is it correct to assume that your sketch code in #5, line 7 is in error, and that it should have read idx(j) =...?
I assume that idx is an integer array that is used to arbitrarily index the array array. IOW, the specific elements of array(...) are not known until after idx is populated by the do j= loop within the do i= loop.
If you can pre-determine the worst-case maximum value of m, then consider allocating idx once in the parallel region (prior to do i=); then on line 9 use array(idx(1:m)) = array(idx(1:m)) ..., and place the deallocate of idx outside the do i loop.
Regarding the statement on line 9: is some_function_of_many_variables3 a scalar? If so, consider placing the result into a (private) temporary, then using an !$omp critical to update array. This does mean you will be executing n criticals as opposed to the nthreads criticals of the array reduction. ***
*** Note that by reducing the time spent in the critical section (by use of the temp), the probability of multiple threads attempting to enter the critical section at overlapping times is greatly reduced. IOW, the overhead of the critical section may be relatively free. In particular, if the execution times of any of the some_function... calls are non-deterministic, then the threads will arrive at the critical section at distributed time intervals (and thus not interfere with other threads competing for the critical section).
From your sketch, the reduction does seem the way to go if the critical section is too large an overhead. You will have to run a test to confirm what the actual critical section time turns out to be.
Jim Dempsey
I forgot to mention...
If all some_function_... computation times ARE deterministic...
...then on the first iteration all threads will reach the critical section at approximately the same time and thus interfere...
...but on subsequent iterations each thread will (may be) skewed by the first-iteration delays of passing through the critical section.
IOW, they will exhibit less conflict over the critical section.
Jim Dempsey
Dear Jim,
thanks for your reply.
Is it correct to assume that your sketch code in #5, line 7 is in error, and that it should have read idx(j) =...?
Yes, that is correct.
IOW the specific elements of array(...) are not known until after the array idx is populated by the do j= loop within the do i= loop
That is also correct.
You recommend some allocation outside of the loop. Is this general advice not to use allocation and deallocation within a parallel loop, or do you think that this could be the cause of the specific problem? I am asking because implementing this could be a lot of work in the real problem (the allocation is part of a function which is called in many places in the code, etc.).
I have now followed your advice and implemented a critical section instead of a reduction. Something like this:
!$omp parallel do default (shared) private (i, j, idx, m, array_update) &
!$omp& num_threads (nthreads)
do i = 1, n
   m = some_function_of_many_variables1 (i, ...)
   do j = 1, m
      idx = some_function_of_many_variables2 (i, j, ...)
      array_update = some_function_of_many_variables3 (i, j, ...)
      !$omp critical
      array (idx) = array (idx) + array_update
      !$omp end critical
   enddo
enddo
!$omp end parallel do
I have also tried with an atomic construct instead:
!$omp parallel do default (shared) private (i, j, idx, m, array_update) &
!$omp& num_threads (nthreads)
do i = 1, n
   m = some_function_of_many_variables1 (i, ...)
   do j = 1, m
      idx = some_function_of_many_variables2 (i, j, ...)
      array_update = some_function_of_many_variables3 (i, j, ...)
      !$omp atomic
      array (idx) = array (idx) + array_update
   enddo
enddo
!$omp end parallel do
I hope there are no mistakes in my pseudocode this time.
Then I compared the speedup of the different versions against the sequential computation, for a small problem in my actual code with nthreads = 48. I get:
Reduction clause: speedup factor 17
"Manual" reduction: speedup factor 6
atomic clause: speedup factor 3
critical clause: speedup factor << 1 (still running at the time of submission of this post and the sequential walltime was only a minute)
So we see that both the critical and atomic constructs are unfortunately not of much use here, even though the operation inside them is just an addition. Maybe it is because the array is very large and it is expensive to access element idx?
So I guess for now I will just go back to the manual reduction and have it run three times slower. This at least seems to avoid the memory "leak" (I don't know if that's the right term here) from the thread stacks, which made me post this in the first place.
But if anyone has suggestions on how to efficiently circumvent the reduction clause or fix the memory issues, those are very welcome of course.
>>You recommend some allocation outside of the loop. Is this more of a general advice not to use allocation and deallocation within a parallel loop?
General optimization advice (reduce the number of unnecessary allocate/deallocate calls).
What I meant was
! *** idx and array_update are .NOT. allocated at this point
! *** use firstprivate to copy the unallocated array descriptors into the parallel region
! *** remove "do" from the parallel directive
!$omp parallel default (shared) private (i, j, m, array_update) firstprivate (idx) &
!$omp& num_threads (nthreads)
! all/each thread allocates a private array
allocate (idx (size (array)))   ! once; *** verify that this is large enough for your purposes
!$omp do
do i = 1, n
   m = some_function_of_many_variables1 (i, ...)
   do j = 1, m
      idx (1:m) = some_function_of_many_variables2 (i, j, ...)
      array_update = some_function_of_many_variables3 (i, j, ...)
      !$omp critical
      array (idx (1:m)) = array (idx (1:m)) + array_update
      !$omp end critical
   enddo
enddo
!$omp end do
deallocate (idx)
!$omp end parallel
Jim Dempsey