- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
The following code crashes even when using single thread. Compilation is done with -heap_array options. The problem is caused by v3 being private (if v3 is shared no crash). As code indicates the allocation of v3 is successful, the threaded part of the program crashes. This happens even when only one thread is running. Increasing OMP_STACKSIZE does not solve the problem, it seems that the global stacksize is exceeding the limit. Using unlimted stacksize does not crash, but end user does not want to use that option. Why is the program using global stacksize in the openmp loop? The same program without OpenMP directives does not crash.
What is the solution here? How can we force the private array to stay in heap? Or make threads to use thread stacksize and not global stacksize.
The PGI fortran compiler does not crash by the way, they have implemented a solution that works. What is ifort solution for this problem? Don't say unlimited stacksize. I don't know how big the arrays can get in advance.
program my_prog
implicit none
integer, parameter:: M = 1800000
call test_ifort_openmp(M);
end program my_prog
subroutine test_ifort_openmp(M)
implicit none
integer, intent(in) :: M
COMPLEX*8 :: U(M)
integer :: i
COMPLEX*8, allocatable :: v3(:)
!$ INTEGER omp_get_num_threads
!$ EXTERNAL omp_get_num_threads
allocate (v3(M));
U = 0.0;
v3 = 0.0;
print *, 'allocation successful'
!$ CALL omp_set_num_threads(1)
!$omp parallel do ordered private(v3) &
!$omp shared (U,M) &
!$omp default(none) &
!$omp schedule(dynamic)
DO i=1,10 !---------------------main loop
!$ print *, 'num threads', omp_get_num_threads()
v3 = exp(cmplx(0.0_4,1.0_4));
!$omp ordered
U = U + v3
!$omp end ordered
END DO !---------------------end main loop
!$omp end parallel do
print *,'done ordered summation'
end subroutine test_ifort_openmp
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
program my_prog implicit none integer, parameter:: M = 1800000 call test_ifort_openmp(M); end program my_prog subroutine test_ifort_openmp(M) implicit none integer, intent(in) :: M COMPLEX*8 :: U(M) integer :: i COMPLEX*8, allocatable :: v3(:) !$ INTEGER omp_get_num_threads !$ EXTERNAL omp_get_num_threads !**** do later allocate (v3(M)); U = 0.0; !**** do later v3 = 0.0; print *, 'allocation successful' !$ CALL omp_set_num_threads(1) ! **** start parallel region WITHOUT d ! **** change private(v3) to firstprivate(v3)... ! **** to copy in the unallocated array descriptor !$omp parallel firstprivate(v3) & !$omp shared (U,M) & !$omp default(none) ! **** do the allocation inside the parallel region allocate (v3(M)) v3 = 0.0 ! *** now the DO without parallel !$omp do ordered & !$omp schedule(dynamic) DO i=1,10 !---------------------main loop !$ print *, 'num threads', omp_get_num_threads() v3 = exp(cmplx(0.0_4,1.0_4)); !$omp ordered U = U + v3 !$omp end ordered END DO !---------------------end main loop ! *** omit the parallel !$omp end do deallocate(v3) !*** now exit parallel !$omp end parallel print *,'done ordered summation' end subroutine test_ifort_openmp
Jim Dempsey
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear,
Regarding the stack (under Linux at least), the main thread uses the main stack (for which the size can be modified using the command ulimit) while the slave threads have their own thread stack. The size of these latter ones can be modified with the env variable OMP_STACKSIZE.
A solution to avoid the use of stacks is to manually allocate an array with the same number of rows as the array in the private statement and with the same number of columns as the number of threads. This array is then used in the loop instead of the private array.
Yours sincerely,
Jeremie
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
yenikaya, bayram wrote:
.. COMPLEX*8 :: U(M) ..
Do you really mean for an automatic array of such a big size on the stack, as opposed to a scalar, for a reduction result of your ordered summation?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is not the real program, just a testcase to demonstrate the crash.
In real program the reduction is in an array, but each thread needs to use some large array as private for computation.
Can Intel compiler put the “large private arrays” used by each thread on the heap instead of stack? It seems that it is putting private arrays in stack, and that requires a large stacksize. Our end user do not like that.
I know I can allocate as much memory as needed for each thread, but that requires lots of code change. I have a code that works with PGI without any problems and crashes with Intel fortran. It looks like PGI fortran is putting private arrays on the heap, but intel is putting them on the stack. Is there a way to tell intel compiler to put them on the heap? -heap_arrays does not do it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
yenikaya, bayram wrote:.. intel is putting them on the stack. Is there a way to tell intel compiler to put them on the heap? -heap_arrays does not do it.
Contact Intel support? https://supporttickets.intel.com/?lang=en-US
They might be able to explain why this automatic array is not allocated on the heap with -heap-arrays option, or look into any issues/bugs with the option.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
program my_prog implicit none integer, parameter:: M = 1800000 call test_ifort_openmp(M); end program my_prog subroutine test_ifort_openmp(M) implicit none integer, intent(in) :: M COMPLEX*8 :: U(M) integer :: i COMPLEX*8, allocatable :: v3(:) !$ INTEGER omp_get_num_threads !$ EXTERNAL omp_get_num_threads !**** do later allocate (v3(M)); U = 0.0; !**** do later v3 = 0.0; print *, 'allocation successful' !$ CALL omp_set_num_threads(1) ! **** start parallel region WITHOUT d ! **** change private(v3) to firstprivate(v3)... ! **** to copy in the unallocated array descriptor !$omp parallel firstprivate(v3) & !$omp shared (U,M) & !$omp default(none) ! **** do the allocation inside the parallel region allocate (v3(M)) v3 = 0.0 ! *** now the DO without parallel !$omp do ordered & !$omp schedule(dynamic) DO i=1,10 !---------------------main loop !$ print *, 'num threads', omp_get_num_threads() v3 = exp(cmplx(0.0_4,1.0_4)); !$omp ordered U = U + v3 !$omp end ordered END DO !---------------------end main loop ! *** omit the parallel !$omp end do deallocate(v3) !*** now exit parallel !$omp end parallel print *,'done ordered summation' end subroutine test_ifort_openmp
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Or:
program my_prog implicit none integer, parameter:: M = 1800000 call test_ifort_openmp(M); end program my_prog subroutine test_ifort_openmp(M) implicit none integer, intent(in) :: M COMPLEX*8 :: U(M) integer :: i !$ INTEGER omp_get_num_threads !$ EXTERNAL omp_get_num_threads U = 0.0; print *, 'allocation successful' !$ CALL omp_set_num_threads(1) !$omp parallel call doWork() !$omp end parallel print *,'done ordered summation' contains subroutine doWork() implicit none integer :: i COMPLEX*8, allocatable :: v3(:) allocate (v3(M)) v3 = 0.0 !$omp do ordered schedule(dynamic) DO i=1,10 !---------------------main loop !$ print *, 'num threads', omp_get_num_threads() v3 = exp(cmplx(0.0_4,1.0_4)); !$omp ordered U = U + v3 !$omp end ordered END DO !---------------------end main loop !$omp end do deallocate(v3) ! optional end subroutine doWork end subroutine test_ifort_openmp
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Jim, great idea. How can I do it if v3 was an input array? Defined like this.
COMPLEX*8, intent(inout) :: v3(M)
or automatic array defined like this
COMPLEX*8 :: v3(M)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Thanks Jim, great idea. How can I do it if v3 was an input array?
This depends on how you access V3. IIF (If and only if) V3, and U for that matter are accesses as such above (each element in V3 updates corresponding elements in U and each element in V3 is independent of any other elements in V3...
program my_prog implicit none integer, parameter:: M = 1800000 call test_ifort_openmp(M); end program my_prog subroutine test_ifort_openmp(M) use omp_lib !**** add this implicit none integer, intent(in) :: M COMPLEX*8 :: U(M) integer :: i COMPLEX*8, allocatable :: v3(:) allocate (v3(M)); ! *** size(U) == size(V3) .and. M large enough to slice U = 0.0; v3 = 0.0; print *, 'allocation successful' !$ CALL omp_set_num_threads(1) ! *** use more threads?? ! **** start parallel region !$omp parallel shared(v3) & !$omp shared (U,M) & !$omp private(i,j) & !$omp default(none) DO i=1,10 !---------------------main loop ! Note i is private, this loop contains omp do, and omp do has ! implicit barrier at end. Therefor each threads i will be in lock-step !$omp do & !$omp schedule(static) DO j=1,M v3(J) = exp(cmplx(0.0_4,1.0_4)); U(J) = U(J) + v3(J) END DO ! ------- end slice of V3 and U !$omp end do END DO !---------------------end main loop !$omp end parallel print *,'done slice summation' deallocate(v3) ! optional with newer versions of Fortran end subroutine test_ifort_openmp
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
IOW your initial way was to do batches in parallel with ordered accumulation (parallel ordered pipeline-esk)
The #9 method is to do batches in order and parallel slice each batch. Also note in the #9 case the main loop can reside within the parallel region thus reducing the number of parallel regions entry/exit. Your actual program may need additional programming considerations
END DO ! ------- end slice of V3 and U !$omp end do !$omp master ... write something to output !$omp end master END DO !---------------------end main loop !$omp end parallel
The selection of method greatly depends on size of M (and if internal algorithm index varying dependencies within V3, U, ...)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here is a more realistic case. v3 is passed in because most times this code is ran single threaded, we want to reuse the memory instead of allocate deallocate. v3 must be private when threaded. The following code crashes due to v3 being put on stack. subroutine test_ifort_openmp4(M,k,H,U,v3) implicit none integer, intent(in) :: M,k COMPLEX*8, intent(out) :: U(M) COMPLEX*8, intent(in) :: H(M,k) COMPLEX*8, intent(out) :: v3(M) integer :: i U = 0.0; v3 = 0.0; !$omp parallel do ordered private(v3) & !$omp shared (U,M,H,k) & !$omp default(none) & !$omp schedule(dynamic) DO i=1,k !---------------------main loop call fft(H(:,i), v3); !$omp ordered U = U + v3*conjg(v3); !$omp end ordered END DO !---------------------end main loop !$omp end parallel do end subroutine test_ifort_openmp4
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>v3 must be private when threaded.
conflicts with
>>we want to reuse the memory instead of allocate deallocate.
>>U = U + v3*conjg(v3);
Depending on the version of the compiler .AND. optimization used/not used, the above may (or may not) be performed without creating a temporary array (on stack when -heap-arrays not used). This may aggregate your crash situation.
There is a runtime check option that will indicate if an array temporary is created.
If it is created, then you can manually bypass the array temporary by expressing that as a DO loop using V3(index)%RE and %IM...
... however this should not be necessary as the compiler optimization (now-a-days) should be able to avoid the array temporary and vectorize this code.
Consider this:
subroutine test_ifort_openmp4(M,k,H,U,v3) implicit none integer, intent(in) :: M,k COMPLEX*8, intent(out) :: U(M) COMPLEX*8, intent(in) :: H(M,k) COMPLEX*8, intent(out) :: v3(M) !$OMP THREADPRIVATE(privateV3) COMPLEX*8, ALLOCATABLE, SAVE :: privateV3(:) ! each thread will have different copy integer :: i ! *** do NOT use: IF(.not.allocated(privateV3) .or. size(privateV3) .lt. M) ! *** Fortran is NOT C and will produce results of both sides of .or. ! Now allocate only when necessary if(.not.allocated(privateV3)) then ALLOCATE(privateV3(M)) else if(size(privateV3) .lt. M) then DEALLOCATE(privateV3) ALLOCATE(privateV3(M)) endif endif U = 0.0 ! remove the ; privatev3(1:M) = 0.0D0 ! privateV3 may be larger than M !$omp parallel do ordered & !$omp shared(v3, privatev3) & !$omp shared (U,M,H,k) & !$omp default(none) & !$omp schedule(dynamic) DO i=1,k !---------------------main loop call fft(H(:,i), privatev3); !$omp ordered U = U + privatev3(1:M)*conjg(privatev3(1:M)); !$omp end ordered if(i .eq. k) V3 = privatev3(1:M) END DO !---------------------end main loop !$omp end parallel do ! *** Note, privateV3 hangs around for next call ! *** will show up as memory leak with valgrind (which you can ignore) end subroutine test_ifort_openmp4
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Jim. This gives enough ideas to develop a workaround.
Memory reuse for v3 is for non-openmp applications.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>This may aggregate your crash situation
giggling.....
In some sense, in light of the program being parallel, aggregate may be a better choice of words than aggravate.
BTW the effort is not a workaround. Rather it is an effort to properly, and efficiently, parallelize this procedure.
Good luck in your efforts.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks. BTW, PGI compiler does this properly. "Put array on the heap" means put array on the heap. When heap array is enabled by PGI compiler, all local arrays and all arrays private to threads are put on the heap.
Intel compiler does not put local arrays private to each thread on the heap, when you use -heap_arrays. If they did that properly, we would not need to try so hard to find a solution.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page