Solved: OpenMP workshare bug with ifx 2025.0

Harald1 · ‎11-01-2024

Hello,

the following reduced code works with previous versions of ifort and ifx, but crashes with a division by 0 in the OpenMP workshare construct at the indicated assignment when run with more than 1 OpenMP thread:

! Crashes during runtime with ifx-2025.0 when compiled with:
! ifx ifx-omp-workshare.f90 -fopenmp -g -traceback -fpe0 -O -fpe0
program test
  implicit none
  integer, parameter    :: wp = kind (1.d0)
  real(wp), allocatable :: x(:,:), f(:,:), d2(:,:)
  integer               :: np, n, k, m
  np = 4225
  n  = 60
  allocate (x(np,n), f(np,n), d2(np,n))
  call random_number (f)
  x = f * 0.5_wp
  do k = 2, n
     x(:,k) = x(:,k) + k
  end do
  do m = 1, 2
     print *, "Outer iteration", m
     call spline_init_vec (x, f, d2, np, n)
  end do
contains
  subroutine spline_init_vec (x, y, y2, np, n)
    integer  ,intent(in)           :: np
    integer  ,intent(in)           :: n
    real(wp) ,intent(in)           :: x (np,n)
    real(wp) ,intent(in)           :: y (np,n)
    real(wp) ,intent(out)          :: y2(np,n)
    !-----------------------------------------
    integer  :: i, j, k
    real(wp) :: p(np), sig(np)
    real(wp), allocatable :: u(:,:)

    allocate (u(np,n-1))

    do j = 1, np
       y2(j,1) = 0._wp
       u (j,1) = 0._wp
    end do

!$omp parallel private(sig,p)
    do i = 2,n-1
!$omp master
write(0,*) "### i=", i
!$omp end master

!$omp workshare
       sig(:) = (x(:,i)-x(:,i-1))/(x(:,i+1)-x(:,i-1))
       p(:)   = sig(:)*y2(:,i-1)+2._wp
       y2(:,i)=(sig(:)-1._wp)/p(:)
       u (:,i)=(6._wp*((y(:,i+1)-y(:,i))/(x(:,i+1)-x(:,i))    &
                      -(y(:,i)-y(:,i-1))/(x(:,i)-x(:,i-1))) / &
                       (x(:,i+1)-x(:,i-1))                    &
                - sig(:)*u(:,i-1))/p(:)    ! <- bogus division by 0 here
!$omp end workshare

    end do
!$omp end parallel

  end subroutine
end

I get (using the indicated compile options above):

% OMP_NUM_THREADS=2 ./a.out 
 Outer iteration           1
 ### i=           2
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
libpthread-2.31.s  0000152A15F01910  Unknown               Unknown  Unknown
a.out              000000000040582B  test_IP_spline_in          52  ifx-omp-workshare.f90
libiomp5.so        0000152A15B45603  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        0000152A15ACA633  Unknown               Unknown  Unknown
libiomp5.so        0000152A15AC3B28  __kmp_fork_call       Unknown  Unknown
libiomp5.so        0000152A15A84B88  __kmpc_fork_call      Unknown  Unknown
a.out              00000000004050A1  spline_init_vec            39  ifx-omp-workshare.f90
a.out              0000000000404DF9  test                       18  ifx-omp-workshare.f90
a.out              000000000040476D  Unknown               Unknown  Unknown
libc-2.31.so       0000152A1583E24D  __libc_start_main     Unknown  Unknown
a.out              000000000040469A  Unknown               Unknown  Unknown
Aborted (core dumped)

This happens on OpenSUSE 15.5.

Is there anything I might miss?

Thanks,

Harald

TobiasK · ‎11-07-2024

@Harald1

No, the workshare construct was implemented long ago. It may just be luck that it did work before.

see https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf page 246:

8 How units of work are assigned to the threads that execute a workshare region is unspecified.

9 If an array expression in the block references the value, association status, or allocation status of

10 private variables, the value of the expression is undefined, unless the same value would be

11 computed by every thread.

Since you privatize the arrays sig and p only parts of those arrays contain useful numbers for each thread. Now if the runtime decides to distribute the next access to sig and p in a different way among threads, you are likely getting uninitialized data.

If you do not privatize the arrays, those arrays are filled completely and it will work. But again, for performance reasons I would rewrite it with the loop and use sig and p as scalars and let the compiler vectorize the inner loop (!$omp do simd)

Best

Tobias

View solution in original post

TobiasK · ‎11-05-2024

@Harald1

are you sure you want to privatize arrays sig and p?

Harald1 · ‎11-05-2024

Are you saying this is wrong?

sig and p are just work arrays with their effective lifetime being the omp parallel region.

Writing out that region by plain loops would need corresponding private temporaries.

TobiasK · ‎11-06-2024

!$omp parallel
    do i = 2,n-1
!$omp master
write(0,*) "### i=", i
!$omp end master
!$omp do
 do j=1,np
       sig(j) = (x(j,i)-x(j,i-1))/(x(j,i+1)-x(j,i-1))
       p(j)   = sig(j)*y2(j,i-1)+2._wp
       y2(j,i)=(sig(j)-1._wp)/p(j)
       u (j,i)=(6._wp*((y(j,i+1)-y(j,i))/(x(j,i+1)-x(j,i))    &
                      -(y(j,i)-y(j,i-1))/(x(j,i)-x(j,i-1))) / &
                       (x(j,i+1)-x(j,i-1))                    &
                - sig(j)*u(j,i-1))/p(j)    ! <- bogus division by 0 here


    end do
!$omp end parallel

I would rewrite the loop without a workshare construct. I also don't see why sig and p need to be private, actually, I don't see why sig and p need to be arrays in the first place. Scalars would need to be private, though.

Harald1 · ‎11-06-2024

You evaded my previous question: is the source code wrong, or is the problem elsewhere?

My understanding of workshare is that one can keep concise array expressions and does not need to do the compiler's work.

Playing some more, I found that replacing the declaration of u (explicit allocation on the heap) by an automatic array (!) makes the code work here, i.e.

replacing:

    real(wp), allocatable :: u(:,:)
    allocate (u(np,n-1))

by:

    real(wp) :: u(np,n-1)

Of course this is not wanted for the full application, as the array may be too large for the stack.

This seems to suggest that sig and p are actually not the cause of the problem.

So maybe the reason that this code did not have a problem with previous versions might be Intel simply did not implement workshare before 2025.0 (just guessing)?

TobiasK · ‎11-07-2024

@Harald1

No, the workshare construct was implemented long ago. It may just be luck that it did work before.

see https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf page 246:

8 How units of work are assigned to the threads that execute a workshare region is unspecified.

9 If an array expression in the block references the value, association status, or allocation status of

10 private variables, the value of the expression is undefined, unless the same value would be

11 computed by every thread.

Since you privatize the arrays sig and p only parts of those arrays contain useful numbers for each thread. Now if the runtime decides to distribute the next access to sig and p in a different way among threads, you are likely getting uninitialized data.

If you do not privatize the arrays, those arrays are filled completely and it will work. But again, for performance reasons I would rewrite it with the loop and use sig and p as scalars and let the compiler vectorize the inner loop (!$omp do simd)

Best

Tobias

Harald1 · ‎11-07-2024

Ah, I see. I was under the impression that when using schedule(static) this would also apply to workshare regions, with same chunks for same loop lengths.

(I know one cannot set the schedule in the directive here, but omp_get_schedule confirms that it is set to static.)

So if the compiler "sees" the declarations of the first index of all arrays, it just works accidentally with "private" (maybe due to loop fusion), otherwise it will treat the assignment to u separately and somehow use a different choice.

I am going to change the code suitably.

Thanks for the insight!

Harald