Intel® Fortran Compiler

Do Concurrent offload getting incorrect results

caplanr
New Contributor I

Hi,

 

I have a code that uses "do concurrent" (DC) loops for offload to Intel GPUs.

 

The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):

 

do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

However, if I modify the code to make the outermost loop sequential, the code does yield the correct result:


do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

 

It seems the compiler does not like having the reduction loop inside a DC loop (or maybe just nesting DC loops within DC loops?).
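
In case it helps, here is a minimal, self-contained sketch of the same pattern (a DC outer loop containing a DC reduction loop). The array sizes and data are made up purely to illustrate the structure, and the scalar "s" deliberately carries no locality specifier, mirroring fn2_fn1/fs2_fs1 above:

program nested_dc_reduce
  implicit none
  integer, parameter :: nr = 4, np = 64
  integer :: i, k
  real(8) :: a(np,nr), y_seq(nr), y_dc(nr), s

  call random_number(a)

  ! Reference: sequential outer loop, DC reduction loop inside (the working case).
  do i=1,nr
    s = 0.0d0
    do concurrent (k=1:np) reduce(+:s)
      s = s + a(k,i)
    enddo
    y_seq(i) = s
  enddo

  ! Same pattern with a DC outer loop (the pattern that fails on the Intel GPU
  ! in the real code); s has no locality specifier, as in the original loops.
  do concurrent (i=1:nr)
    s = 0.0d0
    do concurrent (k=1:np) reduce(+:s)
      s = s + a(k,i)
    enddo
    y_dc(i) = s
  enddo

  print *, 'max abs difference: ', maxval(abs(y_dc - y_seq))
end program nested_dc_reduce

Comparing y_dc against y_seq should show whether the nesting alone is enough to trigger the wrong results.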

 

The NVIDIA compiler handles this by parallelizing the outer loop across thread blocks and the two inner loops across threads:

7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x

 

I wanted to bring this to your attention. For now, I may make the outer loop sequential, since "nr" is often small.

 -- Ron

 

 

23 Replies
caplanr
New Contributor I

Hi,

 
Thanks for the info!
 
If both "shared" and "local_init" worked, then does that mean the default behavior was to use "local"?
 
I do not use any locality specifiers on any of my DC loops, and they all seem to work on CPUs and GPUs with both nvfortran and IFX.
 
This case of two nested DC loops is the first where I have seen this problem, and the code as-is works with nvfortran.
 
Is this a matter of what each compiler decides to do with each variable when no locality specifiers are listed?
Or, is this a "bug" in the IFX compiler?
 
The fact that it works with other compilers and with a single DC loop in IFX seems to imply the latter?
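
For concreteness, I take it the explicit-locality version being suggested would look something like this on the outer loop? This is my guess at which specifier goes on which variable, assuming nt, ntm, npm, and pi_i are ordinary variables rather than named constants (named constants cannot appear in a locality spec):

do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &
                       shared(nt,ntm,npm,x,y,diffusion_coef,dp,dt_i,pi_i)
  ! ... same loop body as in the original post ...
enddo

If that version also gives correct results on the Intel GPU, that would suggest it is the handling of the default (unspecified) locality that goes wrong.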
 
 
 - Ron
Henry_G_Intel
Employee

I just noticed a possible flaw in my reasoning. If ntm and npm have to be shared or local_init, why doesn't nt also have to be shared or local_init? Isn't nt uninitialized in the first inner DC loop of the nested DC block?

Barbara_P_Intel
Employee

After similar discussions with Intel's Fortran compiler team about interpreting the Fortran standard's rules for DO CONCURRENT, it was determined that the shared clause is not required.

A fix is in the works and should be available in the compiler release in mid-2024. 

NOTE to the folks "down under": I deliberately did not say mid-summer. (wink)
