Intel® Fortran Compiler

Do Concurrent offload getting incorrect results

caplanr
New Contributor I

Hi,

 

I have a code that uses "do concurrent" (DC) loops for offload to Intel GPUs.

 

The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):

 

do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

However, if I modify the code to make the outermost loop sequential, the code does yield the correct result:


do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

 

It seems the compiler does not like having the reduction loop inside a DC loop (or maybe just nesting DC loops within DC loops?).
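
In case it helps, here is a minimal, self-contained sketch of the same pattern (a DC outer loop containing a DC reduction loop). The array sizes and data are made up purely to illustrate the structure, and the scalar "s" deliberately carries no locality specifier, mirroring fn2_fn1/fs2_fs1 above:

program nested_dc_reduce
  implicit none
  integer, parameter :: nr = 4, np = 64
  integer :: i, k
  real(8) :: a(np,nr), y_seq(nr), y_dc(nr), s

  call random_number(a)

  ! Reference: sequential outer loop, DC reduction loop inside (the working case).
  do i=1,nr
    s = 0.0d0
    do concurrent (k=1:np) reduce(+:s)
      s = s + a(k,i)
    enddo
    y_seq(i) = s
  enddo

  ! Same pattern with a DC outer loop (the pattern that fails on the Intel GPU
  ! in the real code); s has no locality specifier, as in the original loops.
  do concurrent (i=1:nr)
    s = 0.0d0
    do concurrent (k=1:np) reduce(+:s)
      s = s + a(k,i)
    enddo
    y_dc(i) = s
  enddo

  print *, 'max abs difference: ', maxval(abs(y_dc - y_seq))
end program nested_dc_reduce

Comparing y_dc against y_seq should show whether the nesting alone is enough to trigger the wrong results.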

 

The NVIDIA compiler handles this by parallelizing the outer loop across thread blocks and the two inner loops across threads:

7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x

 

I wanted to bring this to your attention. For now, I may make the outer loop sequential, since "nr" is often small.

 -- Ron

 

 

23 Replies
caplanr
New Contributor I

Hi,

 
Thanks for the info!
 
If both "shared" and "local_init" worked, then does that mean the default behavior was to use "local"?
 
I do not use any locality specifiers on any of my DC loops, and they all seem to work on CPUs and GPUs with both nvfortran and IFX.
 
This case of two nested DC loops is the first where I have seen this problem, and the code as-is works with nvfortran.
 
Is this a matter of what each compiler decides to do with each variable when no locality specifiers are listed?
Or, is this a "bug" in the IFX compiler?
 
The fact that it works with other compilers and with a single DC loop in IFX seems to imply the latter?
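
For concreteness, I take it the explicit-locality version being suggested would look something like this on the outer loop? This is my guess at which specifier goes on which variable, assuming nt, ntm, npm, and pi_i are ordinary variables rather than named constants (named constants cannot appear in a locality spec):

do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &
                       shared(nt,ntm,npm,x,y,diffusion_coef,dp,dt_i,pi_i)
  ! ... same loop body as in the original post ...
enddo

If that version also gives correct results on the Intel GPU, that would suggest it is the handling of the default (unspecified) locality that goes wrong.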
 
 
 - Ron
Henry_G_Intel
Employee

I just noticed a possible flaw in my reasoning. If ntm and npm have to be shared or local_init, why doesn't nt also have to be shared or local_init? Isn't nt uninitialized in the first inner DC loop of the nested DC block?

Barbara_P_Intel
Employee

After similar discussions with Intel's Fortran compiler team about interpreting the Fortran standard's rules for DO CONCURRENT, it was determined that the shared clause is not required.

A fix is in the works and should be available in the compiler release in mid-2024. 

NOTE to the folks "down under": I deliberately did not say mid-summer. (wink)
