 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Hi,
I have a code that is using "do concurrent" for offload to Intel GPUs.
The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDA GPUs with nvfortran):
do concurrent (i=1:nr)
fn2_fn1 = zero
fs2_fs1 = zero
do concurrent (k=2:npm1) reduce(+:fn2_fn1,fs2_fs1)
fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
+ diffusion_coef(2 ,k,i)) &
* (x(2 ,k,i)  x(1 ,k,i))*dp(k)
fs2_fs1 = fs2_fs1 + (diffusion_coef(nt1 ,k,i) &
+ diffusion_coef(nt ,k,i)) &
* (x(ntm1,k,i)  x(ntm,k,i))*dp(k)
enddo
do concurrent (k=1:npm)
y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
enddo
enddo
However, if I modify the code to make the outermost loop sequential, the code does yield the correct result:
do i=1,nr
fn2_fn1 = zero
fs2_fs1 = zero
do concurrent (k=2:npm1) reduce(+:fn2_fn1,fs2_fs1)
fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
+ diffusion_coef(2 ,k,i)) &
* (x(2 ,k,i)  x(1 ,k,i))*dp(k)
fs2_fs1 = fs2_fs1 + (diffusion_coef(nt1 ,k,i) &
+ diffusion_coef(nt ,k,i)) &
* (x(ntm1,k,i)  x(ntm,k,i))*dp(k)
enddo
do concurrent (k=1:npm)
y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
enddo
enddo
It seems the compiler is not liking having the reduction loop within a DC loop (or maybe just having DC loops within DC loops?)
The NVIDIA compiler handles this by making the outer look across blocks and the two inner loops across threads:
7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x
I wanted to bring this to your attention  for now, I may make the outer loop sequential as "nr" is often small
 Ron
Link Copied
 « Previous

 1
 2
 Next »
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Hi,
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
I just noticed a possible flaw in my reasoning. If ntm and npm have to be shared or local_init, why doesn't nt also have to be shared or local_init? Isn't nt uninitialized in the first inner DC loop of the nested DC block?
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
After similar discussions interpreting the Fortran standard around DO CONCURRENT with Intel's Fortran compiler team, it was determined that the shared clause is not required.
A fix is in the works and should be available in the compiler release in mid2024.
NOTE to the folks "down under": I deliberately did not say midsummer. (wink)
 Subscribe to RSS Feed
 Mark Topic as New
 Mark Topic as Read
 Float this Topic for Current User
 Bookmark
 Subscribe
 Printer Friendly Page
 « Previous

 1
 2
 Next »