Hi,
I have a code that uses "do concurrent" to offload to Intel GPUs.
The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):
do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) &
                       + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) &
                       + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(  1,k,i) = fn2_fn1*dt_i(  1)*dt_i(  1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo
However, if I modify the code to make the outer-most loop sequential, the code does yield the correct result:
do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) &
                       + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) &
                       + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(  1,k,i) = fn2_fn1*dt_i(  1)*dt_i(  1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo
It seems the compiler does not like having the reduction loop within a DC loop (or maybe just having DC loops within DC loops?).
The NVIDIA compiler handles this by mapping the outer loop across blocks and the two inner loops across threads:
7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x
I wanted to bring this to your attention. For now, I may make the outer loop sequential, as "nr" is often small.
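For anyone who wants to try this outside the full application, a standalone test along the following lines may help isolate the behavior. It is only a sketch: the program name, the array sizes, the input values, and the y_ref comparison are all made up for illustration, and the relation between nt and ntm is a guess. It runs the nested DC version from above, recomputes the same quantities with plain sequential loops, and compares the two. It needs a compiler that accepts the Fortran 2023 reduce clause.

```fortran
program nested_dc_repro
  ! Hypothetical standalone test: sizes and input values are made up.
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nt = 9, ntm = 8, npm = 16, nr = 4
  real(real64), parameter :: zero = 0.0_real64
  real(real64), parameter :: pi_i = 1.0_real64/3.141592653589793_real64
  real(real64) :: diffusion_coef(nt,npm,nr), x(ntm,npm,nr)
  real(real64) :: y(ntm,npm,nr), y_ref(ntm,npm,nr)
  real(real64) :: dp(npm), dt_i(ntm)
  real(real64) :: fn2_fn1, fs2_fs1, err
  integer :: i, j, k

  ! Fill the inputs with simple deterministic values.
  do i = 1, nr
    do k = 1, npm
      do j = 1, nt
        diffusion_coef(j,k,i) = 1.0_real64 + 0.01_real64*real(j+k+i, real64)
      enddo
      do j = 1, ntm
        x(j,k,i) = real(j*j, real64) + 0.1_real64*real(k, real64) &
                 + real(i, real64)
      enddo
    enddo
  enddo
  dp = 0.5_real64
  dt_i = 2.0_real64
  y = zero
  y_ref = zero

  ! Variant under test: reduction loop nested inside an outer DC loop.
  do concurrent (i=1:nr)
    fn2_fn1 = zero
    fs2_fs1 = zero
    do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                        * (x(2,k,i) - x(1,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                        * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    do concurrent (k=1:npm)
      y(  1,k,i) = fn2_fn1*dt_i(  1)*dt_i(  1)*pi_i
      y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo

  ! Reference: the same arithmetic with plain sequential loops.
  do i = 1, nr
    fn2_fn1 = zero
    fs2_fs1 = zero
    do k = 2, npm-1
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                        * (x(2,k,i) - x(1,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                        * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    do k = 1, npm
      y_ref(  1,k,i) = fn2_fn1*dt_i(  1)*dt_i(  1)*pi_i
      y_ref(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo

  ! Allow a small tolerance because the reduction order may differ.
  err = maxval(abs(y - y_ref))
  print *, 'max abs difference: ', err
  if (err > 1.0e-10_real64*maxval(abs(y_ref))) error stop 'mismatch'
end program nested_dc_repro
```

Building the same source once for the CPU and once with GPU offload enabled should show whether the nested form alone triggers the discrepancy.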
-- Ron
Hi,
I just noticed a possible flaw in my reasoning. If ntm and npm have to be shared or local_init, why doesn't nt also have to be shared or local_init? Isn't nt uninitialized in the first inner DC loop of the nested DC block?
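To make the question concrete, this is roughly what explicit locality specifiers on these loops could look like. It is a sketch under assumptions, not a statement of what the standard or any compiler requires: the program name, sizes, and input values are hypothetical, pi_i is set to 1 only so the result can be checked exactly, and nt, ntm, and npm are assumed to be ordinary integer variables. It needs a compiler that supports the local, shared, and reduce clauses.

```fortran
program dc_locality_sketch
  ! Hypothetical example: values are chosen so the result is easy to check.
  implicit none
  integer :: nt, ntm, npm, nr, i, k
  real :: fn2_fn1, fs2_fs1
  real, allocatable :: diffusion_coef(:,:,:), x(:,:,:), y(:,:,:)
  real, allocatable :: dp(:), dt_i(:)
  real, parameter :: zero = 0.0
  real, parameter :: pi_i = 1.0   ! placeholder value, not 1/pi

  nt = 5; ntm = 4; npm = 6; nr = 3
  allocate(diffusion_coef(nt,npm,nr), x(ntm,npm,nr), y(ntm,npm,nr))
  allocate(dp(npm), dt_i(ntm))
  diffusion_coef = 1.0
  dp = 1.0
  dt_i = 1.0
  y = zero
  do i = 1, ntm
    x(i,:,:) = real(i)
  enddo

  ! The accumulators are private to each outer iteration; the size
  ! variables are only read, so they are listed as shared.
  do concurrent (i=1:nr) local(fn2_fn1, fs2_fs1) shared(nt, ntm, npm)
    fn2_fn1 = zero
    fs2_fs1 = zero
    do concurrent (k=2:npm-1) shared(nt, ntm) reduce(+:fn2_fn1,fs2_fs1)
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                        * (x(2,k,i) - x(1,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                        * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    do concurrent (k=1:npm) shared(ntm)
      y(  1,k,i) = fn2_fn1*dt_i(  1)*dt_i(  1)*pi_i
      y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo

  ! With these inputs each sum has npm-2 = 4 terms of +2 or -2.
  if (all(y(1,:,:) == 8.0) .and. all(y(ntm,:,:) == -8.0)) then
    print *, 'ok'
  else
    error stop 'unexpected result'
  endif
end program dc_locality_sketch
```

In this form nt is treated exactly like ntm and npm, which is the symmetry the question is about.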
After similar discussions with Intel's Fortran compiler team about how the Fortran standard treats DO CONCURRENT, it was determined that the shared clause is not required.
A fix is in the works and should be available in the compiler release in mid-2024.
NOTE to the folks "down under": I deliberately did not say mid-summer. (wink)