Intel® Fortran Compiler

Performance of nested do concurrent loops (array reduction)

caplanr

Hi,

 

In our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of a doubly nested do concurrent loop with an array reduction, offloaded to Intel GPUs.

 

The code has the basic form:

do concurrent(i=1:n)
  s = zero
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
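
For anyone who wants to reproduce this outside the full application, here is a minimal self-contained sketch of the same loop structure. The program name, the sizes n and m, the fill values, and the final check are assumptions for illustration only and are not taken from HipFT; the write-back is assumed to target array(k,i), since array is rank 2.

```fortran
program nested_dc_sketch
  implicit none
  ! Hypothetical problem sizes, chosen only for illustration.
  integer, parameter :: n = 4, m = 8
  real, parameter :: zero = 0.0
  real :: array(m,n), s
  integer :: i, k

  ! Fill column i with the value i, so the column sum should be m*i.
  do i = 1, n
    array(:,i) = real(i)
  enddo

  ! Same structure as the loop in the post: an inner sum reduction over k,
  ! followed by writing the sum back to every element of column i.
  do concurrent (i=1:n)
    s = zero
    do concurrent (k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    do concurrent (k=1:m)
      array(k,i) = s
    enddo
  enddo

  ! Simple correctness check; these small integer-valued sums are exact in real.
  do i = 1, n
    if (any(array(:,i) /= real(m*i))) error stop "unexpected column sum"
  enddo
  print *, "all column sums match"
end program nested_dc_sketch
```

This needs a compiler that supports the reduce locality specifier on do concurrent. It only checks correctness; the sizes are far too small to say anything about GPU performance.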

 

 

This code ran very slowly on the GPU because of how the compiler chose to parallelize it.

We found a "hack" that fixes the performance problem: adding OpenMP directives to the inner loops. This works because the do concurrent implementation uses the OpenMP target back-end. It looks like this:

do concurrent(i=1:n)
  s = zero
!$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
!$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo

 

This was done before the 2025 compiler release. I have since installed and tested the newest compiler, and the issue is still present.

 

Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of this issue or assigned to it, so I thought I would post about it here.

 

If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?

 

Thanks!

 

 - Ron

 

P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is in the 'waccpd24_intel_tmp' branch.

 

 

 

 

 

 

 

1 Reply
Shiquan_Su
Moderator

We are looking into this.

