Intel® Fortran Compiler

Performance of nested do concurrent loops (array reduction)

caplanr

Hi,

 

For our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of a doubly nested do concurrent loop containing an array reduction, offloaded to Intel GPUs.

 

The code has the basic form of:

 

 

do concurrent(i=1:n)
  s = zero
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo

 

 

We found that this code ran very slowly on the GPU because of how the compiler chose to parallelize it.

 

We also found a "hack" that fixes the performance issue: adding OpenMP directives to the inner loops. This works because the do concurrent implementation is built on the OpenMP target back-end. It looks like this:

do concurrent(i=1:n)
  s = zero
!$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
!$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
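
For anyone who wants to try this outside the full code, the two variants above can be put side by side in a small stand-alone program. This is only a sketch under stated assumptions, not the hipft code: the program name, array sizes, fill pattern, and crude timing are made up for illustration, and the write-back is indexed as (k,i) so that the outer iterations stay independent. It needs a compiler that supports the Fortran 2023 reduce clause on do concurrent.

```fortran
! Stand-alone sketch of the loop pattern from the post.
! Names, sizes, and data are illustrative assumptions, not taken from hipft.
! Assumed ifx compile line for GPU offload (please check the compiler docs):
!   ifx -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent repro.f90
program dc_reduce_repro
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  integer, parameter :: n = 512, m = 2048        ! hypothetical problem size
  real(real64), parameter :: zero = 0.0_real64
  real(real64), allocatable :: a0(:,:), a(:,:), b(:,:)
  real(real64) :: s
  integer :: i, k
  integer(int64) :: t0, t1, t2, rate

  allocate(a0(m,n), a(m,n), b(m,n))

  ! Small integer values keep every column sum exact in double precision,
  ! so the results can be compared with == regardless of summation order.
  do i = 1, n
    do k = 1, m
      a0(k,i) = real(mod(k+i, 7), real64)
    enddo
  enddo
  a = a0
  b = a0

  call system_clock(t0, rate)

  ! Variant 1: plain nested do concurrent.
  do concurrent (i=1:n)
    s = zero
    do concurrent (k=1:m) reduce(+:s)
      s = s + a(k,i)
    enddo
    do concurrent (k=1:m)
      a(k,i) = s
    enddo
  enddo

  call system_clock(t1)

  ! Variant 2: same loops with the OpenMP directives from the post.
  do concurrent (i=1:n)
    s = zero
!$omp parallel loop
    do concurrent (k=1:m) reduce(+:s)
      s = s + b(k,i)
    enddo
!$omp parallel loop
    do concurrent (k=1:m)
      b(k,i) = s
    enddo
  enddo

  call system_clock(t2)

  ! Every element of column i must equal the sum of the original column i.
  do i = 1, n
    if (a(1,i) /= sum(a0(:,i)) .or. any(a(:,i) /= a(1,i))) error stop "variant 1 wrong"
  enddo
  if (any(a /= b)) error stop "variants disagree"

  ! Crude single-shot timing: the first offloaded region may include device
  ! start-up cost, so a real benchmark should warm up and repeat.
  print '(a,f10.4,a)', "plain do concurrent: ", real(t1-t0, real64)/real(rate, real64), " s"
  print '(a,f10.4,a)', "with omp directives: ", real(t2-t1, real64)/real(rate, real64), " s"
  print '(a)', "results match"
end program dc_reduce_repro
```

Without OpenMP enabled, the !$omp lines are plain comments, so both variants should behave identically and the program only checks correctness.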

 

We did this work before the 2025 compiler release.

I have recently installed and tested the newest compiler, and the issue is still present.

 

Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of this issue or assigned to it, so I thought I would post about it here.

 

If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?

 

Thanks!

 

 - Ron

 

P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is in the "waccpd24_intel_tmp" branch.

Shiquan_Su

We are looking into this.

