Performance of nested do concurrent loops (array reduction)

caplanr · 12-16-2024

Hi,

In our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of a doubly nested do concurrent loop with an array reduction, offloaded to Intel GPUs.

The code has the basic form of:

do concurrent(i=1:n)
  s = zero
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo

We found that this code was very slow on the GPU because of how the compiler chose to parallelize it.

We also found a "hack" that fixes the performance issue: adding OpenMP directives to the inner loops, which works because the do concurrent implementation uses the OpenMP target back end. It looks like this:

do concurrent(i=1:n)
  s = zero
!$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
!$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo

This was done before the 2025 compiler release.

I recently installed and tested the newest compiler, and the issue is still present.

Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of this issue or assigned to it, so I thought I would post about it here.

If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?

Thanks!

- Ron

P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is in the "waccpd24_intel_tmp" branch.
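
In case a standalone test case is useful, here is a minimal sketch of the loop pattern described above. It is not the HipFT code itself: the program name, the sizes n and m, the fill value, and the PASS/FAIL check are assumptions added for illustration, and I am assuming the result is written back to array(k,i). The two `!$omp parallel loop` lines from the workaround could be placed directly before the inner loops to compare timings.

```fortran
program dc_reduce_sketch
  ! Hypothetical standalone version of the nested do concurrent reduction.
  ! n, m, and the fill value are illustrative assumptions, not HipFT values.
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 512, m = 1024
  real(real64), parameter :: zero = 0.0_real64
  real(real64), allocatable :: array(:,:)
  real(real64) :: s
  integer :: i, k

  allocate(array(m,n))
  array = 1.0_real64          ! every column then sums to m

  do concurrent (i=1:n)
    s = zero
    ! Inner reduction over column i (reduce clause, Fortran 2023).
    do concurrent (k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    ! Write the column sum back into every element of column i.
    do concurrent (k=1:m)
      array(k,i) = s
    enddo
  enddo

  ! With the fill value above, each element should now equal m.
  if (all(array == real(m,real64))) then
    print *, 'PASS'
  else
    print *, 'FAIL'
  end if
end program dc_reduce_sketch
```

The sizes are small enough that this only checks correctness; they would need to be increased (and the outer loop timed) to show the slowdown on a GPU.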

Shiquan_Su · 12-19-2024

We are looking into this.