Hi,
In our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of an array-reduction, doubly nested do concurrent loop offloaded to Intel GPUs.
The code has the basic form of:
do concurrent(i=1:n)
  s = zero
  ! reduce the k dimension of column i into the scalar s
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  ! write the reduced value back across the column
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
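For anyone who wants to poke at this outside of HipFT, a minimal standalone sketch of the pattern is below. The sizes, initialization, and compile line are my own placeholders rather than the HipFT source, so adjust as needed:

! dc_reduce_test.f90 -- minimal sketch of the nested do concurrent reduction pattern.
! Possible compile line for GPU offload (assumed flags; check your compiler docs):
!   ifx -O2 -qopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent dc_reduce_test.f90
program dc_reduce_test
  implicit none
  integer, parameter :: n = 1024, m = 2048
  real(8), parameter :: zero = 0.0d0
  real(8), allocatable :: array(:,:)
  real(8) :: s
  integer :: i, k

  allocate(array(m,n))
  array = 1.0d0

  do concurrent(i=1:n)
    s = zero
    do concurrent(k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    do concurrent(k=1:m)
      array(k,i) = s
    enddo
  enddo

  ! checksum only so the loops are not optimized away
  print *, 'checksum: ', sum(array)
end program dc_reduce_test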
This pattern turned out to be very slow on the GPU because of how the compiler chose to parallelize the loop nest.
We found that a "hack" using OpenMP directives fixes the performance issue (the directives take effect because the do concurrent implementation uses an OpenMP target back-end). It looks like this:
do concurrent(i=1:n)
  s = zero
  ! directive to steer how the back-end parallelizes the inner reduction loop
  !$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  !$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
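As an aside, and only as a sketch I have not benchmarked in HipFT, the inner reduction can also be written with the sum intrinsic, which removes the nested reduce loop entirely:

do concurrent(i=1:n)
  ! sum() is evaluated before the assignment, so the column sum is
  ! broadcast over column i without an explicit nested reduction loop
  array(:,i) = sum(array(:,i))
enddo

Whether the compiler maps the intrinsic to the GPU any better than the explicit nested loops is a separate question; I mention it only for comparison with the directive hack.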
That work was done before the 2025 compiler release.
I recently installed and tested the newest compiler, and the issue is still present.
Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of or assigned to this issue, so I thought I would post about it here.
If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?
Thanks!
- Ron
P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is in the 'waccpd24_intel_tmp' branch.
We are looking into this.
Hi,
Just an update on this issue.
Using the 2025.2 compiler, the performance gap between the nested DC loops as-is and the "hack" still exists.
Running the "examples/flux_transport_1rot_flowAa_diff_r8" example case from the GitHub repository with the main branch gives:
                          Avg      Min      Max   S. Dev
                          ---      ---      ---   ------
Wall clock time:      285.861  285.861  285.861    0.000
--> Setup:              1.798    1.798    1.798    0.000
--> Update:             9.297    9.297    9.297    0.000
--> Data Assimilation:  0.000    0.000    0.000    0.000
--> Flux transport:   267.234  267.234  267.234    0.000
--> Advection:         60.819   60.819   60.819    0.000
--> Diffusion:        206.415  206.415  206.415    0.000
--> Source:             0.000    0.000    0.000    0.000
--> Analysis:           6.782    6.782    6.782    0.000
--> I/O:                0.762    0.762    0.762    0.000
--> MPI overhead:       0.050    0.050    0.050    0.000

Running it with the "hack" branch (waccpd24_intel_tmp) yields:
                          Avg      Min      Max   S. Dev
                          ---      ---      ---   ------
Wall clock time:      215.105  215.105  215.105    0.000
--> Setup:              1.858    1.858    1.858    0.000
--> Update:             9.096    9.096    9.096    0.000
--> Data Assimilation:  0.000    0.000    0.000    0.000
--> Flux transport:   196.776  196.776  196.776    0.000
--> Advection:         54.419   54.419   54.419    0.000
--> Diffusion:        142.356  142.356  142.356    0.000
--> Source:             0.000    0.000    0.000    0.000
--> Analysis:           6.635    6.635    6.635    0.000
--> I/O:                0.771    0.771    0.771    0.000
--> MPI overhead:       0.050    0.050    0.050    0.000

This is a 25% reduction in run time (a 1.33x speedup).
Are there plans to improve the parallelization heuristics for loops of this kind so that the original code (without the hack) compiles to the same GPU performance as the "hacked" code?
Thanks!
- Ron Caplan