Hi,
In our recently published paper (https://arxiv.org/pdf/2408.07843), Henry Gabb, Shiquan Su, and I worked on the performance of an array-reduction, double-nested do concurrent loop offloaded to Intel GPUs.
The code has the basic form:

do concurrent(i=1:n)
  s = zero
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
We found that this code was very slow on the GPU due to the compiler's choice of how to parallelize it. We also found a "hack" that fixes the performance issue by adding OpenMP directives (this works because the do concurrent implementation uses an OpenMP target back end), which looks like this:
do concurrent(i=1:n)
  s = zero
  !$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  !$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
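For anyone who wants to try this pattern in isolation, here is a minimal self-contained sketch (illustrative sizes and a uniform fill, not the actual HipFT kernel; I also added local(s) so each outer iteration gets its own private accumulator):

```fortran
! Minimal reproducer sketch of the nested do concurrent reduction pattern.
! Sizes and data are hypothetical placeholders, not from HipFT.
program dc_reduce_repro
  implicit none
  integer, parameter :: n = 1024, m = 1024
  real(8), parameter :: zero = 0.0d0
  real(8), allocatable :: array(:,:)
  real(8) :: s
  integer :: i, k

  allocate(array(m,n))
  array = 1.0d0   ! every column sums to m

  ! Outer loop over columns; inner reduction sums a column,
  ! then the second inner loop broadcasts the sum back into it.
  do concurrent(i=1:n) local(s)
    s = zero
    do concurrent(k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    do concurrent(k=1:m)
      array(k,i) = s
    enddo
  enddo

  ! With the uniform fill above, every entry now holds the column sum.
  print *, array(1,1)
end program dc_reduce_repro
```

With all entries initialized to 1.0, each column sum is m, so the program should print 1024.0.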
This was done before the 2025 compiler release.
I recently installed and tested the newest compiler, and this issue is still present.
Due to the recent personnel changes at Intel, I am not sure if anyone on the compiler team is aware of or assigned to this issue, so I thought I would make a post here about it.
If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?
Thanks!
- Ron
P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is included in the 'waccpd24_intel_tmp' branch.
We are looking into this.
Hi,
Just an update on this issue.
Using the 2025.2 compiler, the performance gap between running the nested DC loops as-is and using the "hack" still exists.
Running the "examples/flux_transport_1rot_flowAa_diff_r8" example case from the GitHub repository with the main branch gives:
Avg Min Max S. Dev
--- --- --- ------
Wall clock time: 285.861 285.861 285.861 0.000
--> Setup: 1.798 1.798 1.798 0.000
--> Update: 9.297 9.297 9.297 0.000
--> Data Assimilation: 0.000 0.000 0.000 0.000
--> Flux transport: 267.234 267.234 267.234 0.000
--> Advecton: 60.819 60.819 60.819 0.000
--> Diffusion 206.415 206.415 206.415 0.000
--> Source: 0.000 0.000 0.000 0.000
--> Analysis: 6.782 6.782 6.782 0.000
--> I/O: 0.762 0.762 0.762 0.000
--> MPI overhead: 0.050 0.050 0.050 0.000

Running it with the "hack" branch (waccpd24_intel_tmp) yields:
Avg Min Max S. Dev
--- --- --- ------
Wall clock time: 215.105 215.105 215.105 0.000
--> Setup: 1.858 1.858 1.858 0.000
--> Update: 9.096 9.096 9.096 0.000
--> Data Assimilation: 0.000 0.000 0.000 0.000
--> Flux transport: 196.776 196.776 196.776 0.000
--> Advecton: 54.419 54.419 54.419 0.000
--> Diffusion 142.356 142.356 142.356 0.000
--> Source: 0.000 0.000 0.000 0.000
--> Analysis: 6.635 6.635 6.635 0.000
--> I/O: 0.771 0.771 0.771 0.000
--> MPI overhead: 0.050 0.050 0.050 0.000

This is a 25% reduction in run time (a 1.33x speedup).
Are there plans to improve the heuristics for loops of this kind so that the original code (without the hack) compiles to the same performance as the "hacked" code?
Thanks!
- Ron Caplan
Hi,
Just writing that this nested loop is still less than optimal with the new 2026 compiler.
I also found that the "hack" mentioned above now crashes the code (which is fair, since it is not within the OpenMP spec).
However, now that the hack is unavailable, the nested DC loop must be used as-is (the main branch), which carries a performance hit.
I will continue to test the benchmark with new versions of IFX as they come out to see whether the speed of such loops has improved, and if so I will post here.
However, since the performance hit has been reduced since this issue first appeared, I am closing this ticket.
- Ron