Hi,
In our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of an array-reduction, doubly nested do concurrent loop offloaded to Intel GPUs.
The code has the basic form of:
do concurrent(i=1:n)
  s = zero
  ! reduce the k dimension of column i into the scalar s
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  ! write the reduced value back across the column
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
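For anyone who wants to poke at this outside of HipFT, a minimal standalone sketch of the pattern is below. The sizes, initialization, and compile line are my own placeholders rather than the HipFT source, so adjust as needed:

! dc_reduce_test.f90 -- minimal sketch of the nested do concurrent reduction pattern.
! Possible compile line for GPU offload (assumed flags; check your compiler docs):
!   ifx -O2 -qopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent dc_reduce_test.f90
program dc_reduce_test
  implicit none
  integer, parameter :: n = 1024, m = 2048
  real(8), parameter :: zero = 0.0d0
  real(8), allocatable :: array(:,:)
  real(8) :: s
  integer :: i, k

  allocate(array(m,n))
  array = 1.0d0

  do concurrent(i=1:n)
    s = zero
    do concurrent(k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    do concurrent(k=1:m)
      array(k,i) = s
    enddo
  enddo

  ! checksum only so the loops are not optimized away
  print *, 'checksum: ', sum(array)
end program dc_reduce_test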
This pattern turned out to be very slow on the GPU because of how the compiler chose to parallelize the loop nest.
We found that a "hack" using OpenMP directives fixes the performance issue (the directives take effect because the do concurrent implementation uses an OpenMP target back-end). It looks like this:
do concurrent(i=1:n)
  s = zero
  ! directive to steer how the back-end parallelizes the inner reduction loop
  !$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  !$omp parallel loop
  do concurrent(k=1:m)
    array(k,i) = s
  enddo
enddo
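As an aside, and only as a sketch I have not benchmarked in HipFT, the inner reduction can also be written with the sum intrinsic, which removes the nested reduce loop entirely:

do concurrent(i=1:n)
  ! sum() is evaluated before the assignment, so the column sum is
  ! broadcast over column i without an explicit nested reduction loop
  array(:,i) = sum(array(:,i))
enddo

Whether the compiler maps the intrinsic to the GPU any better than the explicit nested loops is a separate question; I mention it only for comparison with the directive hack.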
That work was done before the 2025 compiler release.
I recently installed and tested the newest compiler, and the issue is still present.
Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of or assigned to this issue, so I thought I would post about it here.
If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?
Thanks!
- Ron
P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is in the 'waccpd24_intel_tmp' branch.
We are looking into this.
Hi,
Just an update on this issue.
Using the 2025.2 compiler, the performance gap between the nested DC loops as-is and the "hack" still exists.
Running the "examples/flux_transport_1rot_flowAa_diff_r8" example case from the GitHub repository with the main branch gives:
                          Avg      Min      Max   S. Dev
                          ---      ---      ---   ------
Wall clock time:      285.861  285.861  285.861    0.000
--> Setup:              1.798    1.798    1.798    0.000
--> Update:             9.297    9.297    9.297    0.000
--> Data Assimilation:  0.000    0.000    0.000    0.000
--> Flux transport:   267.234  267.234  267.234    0.000
--> Advection:         60.819   60.819   60.819    0.000
--> Diffusion:        206.415  206.415  206.415    0.000
--> Source:             0.000    0.000    0.000    0.000
--> Analysis:           6.782    6.782    6.782    0.000
--> I/O:                0.762    0.762    0.762    0.000
--> MPI overhead:       0.050    0.050    0.050    0.000

Running it with the "hack" branch (waccpd24_intel_tmp) yields:
                          Avg      Min      Max   S. Dev
                          ---      ---      ---   ------
Wall clock time:      215.105  215.105  215.105    0.000
--> Setup:              1.858    1.858    1.858    0.000
--> Update:             9.096    9.096    9.096    0.000
--> Data Assimilation:  0.000    0.000    0.000    0.000
--> Flux transport:   196.776  196.776  196.776    0.000
--> Advection:         54.419   54.419   54.419    0.000
--> Diffusion:        142.356  142.356  142.356    0.000
--> Source:             0.000    0.000    0.000    0.000
--> Analysis:           6.635    6.635    6.635    0.000
--> I/O:                0.771    0.771    0.771    0.000
--> MPI overhead:       0.050    0.050    0.050    0.000

This is a 25% reduction in run time (a 1.33x speedup).
Are there plans to improve the parallelization heuristics for loops of this kind so that the original code (without the hack) compiles to the same GPU performance as the "hacked" code?
Thanks!
- Ron Caplan