Hi,
In our recently published paper (https://arxiv.org/pdf/2408.07843), I worked with Henry Gabb and Shiquan Su on the performance of a doubly nested do concurrent array-reduction loop offloaded to Intel GPUs.
The code has the basic form of:
do concurrent(i=1:n)
  s = zero
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  do concurrent(k=1:m)
    array(k) = s
  enddo
enddo
We found that this code was very slow on the GPU because of how the compiler chose to parallelize it.
We also found a "hack" that fixes the performance issue: adding OpenMP directives to the inner loops, which works because the do concurrent implementation uses the OpenMP target back-end. It looks like this:
do concurrent(i=1:n)
  s = zero
  !$omp parallel loop
  do concurrent(k=1:m) reduce(+:s)
    s = s + array(k,i)
  enddo
  !$omp parallel loop
  do concurrent(k=1:m)
    array(k) = s
  enddo
enddo
This was done before the 2025 compiler release.
I recently installed and tested the newest compiler, and the issue is still present.
Given the recent personnel changes at Intel, I am not sure whether anyone on the compiler team is aware of, or assigned to, this issue, so I thought I would post about it here.
If this is already being worked on, do you have an ETA for a compiler release version that will not require the above OpenMP directives to achieve good GPU performance on this code?
Thanks!
- Ron
P.S. The full code that this comes from can be found at github.com/predsci/hipft, and the modified code is included in the "waccpd24_intel_tmp" branch.
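For anyone who wants to try the loop pattern without building the full code, here is a minimal self-contained sketch of the basic form above. The program name, the sizes n and m, the double precision type, and the all-ones initialization are placeholders chosen for illustration, not taken from the real code. The write-back uses array(k,i), which is an assumption about the intended indexing. The reduce clause is a Fortran 2023 feature, so a compiler that supports it is required.

```fortran
program dc_reduce_sketch
  implicit none
  ! Hypothetical sizes and values, chosen only for illustration.
  integer, parameter :: n = 4, m = 8
  double precision, parameter :: zero = 0.0d0
  double precision :: array(m,n), s
  integer :: i, k

  array = 1.0d0

  do concurrent(i=1:n)
    s = zero
    ! Workaround site 1: an OpenMP "parallel loop" directive would go here.
    do concurrent(k=1:m) reduce(+:s)
      s = s + array(k,i)
    enddo
    ! Workaround site 2: an OpenMP "parallel loop" directive would go here.
    do concurrent(k=1:m)
      array(k,i) = s   ! assumed indexing: write the column sum back to column i
    enddo
  enddo

  ! With all-ones input, every element should now equal m.
  print *, all(array == dble(m))
end program dc_reduce_sketch
```

With ifx, GPU offload of do concurrent is typically enabled with flags along the lines of -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent; please check the documentation for your compiler version, since I have not verified them against every release.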
We are looking into this.
