Hi,
I have code that uses "do concurrent" for offloading to Intel GPUs.
The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU, and on NVIDIA GPUs with nvfortran):
do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i)    &
                       + diffusion_coef(2 ,k,i))   &
                       * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i))  &
                       * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo
However, if I modify the code to make the outermost loop sequential, it yields the correct result:
do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i)    &
                       + diffusion_coef(2 ,k,i))   &
                       * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i))  &
                       * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo
It seems the compiler does not like having the reduction loop inside a DC loop (or perhaps just having DC loops nested within DC loops?).
The NVIDIA compiler handles this by parallelizing the outer loop across blocks and the two inner loops across threads:
7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x
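Roughly, that mapping corresponds to the following OpenMP offload structure (an illustrative analogy only, not the code either compiler actually generates; the directives below are my own sketch around the same loops):
  !$omp target teams distribute private(fn2_fn1,fs2_fs1)  ! outer loop across blocks/teams
  do i=1,nr
    fn2_fn1 = zero
    fs2_fs1 = zero
    !$omp parallel do reduction(+:fn2_fn1,fs2_fs1)        ! inner reduction loop across threads
    do k=2,npm-1
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i))     &
                          *(x(2,k,i) - x(1,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                          *(x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    !$omp parallel do                                     ! second inner loop across threads
    do k=1,npm
      y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
      y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo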
I wanted to bring this to your attention. For now, I may make the outer loop sequential, since "nr" is often small.
-- Ron
What version of ifx are you using? A new version, 2024.0.0, was released a couple of weeks ago as part of the HPC Toolkit.
Do you have a complete reproducer including the output you expect?
What compiler options are you using?
Thanks! That info will help with the triage.
Hi,
I am using the latest 24.0:
EDGE_GPU_INTEL: ~ $ ifx -v
ifx version 2024.0.0
The compiler options I am using are:
FC = mpif90 -f90=ifx
FFLAGS = -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc"
(the same issue happens on the MAX GPU with "-device pvc")
I will be releasing the full code on GitHub in a few days, at which point I can post it here.
To reproduce, you would need to replace that small section with the first version I posted.
Then you can run the testsuite and see that the tests fail, whereas with the revised code (the one on git) they succeed.
I will post back here when the code is released, with details on how to test it.
-- Ron
I should have been more specific about the reproducer. A small reproducer is best: one that we can compile and run quickly, preferably without MPI.
I meant "complete" in that you only supplied some loops.
Hi,
It should be straightforward to create a reproducer using the two versions of the loops I sent (I do not have time to make one at the moment).
Otherwise, when the code comes out tomorrow, it can be used to reproduce the error as described above.
-- Ron
"Should be straightforward..." People can spend a lot of time trying to figure out what the 'obvious' but missing vital detail is. On a personal level, I will say that the exercise of making a simple reproducer has often turned up additional details. It is better to make a reproducer when you have the time: the support staff will show much more interest, and we all then get a resolution and a better product!
"I do not have time to make one currently." Sad.
As my old workmate used to say, there are three work days in 24 hours; I would use the spare one. Of course, after three days of working you tend to see pink elephants, but as long as you realize they do not exist you are still sane.
Courtesy never hurt anyone, although interestingly, this is the second "challenging" post this afternoon.
A lot of the people who help out here do it for free, and most would not ask a question whose answer can be found with a simple search.
In essence, RTFM.
Assuming fn2_fn1 and fs2_fs1 are scalars, nothing jumps out as being obviously problematic. If they are scalars, we will assume they are LOCAL_INIT. Otherwise, we will assume they are SHARED, which will cause problems.
You could try adding a locality-spec. It not only helps the compiler; it also tells whoever inherits your code 30 years from now what your data-locality intent was.
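For example, a fully explicit version of your outer loop might look like the following (just a sketch; it assumes fn2_fn1 and fs2_fs1 are the only scalars defined inside the construct, and that the bounds and arrays are set up outside it):
  ! Sketch: explicit locality on the outer DO CONCURRENT.
  ! local(...) gives each iteration its own accumulators; shared(...) states that
  ! the bounds and arrays are the same objects in every iteration.
  do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &
                         shared(ntm,npm,nt,x,y,diffusion_coef,dp,dt_i)
    ! ... inner DO CONCURRENT loops unchanged ...
  enddo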
Finally, have you tried this without AOT (i.e., without -Xopenmp-target-backend)?
Hi,
Yes they are scalars.
I have not used locality specifiers because several of the compilers we use have not supported them, and we need to maintain compatibility.
Also, the default behavior of every compiler that supports DC is exactly the locality behavior we require/expect, so adding them would simply be redundant.
As for 30 years from now, I thank you for the suggestion, but our code's style guide is clear: we never reference arrays without indicating that they are arrays, either through explicit indexing or, when using array syntax, by writing a(:). Therefore, since we provide documentation, it will be known that these are scalars.
"Finally, have you tried this without AOT ( without Xopenmp-target-backend )?"
Yes - the code works fine on the CPU on all compilers.
I must say that so far none of these responses have been helpful, and being cursed at by a fellow user is neither professional nor useful.
I was simply trying to help by pointing out a compiler bug: code that works for GPU offload with other compilers fails with IFX in the worst way, by producing a wrong answer.
I may refrain from posting in this forum and rely on my direct Intel contacts from now on.
That said, I can provide a simple reproducer when I am back from travel (as it seems a reproducer is now required) if you are still interested in fixing this bug.
-- Ron
I did not mean to imply that this is not a bug; I think it is. Filing a bug report requires that I find an hour to cook up a test case.
The -Xopenmp-target-backend option passes the offload code to the device compiler and creates a device-specific binary. If you leave that option off, we create SPIR-V code that is JIT-compiled by that same device compiler at runtime. There are often differences between JIT and AOT code. The error you are seeing is most likely in the IGC compiler that our Fortran compiler uses to create the device code (or to JIT it at runtime); I do think there is a problem in IGC codegen. That it runs correctly on the CPU makes me confident our Fortran compiler is "doing the right thing" with the loop nest. The AOT case is probably doing a loop-optimization "trick" that has gone bad; JIT may not be as aggressive in loop transforms. It's worth a test.
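Concretely, based on the options you listed, the comparison would be between something like these two invocations (the source file name is just a placeholder):
AOT: ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device pvc" test.f90
JIT: ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 test.f90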
Hi,
Here is a reproducer.
It should print the same pairs of numbers each time.
On the GPU it outputs the wrong answer for the loop with embedded DC loops.
On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.
On the CPU in serial mode, it produces correct results.
program dc_gpu_bug1
  implicit none
  integer :: i, j, k, nr,npm,nt,ntm,np
  double precision :: fn2_fn1, fs2_fs1
  double precision, dimension(:,:,:), allocatable :: diffusion_coef, x, y
  double precision, dimension(:,:,:,:), allocatable :: coef
  double precision, dimension(:), allocatable :: dp,dt_i
  double precision, parameter :: zero=0.0
  double precision, parameter :: pi_i=1.0d0
  nr = 1
  ntm = 512
  npm = ntm*2
  nt = ntm+1
  np = npm
  allocate (diffusion_coef(nt,np,nr))
  diffusion_coef(:,:,:) = 1.0d0
  allocate (coef(2:ntm-1,2:npm-1,5,nr))
  coef(:,:,:,:) = 0.
  coef(2,:,:,:) = 1.0d0
  coef(3:ntm-2,:,:,:) = 3.0d0
  coef(ntm-1,:,:,:) = 1.0d0
  allocate (dp(npm))
  dp(:) = 1.0d0
  allocate (dt_i(ntm))
  dt_i(:) = 1.0d0
  allocate (x(ntm,npm,nr))
  allocate (y(ntm,npm,nr))
  x(:,:,:) = 1.0d0
  x(1,:,:) = 2.0d0
  x(2,:,:) = 4.0d0
  x(ntm,:,:) = 2.0d0
  x(ntm-1,:,:) = 4.0d0
  y(:,:,:) = zero
  !$omp target enter data map(to:x,y,coef,diffusion_coef,dp,dt_i)
  do concurrent (i=1:nr)
    fn2_fn1 = zero
    fs2_fs1 = zero
    do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i)    &
                         + diffusion_coef(2 ,k,i))   &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                         + diffusion_coef(nt ,k,i))  &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    do concurrent (k=1:npm)
      y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
      y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo
  !$omp target exit data map(from:y)
  print *, y(1,2,1), y(ntm,2,1)
  ! Reset y.
  y(:,:,:) = zero
  !$omp target enter data map(to:y)
  do i=1,nr
    fn2_fn1 = zero
    fs2_fs1 = zero
    do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
      fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i)    &
                         + diffusion_coef(2 ,k,i))   &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
      fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                         + diffusion_coef(nt ,k,i))  &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
    enddo
    do concurrent (k=1:npm)
      y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
      y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
    enddo
  enddo
  !$omp target exit data map(from:y)
  print *, y(1,2,1), y(ntm,2,1)
  !$omp target exit data map(delete:x,coef,diffusion_coef,dp,dt_i)
  deallocate (coef, x, y, dp, dt_i, diffusion_coef)
end program dc_gpu_bug1
Hi,
Like I said, I can send you a reproducer next week.
Otherwise, the code this came from is now on GitHub: github.com/predsci/hipft.
If you find the loop in question, replace it with the first example I posted (with nested DC loops), and then run the testsuite, you will see that the test fails.
The repo provides build scripts for IFX for both CPU and GPU offload.
- Ron
I will open a bug report. Thank you for the reproducer!
ron
I've done some testing with your reproducer using an early version of the next ifx compiler. It prints the correct answers on CPU when compiled with -qopenmp and without.
But with the Intel GPU I don't get the right answers. I'm investigating.
>> On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.
Please use -qopenmp with ifx. -fopenmp is deprecated and behaves differently from -qopenmp.
Hi,
It still segfaults with -qopenmp:
EDGE_GPU_INTEL: ~/bug1 $ ifx -v
ifx version 2024.0.2
EDGE_GPU_INTEL: ~/bug1 $ ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp dc_gpu_bug1_v2.f90 -o test_qopenmp
EDGE_GPU_INTEL: ~/bug1 $ ./test_qopenmp
4088.00000000000 0.000000000000000E+000
4088.00000000000 4088.00000000000
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F9221A54DB0 Unknown Unknown Unknown
libiomp5.so 00007F9221F7236C Unknown Unknown Unknown
libiomp5.so 00007F9221F7233A Unknown Unknown Unknown
libiomp5.so 00007F9221F743DF Unknown Unknown Unknown
test_qopenmp 000000000040D7F8 Unknown Unknown Unknown
test_qopenmp 00000000004076A0 Unknown Unknown Unknown
test_qopenmp 000000000040525D Unknown Unknown Unknown
libc.so.6 00007F9221A3FEB0 Unknown Unknown Unknown
libc.so.6 00007F9221A3FF60 __libc_start_main Unknown Unknown
test_qopenmp 0000000000405175 Unknown Unknown Unknown
- Ron
I tried your original reproducer with your compiler options, CPU only, on two different Linux distros. One ran OK; the other got a segmentation fault.
However, the version of your reproducer with "shared(ntm, npm)" added to the outer DO CONCURRENT, compiled with your compiler options, ran OK on CPU on both distros.
I'm using ifx (IFX) 2024.0.0 20231017.
Now for the GPU version...
The Fortran OpenMP compiler guys said to add "shared(ntm, npm)" to the outer DO CONCURRENT loop.
From section 11.1.7.5 of the F2018 standard:
"The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality."
I made that change and ran successfully on Intel GPU.
My compiler options: ifx -what -fopenmp-target-do-concurrent -qopenmp -fopenmp-targets=spir64
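For reference, here is what that change looks like on the reproducer's outer loop (only the loop header changes; the loop bodies stay exactly as posted):
  ! Outer loop of the reproducer with the suggested locality-spec added.
  ! shared(ntm,npm) states that the bounds are the same variables in every
  ! iteration, so the nested DO CONCURRENT loops see their host values.
  do concurrent (i=1:nr) shared(ntm,npm)
    fn2_fn1 = zero
    fs2_fs1 = zero
    do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
      ! ... reduction body as posted ...
    enddo
    do concurrent (k=1:npm)
      ! ... assignments to y as posted ...
    enddo
  enddo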
Hi,
This works!
However, I think it should work without the "shared".
ntm and npm are scalars, which according to the spec should be treated as local/private here since they do not have the SAVE attribute.
This should be fine as long as they are treated as "firstprivate", in which case the inner DC loop should have correct values.
It seems the compiler is not handling this correctly, since the original reproducer works fine with GCC and nvfortran (nvfortran working on both CPU and GPU).
Unless these other compilers are assuming/helping more than the spec requires?
- Ron
I don't speak "standard" very well; there are others on this forum who do, so I'll let them comment.
I wonder how that statement in the F2018 standard translates to nested DO CONCURRENT?
I don't speak "standard" very well either, so I'll rely on our old pal Michael Klemm (CEO of the OpenMP ARB) to explain:
"OpenMP PRIVATE and FIRSTPRIVATE are different from DO CONCURRENT LOCAL and LOCAL_INIT. Like mentioned before OpenMP data-sharing attributes privatize per thread, while DO CONCURRENT privatizes per iteration. OpenMP’s specification is clear about that a thread receives such a private copy (with initialization in case of FIRSTPRIVATE). DO CONCURRENT on the other hand does not have the notion of threads. It just says the loop can be executed in any order. So, you cannot assume that there are even threads (and how many of them) or some other form of concurrency (like SIMD). So, LOCAL_INIT has to re-init the variable for every iteration as this is the only way to have a well-defined definition of the variable in the loop body for all different concurrent execution mechanisms."
This quote is from a lengthy, ongoing discussion about DO CONCURRENT locality specifiers on fortran-lang. It explains why adding the "shared(ntm, npm)" clause to the outer DO CONCURRENT loop fixes the example code.
Even though OpenMP firstprivate and DO CONCURRENT local_init are not the same, adding "local_init(ntm, npm)" to the outer DO CONCURRENT loop also works because the variables are reinitialized to the correct values in each iteration.
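For completeness, that alternative on the reproducer's outer loop would look like this (sketch only):
  ! Alternative to shared(): local_init gives each iteration its own copies of
  ! ntm and npm, initialized from their values outside the construct, so the
  ! nested loops still see the correct bounds.
  do concurrent (i=1:nr) local_init(ntm,npm)
    ! ... loop body unchanged ...
  enddo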