<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Do Concurrent offload getting incorrect results in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</link>
    <description>&lt;P&gt;I've done some testing with your reproducer using an early version of the next ifx compiler. It prints the correct answers on the CPU when compiled with -qopenmp and without.&lt;/P&gt;&lt;P&gt;But with the Intel GPU I don't get the right answers. I'm investigating.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 02 Jan 2024 22:26:01 GMT</pubDate>
    <dc:creator>Barbara_P_Intel</dc:creator>
    <dc:date>2024-01-02T22:26:01Z</dc:date>
    <item>
      <title>Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1550908#M169782</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a code that is using "do concurrent" for offload to Intel GPUs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;do concurrent (i=1:nr)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fn2_fn1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fs2_fs1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(2 ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(2 ,k,i) - x(1 ,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(nt ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR 
/&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=1:npm)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;However, if I modify the code to make the outer-most loop sequential, the code does yield the correct result:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;do i=1,nr&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fn2_fn1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fs2_fs1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(2 ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(2 ,k,i) - x(1 ,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(nt ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=1:npm)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It seems the compiler is not liking having the reduction loop within a DC loop (or maybe just having DC loops within DC 
loops?)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The NVIDIA compiler handles this by parallelizing the outer loop across blocks and the two inner loops across threads:&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;7367, Generating implicit private(fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Generating NVIDIA GPU code&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7367, Loop parallelized across CUDA thread blocks ! blockidx%x&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7370, Loop parallelized across CUDA threads(128) ! threadidx%x&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Generating reduction(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7378, Loop parallelized across CUDA threads(128) ! threadidx%x&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wanted to bring this to your attention - for now, I may make the outer loop sequential, as "nr" is often small.&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Dec 2023 00:25:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1550908#M169782</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-06T00:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551288#M169799</link>
      <description>&lt;P&gt;What version of ifx are you using? A new version, 2024.0.0, was released a couple of weeks ago as part of the HPC Toolkit.&lt;/P&gt;
&lt;P&gt;Do you have a complete reproducer including the output you expect?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What compiler options are you using?&lt;/P&gt;
&lt;P&gt;Thanks! That info will help with the triage.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Dec 2023 23:29:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551288#M169799</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-12-06T23:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551299#M169803</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using the latest 24.0:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;EDGE_GPU_INTEL: ~ $ ifx -v&lt;BR /&gt;ifx version 2024.0.0&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The compiler options I am using are:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;FC = mpif90 -f90=ifx&lt;/P&gt;&lt;P&gt;FFLAGS = -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc"&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(the same issue happens on the MAX GPU with "-device pvc")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will be releasing the full code on GitHub in a few days, at which point I can post it here.&lt;/P&gt;&lt;P&gt;To reproduce, you would need to replace that small section with the first version I posted.&lt;/P&gt;&lt;P&gt;Then, you can run the testsuite and see that the tests fail, whereas with the revised code (the one on git) it succeeds.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will post back here when the code is released, with the details on how to test it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 00:22:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551299#M169803</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-07T00:22:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551514#M169820</link>
      <description>&lt;P&gt;I should have been more specific about the reproducer. A small reproducer is best: one that we can compile and run quickly, preferably without MPI.&lt;/P&gt;
&lt;P&gt;I meant "complete" in that you only supplied some loops.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 16:42:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551514#M169820</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-12-07T16:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551633#M169822</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It should be straightforward to create a reproducer using the two versions of the loops I sent (I do not have time to make one currently).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Otherwise, when the code comes out tomorrow, it can be used to reproduce the error as described above.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 20:58:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551633#M169822</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-07T20:58:27Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551681#M169827</link>
      <description>&lt;P&gt;Should be straightforward..... People can spend a lot of time trying to figure out what the 'obvious' but missing vital detail is. And on a personal level I will say that the exercise of making a simple reproducer has often thrown up additional details. It is better to make a reproducer when you have time: the support staff will show much more interest, and we all then get some resolution and a better product!&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2023 00:08:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551681#M169827</guid>
      <dc:creator>andrew_4619</dc:creator>
      <dc:date>2023-12-08T00:08:01Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551700#M169829</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I do not have time to make one currently. Sad.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;As my old workmate used to say, there are 3 work days in 24 hours; I would use the spare one. Of course after 3 days of working you can tend to see pink elephants, but as long as you realize they do not exist you are still sane.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Courtesy never hurt anyone, although interestingly, this is the second "challenging" post this afternoon.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;A lot of people who work here do it for free, and most would not ask a question whose answer can be found with a simple search.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In essence, RTFM.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2023 00:59:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551700#M169829</guid>
      <dc:creator>JohnNichols</dc:creator>
      <dc:date>2023-12-08T00:59:21Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1553934#M169960</link>
      <description>&lt;P&gt;&lt;SPAN&gt; Assuming fn2_fn1 and fs2_fs1 are scalars, nothing jumps out as being obviously problematic.&amp;nbsp; If they are scalars, we will assume they are LOCAL_INIT.&amp;nbsp; Otherwise, we will assume they are SHARED, which will cause problems.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You could try adding a &lt;A href="https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/do-concurrent.html" target="_self"&gt;locality-spec&lt;/A&gt;. It not only helps the compiler; someone who inherits your code 30 years from now will also understand your data-locality intent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
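&lt;P&gt;For example, on the loop nest from the original post, explicit locality-specs could look something like this (an untested sketch; REDUCE is F2023, the other locality-specs are F2018):&lt;/P&gt;
&lt;LI-CODE lang="fortran"&gt;! Untested sketch: the original nest with explicit locality-specs added.
do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &amp;amp;
                       shared(x,y,diffusion_coef,dp,dt_i,nt,ntm,npm)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &amp;amp;
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &amp;amp;
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo&lt;/LI-CODE&gt;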
&lt;P&gt;Finally, have you tried this without AOT (without &lt;SPAN&gt;-Xopenmp-target-backend)?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 13:35:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1553934#M169960</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-14T13:35:52Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554036#M169976</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, they are scalars.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have not used locality specifiers because several compilers that we use have not supported them, and we need to maintain compatibility.&lt;/P&gt;&lt;P&gt;Also, the default behavior of every compiler that supports DC is the exact locality behavior we require/expect, so adding them would simply be redundant.&lt;/P&gt;&lt;P&gt;As for 30 years from now, I thank you for your suggestion, but the style guide for our code is clear that we do not use arrays without indicating that they are arrays, either through indexing or, when using array syntax, by writing a(:). Therefore, as we provide documentation, it will be known that they are scalars.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;"&lt;SPAN&gt;Finally, have you tried this without AOT (without &lt;/SPAN&gt;&lt;SPAN class=""&gt;-Xopenmp-target-backend)?"&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;Yes - the code works fine on the CPU with all compilers.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I must say that, so far, none of these responses have been helpful, and being cursed at by a fellow user is not professional or useful.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was simply trying to help out by pointing out a bug in the compiler for code that works for GPU offload with other compilers but was failing for IFX in a bad "wrong answer" way.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I may refrain from posting in this forum and rely on my direct Intel contacts from now on.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That said, I can provide a simple reproducer when I am back from travel (as it seems you require that we provide that now) if you are still interested in fixing this bug.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:15:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554036#M169976</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-14T17:15:12Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554041#M169977</link>
      <description>&lt;P&gt;I did not mean to imply that this is not a bug - it is a bug, I think. Filing a bug report requires that I find an hour to cook up a testcase.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
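&lt;P&gt;Concretely, the two build modes being compared are (options collected from this thread; the source filename is illustrative):&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;# AOT: offload code is compiled to a device-specific binary at build time
ifx -qopenmp -fopenmp-target-do-concurrent -fopenmp-targets=spir64 -Xopenmp-target-backend "-device pvc" reproducer.f90

# JIT: SPIR-V is embedded and compiled for the device at runtime
ifx -qopenmp -fopenmp-target-do-concurrent -fopenmp-targets=spir64 reproducer.f90&lt;/LI-CODE&gt;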
&lt;P&gt;The -Xopenmp-target-backend option passes the offload code to the device compiler and creates a device-specific binary. If you leave that option off, we create SPIR-V code that is JIT-compiled by the device compiler at runtime. There are often differences between JIT and AOT code. The error you are seeing is most likely in the IGC compiler that our Fortran compiler uses to create the device code, or to JIT it at runtime. I do think there is a problem in IGC codegen. The fact that it runs correctly on the CPU makes me fairly sure our Fortran compiler is "doing the right thing" with the loop nest. The AOT case is probably doing a loop-optimization "trick" that has gone bad; JIT may not be as aggressive in loop transforms. It's worth a test.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:22:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554041#M169977</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-14T17:22:38Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554046#M169978</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Like I said, I can send you a reproducer next week.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Otherwise, the code this came from is now on github: (github.com/predsci/hipft).&lt;/P&gt;&lt;P&gt;If you find the loop in question and replace it with the first example I posted (with nested DC loops), when you run the testsuite you will see that the test fails.&lt;/P&gt;&lt;P&gt;The repo provides build scripts for IFX for both CPU and GPU offload.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-&amp;nbsp; Ron&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:27:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554046#M169978</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-14T17:27:39Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554208#M169991</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is a reproducer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It should print the same pairs of numbers each time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the GPU it outputs the wrong answer for the loop with embedded DC loops.&lt;/P&gt;&lt;P&gt;On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.&lt;/P&gt;&lt;P&gt;On the CPU in serial mode, it produces correct results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;program dc_gpu_bug1

    implicit none

    integer                                         :: i, j, k, nr,npm,nt,ntm,np
    double precision                                :: fn2_fn1, fs2_fs1
    double precision, dimension(:,:,:), allocatable :: diffusion_coef, x, y
    double precision, dimension(:,:,:,:), allocatable :: coef
    double precision, dimension(:),     allocatable :: dp,dt_i
    double precision, parameter :: zero=0.0
    double precision, parameter :: pi_i=1.0d0

    nr = 1
    ntm = 512
    npm = ntm*2
   
    nt=ntm+1
    np=npm

    allocate (diffusion_coef(nt,np,nr))

    diffusion_coef(:,:,:) = 1.0d0

    allocate (coef(2:ntm-1,2:npm-1,5,nr))

    coef(:,:,:,:)=0.

    coef(2,:,:,:) = 1.0d0
    coef(3:ntm-2,:,:,:) = 3.0d0
    coef(ntm-1,:,:,:) = 1.0d0

    allocate(dp(npm))
    dp(:) = 1.0d0

    allocate (dt_i(ntm))
    dt_i(:) = 1.0d0

    allocate (x(ntm,npm,nr))
    allocate (y(ntm,npm,nr))

    x(:,:,:) = 1.0d0

    x(1,:,:) = 2.0d0
    x(2,:,:) = 4.0d0

    x(ntm,:,:) = 2.0d0
    x(ntm-1,:,:) = 4.0d0

    y(:,:,:) = zero

!$omp target enter data map(to:x,y,coef,diffusion_coef,dp,dt_i)

    do concurrent (i=1:nr)
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;
                           + diffusion_coef(2 ,k,i)) &amp;amp;
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;
                           + diffusion_coef(nt ,k,i)) &amp;amp;
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

! Reset y.

    y(:,:,:) = zero

!$omp target enter data map(to:y)

    do i=1,nr
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;
                           + diffusion_coef(2 ,k,i)) &amp;amp;
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;
                           + diffusion_coef(nt ,k,i)) &amp;amp;
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

!$omp target exit data map(delete:x,coef,diffusion_coef,dp,dt_i)
    deallocate(coef, x, y, dp, dt_i, diffusion_coef)

end program dc_gpu_bug1&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 15 Dec 2023 02:38:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554208#M169991</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-15T02:38:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554416#M170008</link>
      <description>&lt;P&gt;I will open a bug report.&amp;nbsp; Thank you for the reproducer!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;ron&lt;/P&gt;</description>
      <pubDate>Fri, 15 Dec 2023 16:32:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554416#M170008</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-15T16:32:34Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</link>
      <description>&lt;P&gt;I've done some testing with your reproducer using an early version of the next ifx compiler. It prints the correct answers on the CPU when compiled with -qopenmp and without.&lt;/P&gt;&lt;P&gt;But with the Intel GPU I don't get the right answers. I'm investigating.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 02 Jan 2024 22:26:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-02T22:26:01Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563396#M170414</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&amp;gt;&amp;gt; On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Please use -qopenmp with ifx. -fopenmp is deprecated and behaves differently from -qopenmp.&lt;/SPAN&gt;&lt;/P&gt;
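&lt;P&gt;&lt;SPAN&gt;For example, using the CPU flags posted earlier in this thread (filename illustrative):&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp reproducer.f90 -o test_qopenmp&lt;/LI-CODE&gt;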
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:38:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563396#M170414</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-16T23:38:02Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563400#M170415</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It still segfaults with -qopenmp:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="none"&gt;EDGE_GPU_INTEL: ~/bug1 $ ifx -v
ifx version 2024.0.2
EDGE_GPU_INTEL: ~/bug1 $ ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp dc_gpu_bug1_v2.f90 -o test_qopenmp
EDGE_GPU_INTEL: ~/bug1 $ ./test_qopenmp
4088.00000000000 0.000000000000000E+000
4088.00000000000 4088.00000000000
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F9221A54DB0 Unknown Unknown Unknown
libiomp5.so 00007F9221F7236C Unknown Unknown Unknown
libiomp5.so 00007F9221F7233A Unknown Unknown Unknown
libiomp5.so 00007F9221F743DF Unknown Unknown Unknown
test_qopenmp 000000000040D7F8 Unknown Unknown Unknown
test_qopenmp 00000000004076A0 Unknown Unknown Unknown
test_qopenmp 000000000040525D Unknown Unknown Unknown
libc.so.6 00007F9221A3FEB0 Unknown Unknown Unknown
libc.so.6 00007F9221A3FF60 __libc_start_main Unknown Unknown
test_qopenmp 0000000000405175 Unknown Unknown Unknown&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Ron&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:49:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563400#M170415</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2024-01-16T23:49:26Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563401#M170416</link>
      <description>&lt;P&gt;Now for the GPU version...&lt;/P&gt;&lt;P&gt;The Fortran OpenMP compiler guys said to add "shared(ntm, npm)" to the outer DO CONCURRENT loop.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;From 11.1.7.5 of the F2018 standard:&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I made that change and ran successfully on the Intel GPU.&lt;/P&gt;&lt;P&gt;My compiler options: ifx -what -fopenmp-target-do-concurrent -qopenmp -fopenmp-targets=spir64&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:58:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563401#M170416</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-16T23:58:33Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563402#M170417</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This works!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, I think it should work without the "shared".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ntm and npm are scalars which according to the spec are assumed to be local/private here as they do not have the SAVE attribute.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This should be fine as long as they are treated as "firstprivate", in which case the inner DC loop should have correct values.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It seems the compiler is not handling this correctly since the original code reproducer works fine for GCC and NV (with NV working for both CPU and GPU).&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unless these other compilers are assuming/helping more than the spec requires?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:10:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563402#M170417</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2024-01-17T00:10:57Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563404#M170418</link>
      <description>&lt;P&gt;I tried your original reproducer with your compiler options, CPU only, on 2 different Linux distros. One ran OK; the other got a segmentation fault.&lt;/P&gt;
&lt;P&gt;However, the version of your reproducer with the added "shared(ntm, npm)" on the outer DO CONCURRENT compiled with your compiler options ran ok on CPU on both distros.&lt;/P&gt;
&lt;P&gt;I'm using&amp;nbsp;ifx (IFX) 2024.0.0 20231017.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:16:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563404#M170418</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-17T00:16:53Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563405#M170419</link>
      <description>&lt;P&gt;I don't speak "standard" very well. There are others on this Forum who do. I'll let them comment.&lt;/P&gt;&lt;P&gt;I wonder how this statement in the F2018 standard translates to nested DO CONCURRENT?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:23:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563405#M170419</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-17T00:23:02Z</dc:date>
    </item>
  </channel>
</rss>

