Intel® Fortran Compiler

Do Concurrent offload getting incorrect results

caplanr
New Contributor I

Hi,

 

I have a code that is using "do concurrent" for offload to Intel GPUs.

 

The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):

 

do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

However, if I modify the code to make the outermost loop sequential, it does yield the correct result:


do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

 

It seems the compiler does not like having the reduction loop inside a DC loop (or maybe just having DC loops nested within DC loops?).

 

The NVIDIA compiler handles this by parallelizing the outer loop across blocks and the two inner loops across threads:

7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x
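
(Roughly speaking, that mapping corresponds to the loop structure sketched below with explicit OpenMP offload directives. This is only an illustration of the scheme nvfortran reports, not code emitted by either compiler:)

!$omp target teams distribute private(fn2_fn1,fs2_fs1)
do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  ! reduction loop runs across the threads of each team
  !$omp parallel do reduction(+:fn2_fn1,fs2_fs1)
  do k=2,npm-1
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  ! second inner loop also runs across threads
  !$omp parallel do
  do k=1,npm
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo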

 

I wanted to bring this to your attention. For now, I may make the outer loop sequential, as "nr" is often small.

 -- Ron

 

 

23 Replies
Barbara_P_Intel
Employee

What version of ifx are you using? A new version, 2024.0.0, was released a couple of weeks ago as part of the HPC Toolkit.

Do you have a complete reproducer including the output you expect? 

What compiler options are you using?

Thanks! That info will help with the triage.

 

caplanr
New Contributor I

Hi,

 

I am using the latest 24.0:

 

EDGE_GPU_INTEL: ~ $ ifx -v
ifx version 2024.0.0

 

The compiler options I am using are:

 

FC = mpif90 -f90=ifx

FFLAGS = -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc" 

 

(the same issue happens on the MAX GPU with "-device pvc")

 

I will be releasing the full code on github in a few days at which point I can post it here.

To reproduce, you would need to replace that small section with the first version I posted.

Then, you can run the testsuite and see that the tests fail, whereas with the revised code (the one on git) they succeed.

 

I will post back here when the code is released, with details on how to test it.

 

 -- Ron

 

 

 

Barbara_P_Intel
Employee

I should have been more specific about the reproducer. A small reproducer is best: one that we can compile and run quickly, preferably without MPI.

I meant "complete" in that you only supplied some loops.

caplanr
New Contributor I

Hi,

 

It should be straightforward to create a reproducer using the two versions of the loops I sent (I do not have time to make one currently).

 

Otherwise, when the code comes out tomorrow, it can be used to reproduce the error as described above.

 

 -- Ron

andrew_4619
Honored Contributor III

Should be straightforward..... People can spend a lot of time trying to figure out what the 'obvious' but missing vital detail is. And on a personal level, I will say the exercise of making a simple reproducer has often thrown up additional details. It is better when you take the time to make a reproducer: the support staff will show much more interest, and we all then get some resolution and a better product!

JohnNichols
Valued Contributor III

"I do not have time to make one currently." Sad.

As my old workmate used to say, there are 3 work days in 24 hours; I would use the spare one. Of course, after 3 days of working you can tend to see pink elephants, but as long as you realize they do not exist, you are still sane.

Courtesy never hurt anyone, although interestingly, this is the second "challenging" post this afternoon.  

A lot of the people who help here do it for free, and most would not ask a question whose answer can be found with a simple search.

In essence, RTFM. 

Ron_Green
Moderator

Assuming fn2_fn1 and fs2_fs1 are scalars, nothing jumps out as being obviously problematic.  If they are scalars, we will assume they are LOCAL_INIT.  Otherwise, we will assume they are SHARED, which will cause problems.

 

You could try adding locality specs. It not only helps the compiler; someone 30 years from now who inherits your code will also understand your data locality intent.
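
For illustration only, here is a sketch of the loop nest with the locality spelled out explicitly, assuming the accumulators are scalars and that x, y, diffusion_coef, dp, dt_i, and the extents come from the enclosing scope (adjust to your actual declarations):

do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &
                       shared(x,y,diffusion_coef,dp,dt_i,nt,ntm,npm)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo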

 

Finally, have you tried this without AOT (i.e., without -Xopenmp-target-backend)?

caplanr
New Contributor I

Hi,

 

Yes they are scalars.

 

I have not used locality specifiers because several compilers that we use have not supported them, and we need to maintain compatibility.

Also, the default behavior of every compiler that supports DC is the exact locality behavior we require/expect, so adding them would simply be redundant.

As for 30 years from now, I thank you for your suggestion, but the style guide for our code is clear that we do not use arrays without indicating that they are arrays, either through indexing or, when using array syntax, by writing a(:). Therefore, as we provide documentation, it will be known that they are scalars.

 

"Finally, have you tried this without AOT ( without Xopenmp-target-backend )?"

Yes - the code works fine on the CPU on all compilers.

 

I must say that so far none of these responses have been helpful, and being cursed at by a fellow user is not professional or useful.

 

I was simply trying to help out by pointing out a bug in the compiler for code that works for GPU offload with other compilers but was failing for IFX in a bad "wrong answer" way.   

 

I may refrain from posting in this forum and rely on my direct Intel contacts from now on.

 

That said, I can provide a simple reproducer when I am back from travel (as it seems you require that we provide that now) if you are still interested in fixing this bug.  

 

 -- Ron

Ron_Green
Moderator

I did not mean to imply that this is not a bug; I think it is a bug. Filing a bug report requires that I find an hour to cook up a test case.

The -Xopenmp-target-backend option passes the offload code to the device compiler and creates a device-specific binary. If you leave that option off, we create SPIR-V code that is JIT compiled by that same device compiler at runtime. There are often differences between JIT and AOT code. The error you are seeing is most likely in the IGC compiler used by our Fortran compiler to create the device code, or to JIT it at runtime; I do think there is a problem in IGC codegen. That it runs on the CPU makes me fairly sure our Fortran compiler is "doing the right thing" with the loop nest. The AOT case is probably doing a loop optimization "trick" that has gone bad, and JIT may not be as aggressive in loop transforms. It's worth a test.
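
Concretely (a sketch reusing the FFLAGS quoted earlier, minus the MPI wrapper; "test.f90" is a placeholder file name), the two builds would differ only in the backend option:

AOT (device binary built ahead of time):
  ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc" test.f90

JIT (SPIR-V embedded, compiled for the device at runtime):
  ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 test.f90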

caplanr
New Contributor I

Hi,

 

Here is a reproducer.

 

It should print the same pairs of numbers each time.

 

On the GPU it outputs the wrong answer for the loop with embedded DC loops.

On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.

On the CPU in serial mode, it produces correct results.

 

 

program dc_gpu_bug1

    implicit none

    integer                                         :: i, j, k, nr,npm,nt,ntm,np
    double precision                                :: fn2_fn1, fs2_fs1
    double precision, dimension(:,:,:), allocatable :: diffusion_coef, x, y
    double precision, dimension(:,:,:,:), allocatable :: coef
    double precision, dimension(:),     allocatable :: dp,dt_i
    double precision, parameter :: zero=0.0
    double precision, parameter :: pi_i=1.0d0

    nr = 1
    ntm = 512
    npm = ntm*2
   
    nt=ntm+1
    np=npm

    allocate (diffusion_coef(nt,np,nr))

    diffusion_coef(:,:,:) = 1.0d0

    allocate (coef(2:ntm-1,2:npm-1,5,nr))

    coef(:,:,:,:)=0.

    coef(2,:,:,:) = 1.0d0
    coef(3:ntm-2,:,:,:) = 3.0d0
    coef(ntm-1,:,:,:) = 1.0d0

    allocate(dp(npm))
    dp(:) = 1.0d0

    allocate (dt_i(ntm))
    dt_i(:) = 1.0d0

    allocate (x(ntm,npm,nr))
    allocate (y(ntm,npm,nr))

    x(:,:,:) = 1.0d0

    x(1,:,:) = 2.0d0
    x(2,:,:) = 4.0d0

    x(ntm,:,:) = 2.0d0
    x(ntm-1,:,:) = 4.0d0

    y(:,:,:) = zero

!$omp target enter data map(to:x,y,coef,diffusion_coef,dp,dt_i)

    do concurrent (i=1:nr)
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                           + diffusion_coef(2 ,k,i)) &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                           + diffusion_coef(nt ,k,i)) &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

! Reset y.

    y(:,:,:) = zero

!$omp target enter data map(to:y)

    do i=1,nr
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                           + diffusion_coef(2 ,k,i)) &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                           + diffusion_coef(nt ,k,i)) &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

!$omp target exit data map(delete:x,coef,diffusion_coef,dp,dt_i)
    deallocate(coef, x, y, dp, dt_i, diffusion_coef)

end program dc_gpu_bug1
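
(For reference, with these initial values both printed pairs should be 4088.0: each reduction accumulates (1.0+1.0)*(4.0-2.0)*1.0 = 4.0 over the 1022 values of k from 2 to npm-1, and the y assignments just multiply that sum by 1.0.)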
caplanr
New Contributor I

Hi,

 

Like I said, I can send you a reproducer next week.

 

Otherwise, the code this came from is now on github: (github.com/predsci/hipft).

If you find the loop in question and replace it with the first example I posted (the one with nested DC loops), running the testsuite will show that the test fails.

The repo provides build scripts for IFX for both CPU and GPU offload.

 

 -  Ron

Ron_Green
Moderator

I will open a bug report.  Thank you for the reproducer!

 

ron

Barbara_P_Intel
Employee

I've done some testing with your reproducer with an early version of the next ifx compiler. It prints the correct answers on CPU when compiled with -qopenmp and without.

But with the Intel GPU I don't get the right answers. I'm investigating.



Barbara_P_Intel
Employee

>> On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.

Please use -qopenmp with ifx. -fopenmp is deprecated and behaves differently from -qopenmp.

 

caplanr
New Contributor I

Hi,

 

It still segfaults with -qopenmp:

 

EDGE_GPU_INTEL: ~/bug1 $ ifx -v
ifx version 2024.0.2
EDGE_GPU_INTEL: ~/bug1 $ ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp dc_gpu_bug1_v2.f90 -o test_qopenmp
EDGE_GPU_INTEL: ~/bug1 $ ./test_qopenmp
4088.00000000000 0.000000000000000E+000
4088.00000000000 4088.00000000000
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F9221A54DB0 Unknown Unknown Unknown
libiomp5.so 00007F9221F7236C Unknown Unknown Unknown
libiomp5.so 00007F9221F7233A Unknown Unknown Unknown
libiomp5.so 00007F9221F743DF Unknown Unknown Unknown
test_qopenmp 000000000040D7F8 Unknown Unknown Unknown
test_qopenmp 00000000004076A0 Unknown Unknown Unknown
test_qopenmp 000000000040525D Unknown Unknown Unknown
libc.so.6 00007F9221A3FEB0 Unknown Unknown Unknown
libc.so.6 00007F9221A3FF60 __libc_start_main Unknown Unknown
test_qopenmp 0000000000405175 Unknown Unknown Unknown

 

- Ron

Barbara_P_Intel
Employee

I tried with your original reproducer and your compiler options on CPU only on 2 different Linux distros. One ran ok, the other got a segmentation fault.

However, the version of your reproducer with the added "shared(ntm, npm)" on the outer DO CONCURRENT compiled with your compiler options ran ok on CPU on both distros.

I'm using ifx (IFX) 2024.0.0 20231017. 

 

Barbara_P_Intel
Employee

Now for the GPU version...

The Fortran OpenMP compiler guys said to add "shared(ntm, npm)" to the outer DO CONCURRENT loop.

From 11.1.7.5, paragraph 1, of the F2018 standard:

"The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality."

I made that change and ran successfully on Intel GPU.
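
In other words, only the outer loop header changes (sketch; the loop body is exactly as in the reproducer):

do concurrent (i=1:nr) shared(ntm,npm)
  ! ... body unchanged from the reproducer ...
enddo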

My compiler options: ifx -what -fopenmp-target-do-concurrent -qopenmp -fopenmp-targets=spir64



caplanr
New Contributor I

Hi,

 

This works!

 

However, I think it should work without the "shared".

 

ntm and npm are scalars, which, according to the spec, are assumed to be local/private here since they do not have the SAVE attribute.

 

This should be fine as long as they are treated as "firstprivate", in which case the inner DC loop should have correct values.

 

It seems the compiler is not handling this correctly, since the original reproducer works fine with GCC and NV (with NV working on both CPU and GPU).

Unless these other compilers are assuming/helping more than the spec requires?

 

 - Ron

 

Barbara_P_Intel
Employee

I don't speak "standard" very well. There are others on this Forum who do. I'll let them comment.

I wonder how this statement in the F2018 standard translates to nested DO CONCURRENT?



Henry_G_Intel
Employee

I don't speak "standard" very well either, so I'll rely on our old pal Michael Klemm (CEO, OpenMP) to explain:

"OpenMP PRIVATE and FIRSTPRIVATE are different from DO CONCURRENT LOCAL and LOCAL_INIT. Like mentioned before OpenMP data-sharing attributes privatize per thread, while DO CONCURRENT privatizes per iteration. OpenMP’s specification is clear about that a thread receives such a private copy (with initialization in case of FIRSTPRIVATE). DO CONCURRENT on the other hand does not have the notion of threads. It just says the loop can be executed in any order. So, you cannot assume that there are even threads (and how many of them) or some other form of concurrency (like SIMD). So, LOCAL_INIT has to re-init the variable for every iteration as this is the only way to have a well-defined definition of the variable in the loop body for all different concurrent execution mechanisms."

This quote is from a lengthy, ongoing discussion about DO CONCURRENT locality specifiers on fortran-lang. It explains why adding the "shared(ntm, npm)" clause to the outer DO CONCURRENT loop fixes the example code.

Even though OpenMP firstprivate and DO CONCURRENT local_init are not the same, adding "local_init(ntm, npm)" to the outer DO CONCURRENT loop also works because the variables are reinitialized to the correct values in each iteration.
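
Schematically, that variant only changes the locality spec on the outer loop header (sketch; body unchanged):

do concurrent (i=1:nr) local_init(ntm,npm)
  ! ... same body as the reproducer ...
enddo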
