Intel® Fortran Compiler

Do Concurrent offload getting incorrect results

caplanr
New Contributor I

Hi,

 

I have a code that is using "do concurrent" for offload to Intel GPUs.

 

The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):

 

do concurrent (i=1:nr)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

However, if I modify the code to make the outermost loop sequential, it does yield the correct result:


do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                       + diffusion_coef(2 ,k,i)) &
                     * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                       + diffusion_coef(nt ,k,i)) &
                     * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo

 

It seems the compiler does not like having the reduction loop inside a DC loop (or maybe just having DC loops nested within DC loops?).

 

The NVIDIA compiler handles this by parallelizing the outer loop across blocks and the two inner loops across threads:

7367, Generating implicit private(fn2_fn1,fs2_fs1)
Generating NVIDIA GPU code
7367, Loop parallelized across CUDA thread blocks ! blockidx%x
7370, Loop parallelized across CUDA threads(128) ! threadidx%x
Generating reduction(+:fn2_fn1,fs2_fs1)
7378, Loop parallelized across CUDA threads(128) ! threadidx%x
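
(Roughly speaking, that mapping corresponds to the loop structure sketched below with explicit OpenMP offload directives. This is only an illustration of the scheme nvfortran reports, not code emitted by either compiler:)

!$omp target teams distribute private(fn2_fn1,fs2_fs1)
do i=1,nr
  fn2_fn1 = zero
  fs2_fs1 = zero
  ! reduction loop runs across the threads of each team
  !$omp parallel do reduction(+:fn2_fn1,fs2_fs1)
  do k=2,npm-1
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  ! second inner loop also runs across threads
  !$omp parallel do
  do k=1,npm
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo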

 

I wanted to bring this to your attention. For now, I may make the outer loop sequential, as "nr" is often small.

 -- Ron

 

 

23 Replies
Barbara_P_Intel
Employee

What version of ifx are you using? A new version, 2024.0.0, was released a couple of weeks ago as part of the HPC Toolkit.

Do you have a complete reproducer including the output you expect? 

What compiler options are you using?

Thanks! That info will help with the triage.

 

caplanr
New Contributor I

Hi,

 

I am using the latest 24.0:

 

EDGE_GPU_INTEL: ~ $ ifx -v
ifx version 2024.0.0

 

The compiler options I am using are:

 

FC = mpif90 -f90=ifx

FFLAGS = -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc" 

 

(the same issue happens on the MAX GPU with "-device pvc")

 

I will be releasing the full code on github in a few days at which point I can post it here.

To reproduce, you would need to replace that small section with the first version I posted.

Then, you can run the testsuite and see that the tests fail, whereas with the revised code (the one on git) they succeed.

 

I will post back here when the code is released, with details on how to test it.

 

 -- Ron

 

 

 

Barbara_P_Intel
Employee

I should have been more specific about the reproducer. A small reproducer is best: one that we can compile and run quickly, preferably without MPI.

I meant "complete" in that you only supplied some loops.

caplanr
New Contributor I

Hi,

 

It should be straightforward to create a reproducer using the two versions of the loops I sent (I do not have time to make one currently).

 

Otherwise, when the code comes out tomorrow, it can be used to reproduce the error as described above.

 

 -- Ron

andrew_4619
Honored Contributor III

Should be straightforward..... People can spend a lot of time trying to figure out what the 'obvious' but missing vital detail is. And on a personal level, I will say the exercise of making a simple reproducer has often thrown up additional details. It is better when you take the time to make a reproducer: the support staff will show much more interest, and we all then get some resolution and a better product!

JohnNichols
Valued Contributor III

"I do not have time to make one currently." Sad.

As my old workmate used to say, there are 3 work days in 24 hours; I would use the spare one. Of course, after 3 days of working you can tend to see pink elephants, but as long as you realize they do not exist, you are still sane.

Courtesy never hurt anyone, although interestingly, this is the second "challenging" post this afternoon.  

A lot of the people who help here do it for free, and most would not ask a question whose answer can be found with a simple search.

In essence, RTFM. 

Ron_Green
Moderator

Assuming fn2_fn1 and fs2_fs1 are scalars, nothing jumps out as being obviously problematic.  If they are scalars, we will assume they are LOCAL_INIT.  Otherwise, we will assume they are SHARED, which will cause problems.

 

You could try adding locality specs. It not only helps the compiler; someone 30 years from now who inherits your code will also understand your data locality intent.
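
For illustration only, here is a sketch of the loop nest with the locality spelled out explicitly, assuming the accumulators are scalars and that x, y, diffusion_coef, dp, dt_i, and the extents come from the enclosing scope (adjust to your actual declarations):

do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &
                       shared(x,y,diffusion_coef,dp,dt_i,nt,ntm,npm)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo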

 

Finally, have you tried this without AOT (i.e., without -Xopenmp-target-backend)?

caplanr
New Contributor I

Hi,

 

Yes they are scalars.

 

I have not used locality specifiers because several compilers that we use have not supported them, and we need to maintain compatibility.

Also, the default behavior of every compiler that supports DC is the exact locality behavior we require/expect, so adding them would simply be redundant.

As for 30 years from now, I thank you for your suggestion, but the style guide for our code is clear that we do not use arrays without indicating that they are arrays, either through indexing or, when using array syntax, by writing a(:). Therefore, as we provide documentation, it will be known that they are scalars.

 

"Finally, have you tried this without AOT ( without Xopenmp-target-backend )?"

Yes - the code works fine on the CPU on all compilers.

 

I must say that so far none of these responses have been helpful, and being cursed at by a fellow user is not professional or useful.

 

I was simply trying to help out by pointing out a bug in the compiler for code that works for GPU offload with other compilers but was failing for IFX in a bad "wrong answer" way.   

 

I may refrain from posting in this forum and rely on my direct Intel contacts from now on.

 

That said, I can provide a simple reproducer when I am back from travel (as it seems you require that we provide that now) if you are still interested in fixing this bug.  

 

 -- Ron

Ron_Green
Moderator

I did not mean to imply that this is not a bug; I think it is a bug. Filing a bug report requires that I find an hour to cook up a test case.

The -Xopenmp-target-backend option passes the offload code to the device compiler and creates a device-specific binary. If you leave that option off, we create SPIR-V code that is JIT compiled by that same device compiler at runtime. There are often differences between JIT and AOT code. The error you are seeing is most likely in the IGC compiler used by our Fortran compiler to create the device code, or to JIT it at runtime; I do think there is a problem in IGC codegen. That it runs on the CPU makes me fairly sure our Fortran compiler is "doing the right thing" with the loop nest. The AOT case is probably doing a loop optimization "trick" that has gone bad, and JIT may not be as aggressive in loop transforms. It's worth a test.
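
Concretely (a sketch reusing the FFLAGS quoted earlier, minus the MPI wrapper; "test.f90" is a placeholder file name), the two builds would differ only in the backend option:

AOT (device binary built ahead of time):
  ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc" test.f90

JIT (SPIR-V embedded, compiled for the device at runtime):
  ifx -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 test.f90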

caplanr
New Contributor I

Hi,

 

Here is a reproducer.

 

It should print the same pairs of numbers each time.

 

On the GPU it outputs the wrong answer for the loop with embedded DC loops.

On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.

On the CPU in serial mode, it produces correct results.

 

 

program dc_gpu_bug1

    implicit none

    integer                                         :: i, j, k, nr,npm,nt,ntm,np
    double precision                                :: fn2_fn1, fs2_fs1
    double precision, dimension(:,:,:), allocatable :: diffusion_coef, x, y
    double precision, dimension(:,:,:,:), allocatable :: coef
    double precision, dimension(:),     allocatable :: dp,dt_i
    double precision, parameter :: zero=0.0
    double precision, parameter :: pi_i=1.0d0

    nr = 1
    ntm = 512
    npm = ntm*2
   
    nt=ntm+1
    np=npm

    allocate (diffusion_coef(nt,np,nr))

    diffusion_coef(:,:,:) = 1.0d0

    allocate (coef(2:ntm-1,2:npm-1,5,nr))

    coef(:,:,:,:)=0.

    coef(2,:,:,:) = 1.0d0
    coef(3:ntm-2,:,:,:) = 3.0d0
    coef(ntm-1,:,:,:) = 1.0d0

    allocate(dp(npm))
    dp(:) = 1.0d0

    allocate (dt_i(ntm))
    dt_i(:) = 1.0d0

    allocate (x(ntm,npm,nr))
    allocate (y(ntm,npm,nr))

    x(:,:,:) = 1.0d0

    x(1,:,:) = 2.0d0
    x(2,:,:) = 4.0d0

    x(ntm,:,:) = 2.0d0
    x(ntm-1,:,:) = 4.0d0

    y(:,:,:) = zero

!$omp target enter data map(to:x,y,coef,diffusion_coef,dp,dt_i)

    do concurrent (i=1:nr)
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                           + diffusion_coef(2 ,k,i)) &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                           + diffusion_coef(nt ,k,i)) &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

! Reset y.

    y(:,:,:) = zero

!$omp target enter data map(to:y)

    do i=1,nr
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &
                           + diffusion_coef(2 ,k,i)) &
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &
                           + diffusion_coef(nt ,k,i)) &
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

!$omp target exit data map(delete:x,coef,diffusion_coef,dp,dt_i)
    deallocate(coef, x, y, dp, dt_i, diffusion_coef)

end program dc_gpu_bug1
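
(For reference, with these initial values both printed pairs should be 4088.0: each reduction accumulates (1.0+1.0)*(4.0-2.0)*1.0 = 4.0 over the 1022 values of k from 2 to npm-1, and the y assignments just multiply that sum by 1.0.)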
caplanr
New Contributor I

Hi,

 

Like I said, I can send you a reproducer next week.

 

Otherwise, the code this came from is now on github: (github.com/predsci/hipft).

If you find the loop in question and replace it with the first example I posted (the one with nested DC loops), running the testsuite will show that the test fails.

The repo provides build scripts for IFX for both CPU and GPU offload.

 

 -  Ron

Ron_Green
Moderator

I will open a bug report.  Thank you for the reproducer!

 

ron

Barbara_P_Intel
Employee

I've done some testing with your reproducer with an early version of the next ifx compiler. It prints the correct answers on CPU when compiled with -qopenmp and without.

But with the Intel GPU I don't get the right answers. I'm investigating.



Barbara_P_Intel
Employee

>> On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.

Please use -qopenmp with ifx. -fopenmp is deprecated and behaves differently from -qopenmp.

 

caplanr
New Contributor I

Hi,

 

It still segfaults with -qopenmp:

 

EDGE_GPU_INTEL: ~/bug1 $ ifx -v
ifx version 2024.0.2
EDGE_GPU_INTEL: ~/bug1 $ ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp dc_gpu_bug1_v2.f90 -o test_qopenmp
EDGE_GPU_INTEL: ~/bug1 $ ./test_qopenmp
4088.00000000000 0.000000000000000E+000
4088.00000000000 4088.00000000000
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F9221A54DB0 Unknown Unknown Unknown
libiomp5.so 00007F9221F7236C Unknown Unknown Unknown
libiomp5.so 00007F9221F7233A Unknown Unknown Unknown
libiomp5.so 00007F9221F743DF Unknown Unknown Unknown
test_qopenmp 000000000040D7F8 Unknown Unknown Unknown
test_qopenmp 00000000004076A0 Unknown Unknown Unknown
test_qopenmp 000000000040525D Unknown Unknown Unknown
libc.so.6 00007F9221A3FEB0 Unknown Unknown Unknown
libc.so.6 00007F9221A3FF60 __libc_start_main Unknown Unknown
test_qopenmp 0000000000405175 Unknown Unknown Unknown

 

- Ron

Barbara_P_Intel
Employee

I tried with your original reproducer and your compiler options on CPU only on 2 different Linux distros. One ran ok, the other got a segmentation fault.

However, the version of your reproducer with the added "shared(ntm, npm)" on the outer DO CONCURRENT compiled with your compiler options ran ok on CPU on both distros.

I'm using ifx (IFX) 2024.0.0 20231017. 

 

Barbara_P_Intel
Employee

Now for the GPU version...

The Fortran OpenMP compiler guys said to add "shared(ntm, npm)" to the outer DO CONCURRENT loop.

From 11.1.7.5, paragraph 1, of the F2018 standard:

"The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality."

I made that change and ran successfully on Intel GPU.
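
In other words, only the outer loop header changes (sketch; the loop body is exactly as in the reproducer):

do concurrent (i=1:nr) shared(ntm,npm)
  ! ... body unchanged from the reproducer ...
enddo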

My compiler options: ifx -what -fopenmp-target-do-concurrent -qopenmp -fopenmp-targets=spir64



caplanr
New Contributor I

Hi,

 

This works!

 

However, I think it should work without the "shared".

 

ntm and npm are scalars, which, according to the spec, are assumed to be local/private here since they do not have the SAVE attribute.

 

This should be fine as long as they are treated as "firstprivate", in which case the inner DC loop should have correct values.

 

It seems the compiler is not handling this correctly, since the original reproducer works fine with GCC and NV (with NV working on both CPU and GPU).

Unless these other compilers are assuming/helping more than the spec requires?

 

 - Ron

 

Barbara_P_Intel
Employee

I don't speak "standard" very well. There are others on this Forum who do. I'll let them comment.

I wonder how this statement in the F2018 standard translates to nested DO CONCURRENT?



Henry_G_Intel
Employee

I don't speak "standard" very well either, so I'll rely on our old pal Michael Klemm (CEO, OpenMP) to explain:

"OpenMP PRIVATE and FIRSTPRIVATE are different from DO CONCURRENT LOCAL and LOCAL_INIT. Like mentioned before OpenMP data-sharing attributes privatize per thread, while DO CONCURRENT privatizes per iteration. OpenMP’s specification is clear about that a thread receives such a private copy (with initialization in case of FIRSTPRIVATE). DO CONCURRENT on the other hand does not have the notion of threads. It just says the loop can be executed in any order. So, you cannot assume that there are even threads (and how many of them) or some other form of concurrency (like SIMD). So, LOCAL_INIT has to re-init the variable for every iteration as this is the only way to have a well-defined definition of the variable in the loop body for all different concurrent execution mechanisms."

This quote is from a lengthy, ongoing discussion about DO CONCURRENT locality specifiers on fortran-lang. It explains why adding the "shared(ntm, npm)" clause to the outer DO CONCURRENT loop fixes the example code.

Even though OpenMP firstprivate and DO CONCURRENT local_init are not the same, adding "local_init(ntm, npm)" to the outer DO CONCURRENT loop also works because the variables are reinitialized to the correct values in each iteration.
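
Schematically, that variant only changes the locality spec on the outer loop header (sketch; body unchanged):

do concurrent (i=1:nr) local_init(ntm,npm)
  ! ... same body as the reproducer ...
enddo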
