Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP 5 TARGET and reduction question for code to be run on host system

Harald1
Novice

I've been trying for a while to figure out why this doesn't work for me.  Maybe somebody else can hit me on the head or help with debugging...

Code:

program p5
  implicit none
  integer           :: i, n = 1000
  real              :: s
  real, allocatable :: a(:)
  allocate (a(n))

  do i = 1, n
     a(i) = 2*i
  end do

  s = 0.

!$omp target data map(a,s)
!$omp target teams reduction(+:s) map(s)
  do i = 1, n
     s = s + a(i)
  end do
!$omp end target teams
!$omp end target data

  print *, a(1),a(n)
  print *, s
end program

This works for me with nvidia (nvfortran -mp=multicore) and gcc of OpenSuse Leap 15.2 (gfortran-10 -fopenmp -foffload=nvptx-none), i.e. I get the expected result.

However, trying:

% ifort -fopenmp -qopenmp-offload=host

I always get zero for the sum.  I am using:

Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1.2 Build 20201208_000000

I believe I have installed all needed libraries; I've also tried setting

export LIBOMPTARGET_DEBUG=1

which yields nothing.

 

13 Replies
Barbara_P_Intel
Moderator

The ifort (Classic) compiler does not support OMP TARGET directives.  Use ifx instead.  

This Getting Started article is mostly about C++, but there is a bit about Fortran.  The compiler options used there will get you started.

By default, a program built with those compiler options will offload to an Intel GPU.

 

Harald1
Novice

OK, that is surprising.  ifort -help openmp says:

-qopenmp-offload[=<kind>]
          Enables OpenMP* offloading compilation for TARGET directives.
          Enabled by default with -qopenmp.
          Use -qno-openmp-offload to disable.
          Specify kind to specify the default device for TARGET directives.
            host - allow target code to run on host system while still doing
                   the outlining for offload

ifx does give the right result for the testcase.

I tried some of the features of the mentioned documentation, to figure out if the code runs on the CPU or on the GPU.

Setting "export LIBOMPTARGET_DEBUG=1" does not really help, neither with "-fopenmp" nor "-fiopenmp".  This changes if I add "-fopenmp-targets=spir64".

(It could be made clearer that "-fopenmp-targets=spir64" requires "-fiopenmp".  If forgotten, the error message refers to "-fopenmp".  Adding "-debug offload" crashes ifx.)

The combination of "-fiopenmp -fopenmp-targets=spir64" and "export OMP_TARGET_OFFLOAD=MANDATORY" leads on my system to:

Libomptarget --> Checking user-specified plugin 'libomptarget.rtl.opencl.so'...
Libomptarget --> Loading library 'libomptarget.rtl.opencl.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.opencl.so': libOpenCL.so.1: cannot open shared object file: No such file or directory!

[...]

Libomptarget fatal error 1: failure of target construct while offloading is mandatory

or random crashes or a wrong result, depending on the setting of LIBOMPTARGET_DEBUG.  Interestingly, there is

/opt/intel/oneapi/compiler/2021.1.1/linux/lib/libOpenCL.so.1

I hope that is not a problem of my system.  Or could that have happened while upgrading from 2021.1.1 to 2021.1.2?

 

Harald1
Novice

I'm still struggling with the data sharing clauses to make the testcase work with ifx.

The code quoted has:

!$omp target teams reduction(+:s) map(s)

although I believe it should already work with

!$omp target teams reduction(+:s)

since the mapping was declared in the enclosing !$omp target data.  Nevertheless the code gives a wrong result with ifx (but not with gfortran or nvfortran) if I omit that map.  Interestingly, it does not matter if I use map(from:s) or map(to:s); both seem to help...

Is there something I am missing?

 

Barbara_P_Intel
Moderator

What is your failure?  What version of ifx are you using?

I just ran this and it offloaded successfully. I didn't do the math to determine if the reduction is correct.

ifx (IFORT) 2021.1.2 Beta 20201214
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.

+ ifx -fiopenmp -fopenmp-targets=spir64 p5.F90
+ export LIBOMPTARGET_PROFILE=T,usec
+ LIBOMPTARGET_PROFILE=T,usec
+ a.out
   2.000000       2000.000
  4.8048000E+07
LIBOMPTARGET_PROFILE for OMP DEVICE(0) Intel(R) UHD Graphics 630 [0x3e92], Thread 139877046483840
-- Name: Host Time (usec), Device Time (usec)
-- DataAlloc: 11.682510, 11.682510
-- DataRead: 0.000000, 0.000000
-- DataWrite: 0.953674, 0.953674
-- Kernel#__omp_offloading_803_80042b_MAIN___l15: 5090.951920, 4818.980000
-- ModuleBuild: 2641505.956650, 2641505.956650
-- Total: 2646609.544754, 2646337.572834

 

Harald1
Novice

Just run the code without OpenMP (without -fiopenmp), or without -fopenmp-targets=spir64, and you get:

2.000000 2000.000
1001000.

which is what is expected.

So the result you show is likely wrong, or there is something we are missing.

Barbara_P_Intel
Moderator

I realized after I posted my reply that I should just run it serially. 

!$omp target data map(a,s)
!$omp target teams distribute map(tofrom:s), map(to:n,a), reduction(+:s) 
  do i = 1, n
     s = s + a(i)
  end do
!$omp end target data

This also works

!$omp target teams distribute map(tofrom:s), map(to:n,a), reduction(+:s)
  do i = 1, n
     s = s + a(i)
  end do

I tend to be as specific as I can with OpenMP directives to make the code self-documenting.

Harald1
Novice

It is probably a good idea to make the code self-documenting.

The reason why I split and "nested" the target directives is probably obvious.  As a motivation: with OpenACC I can place data on the device for a whole region e.g. using "!$acc data ..." / "!$acc end data" to avoid unnecessary copying between host and device between successive kernels.  I can even verify that this works at compile time and also at run time.

With OpenMP I am not so sure when the copying happens, and the Intel compiler seems not to be very chatty...  (certainly much less than the Nvidia compiler).

So, as a recommendation: it would be good if the compiler could confirm to the programmer that unnecessary copying between host and device is being avoided.

Thanks,

Harald

Barbara_P_Intel
Moderator

The OpenMP TARGET DATA directive places the data on the target device for the entire region.  I haven't used OpenACC, but the concept is the same.

Check out the OpenMP 4.5.0 Examples document for how to do that sharing with OpenMP.
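As a minimal sketch of that idea (reusing the array setup from the test case at the top of the thread; the second reduction into t is added purely for illustration), a TARGET DATA region can span two kernels so that a is transferred to the device only once:

```fortran
program data_region
  implicit none
  integer :: i
  real    :: s, t
  real    :: a(1000)
  a = [(2.0*i, i = 1, 1000)]
  s = 0.; t = 0.
! a stays resident on the device across both kernels; one transfer
! in, and one transfer out for each of s and t.
!$omp target data map(to:a) map(tofrom:s,t)
!$omp target teams distribute parallel do reduction(+:s) map(tofrom:s)
  do i = 1, 1000
     s = s + a(i)        ! first kernel: plain sum, expected 1001000.
  end do
!$omp target teams distribute parallel do reduction(+:t) map(tofrom:t)
  do i = 1, 1000
     t = t + a(i)*a(i)   ! second kernel reuses the device copy of a
  end do
!$omp end target data
  print *, s, t
end program
```

The explicit map(tofrom:...) on each reduction variable is the defensive form for compilers at the OpenMP 4.5/TR4 level; it is redundant but harmless where OpenMP 5.x semantics apply.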

 

Harald1
Novice

I am starting to understand what I am seeing.

It appears that the _OPENMP macro as well as openmp_version return 201611, which, IIRC, means that ifort/ifx support OpenMP 4.5 TR4, not OpenMP 5.x.

OpenMP 5.1 has the following in "2.21.7 Data-Mapping Attribute Rules, Clauses, and Directives":

"If a list item appears in a reduction, lastprivate or linear clause on a combined target construct then it is treated as if it also appears in a map clause with a map-type of tofrom."

So once Intel supports OpenMP 5.x, the additional map clause for the reduction variable will no longer be necessary.  For the time being, it is...
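A minimal self-contained sketch of that workaround, assuming a compiler at the OpenMP 4.5/TR4 level (program and variable names are made up for illustration):

```fortran
program reduce_map
  implicit none
  integer :: i
  real    :: s
  s = 0.
! Under OpenMP 5.x the reduction clause on a combined target
! construct implies map(tofrom:s); at the 4.5/TR4 level the
! map must be written out explicitly, or s comes back unchanged.
!$omp target teams distribute parallel do reduction(+:s) map(tofrom:s)
  do i = 1, 1000
     s = s + 2*i
  end do
! sum of 2*i for i = 1..1000 is 1001000, matching the test case
  print *, s
end program
```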

JohnNichols
Valued Contributor II
SUBROUTINE SIMPLE(N, A, B)

  INTEGER I, N
  REAL B(N), A(N)

!$OMP PARALLEL DO    ! I is private by default
  DO I=2,N
     B(I) = (A(I) + A(I-1)) / 2.0
  ENDDO
!$OMP END PARALLEL DO

END SUBROUTINE SIMPLE

 

So taking this sample from the suggested manual -- this is really just the use of PARALLEL DO to evaluate a surjective function, or did I miss something?

Harald1
Novice

Yes, you missed the essential keywords TARGET and REDUCTION.  And my question was about how to get the right result with the Intel compiler which seems to fall short of implementing OpenMP 5 here w.r.t. the MAP handling.

 

JohnNichols
Valued Contributor II

My apologies, I was really talking to myself.  The sample program using

subroutine vec_mult(N)
  integer :: i, N
  real :: p(N), v1(N), v2(N)
  call init(v1, v2, N)
!$omp target map(v1,v2,p)
!$omp parallel do
  do i=1,N
     p(i) = v1(i) * v2(i)
  end do
!$omp end target
  call output(p, N)
end subroutine

from the OpenMP Examples manual is also a surjective function: v1 and v2 map to the p space.  If we take the assignment in the loop and make it p(i) = v1(i)*v2(i)*p(i-1), we cannot parallelize it -- I think; we could recurse it.  I am not commenting on your problem, I am just wondering in my mind.  This is probably obvious to those who do parallel programming all the time, but most problems I have look like p(i) = v1(i)*v2(i)*p(i-1).  Thanks for the interesting lesson.

Barbara_P_Intel
Moderator

According to this article, the REDUCTION clause on TEAMS is supported.

I helped another developer with converting OpenACC to OpenMP and learned a few things.

This works, including the reduction, and allows you to run different TEAMS kernels on the same data, eliminating multiple data transfers.

 

program p5
  implicit none
  integer           :: i, n = 1000
  real              :: s, s1
  real, allocatable :: a(:)
  allocate (a(n))

  do i = 1, n
     a(i) = 2*i
  end do

  s = 0.
  s1 = 0.

!$omp target enter data map(to:a,n,s,s1)

!$omp target teams
!$omp distribute parallel do reduction(+:s)
  do i = 1, n
     s = s + a(i)
  end do
!$omp end target teams

!$omp target teams
!$omp distribute parallel do reduction(+:s1)
  do i = 1, n
     s1 = s1 + a(i)
  end do
!$omp end target teams

!$omp target exit data map(from:s,s1)

  print *, a(1),a(n)
  print *, s
  print *, s1
end program

 

 
