I've been trying for a while to figure out why this doesn't work for me. Maybe somebody else could hit me on the head or help with debugging...
Code:
program p5
  implicit none
  integer :: i, n = 1000
  real :: s
  real, allocatable :: a(:)
  allocate (a(n))
  do i = 1, n
    a(i) = 2*i
  end do
  s = 0.
  !$omp target data map(a,s)
  !$omp target teams reduction(+:s) map(s)
  do i = 1, n
    s = s + a(i)
  end do
  !$omp end target teams
  !$omp end target data
  print *, a(1),a(n)
  print *, s
end program
This works for me with NVIDIA (nvfortran -mp=multicore) and with the gcc shipped with openSUSE Leap 15.2 (gfortran-10 -fopenmp -foffload=nvptx-none), i.e. I get the expected result.
However, trying:
% ifort -fopenmp -qopenmp-offload=host
I always get zero for the sum. I am using:
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.1.2 Build 20201208_000000
I believe I have installed all needed libraries. I've also tried to set
export LIBOMPTARGET_DEBUG=1
which yields no output.
The ifort (Classic) compiler does not support OMP TARGET directives. Use ifx instead.
This Getting Started article is mostly about C++, but there is a bit about Fortran. The compiler options used there will get you started.
By default, a program built with those compiler options will offload to an Intel GPU.
OK, that is surprising. ifort -help openmp says:
-qopenmp-offload[=<kind>]
Enables OpenMP* offloading compilation for TARGET directives.
Enabled by default with -qopenmp.
Use -qno-openmp-offload to disable.
Specify kind to specify the default device for TARGET directives.
host - allow target code to run on host system while still doing
the outlining for offload
ifx does give the right result for the testcase.
I tried some of the features of the mentioned documentation, to figure out if the code runs on the CPU or on the GPU.
Setting "export LIBOMPTARGET_DEBUG=1" does not really help, neither with "-fopenmp" nor "-fiopenmp". This changes if I add "-fopenmp-targets=spir64".
(It could be made clearer that "-fopenmp-targets=spir64" requires "-fiopenmp". If forgotten, the error message refers to "-fopenmp". Adding "-debug offload" crashes ifx.)
The combination of "-fiopenmp -fopenmp-targets=spir64" and "export OMP_TARGET_OFFLOAD=MANDATORY" leads on my system to:
Libomptarget --> Checking user-specified plugin 'libomptarget.rtl.opencl.so'...
Libomptarget --> Loading library 'libomptarget.rtl.opencl.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.opencl.so': libOpenCL.so.1: cannot open shared object file: No such file or directory!
[...]
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
or to random crashes or a wrong result, depending on the setting of LIBOMPTARGET_DEBUG. Interestingly, there is
/opt/intel/oneapi/compiler/2021.1.1/linux/lib/libOpenCL.so.1
I hope that is not a problem of my system. Or could that have happened while upgrading from 2021.1.1 to 2021.1.2?
I'm still struggling with the data sharing clauses to make the testcase work with ifx.
The code quoted has:
!$omp target teams reduction(+:s) map(s)
although I believe it should already work with
!$omp target teams reduction(+:s)
since the mapping was declared in the enclosing !$omp target data. Nevertheless, the code gives a wrong result with ifx (but not with gfortran or nvfortran) if I omit that map. Interestingly, it does not matter whether I use map(from:s) or map(to:s); both seem to help...
Is there something I am missing?
What is your failure? What version of ifx are you using?
I just ran this and it offloaded successfully. I didn't do the math to determine if the reduction is correct.
ifx (IFORT) 2021.1.2 Beta 20201214
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.
+ ifx -fiopenmp -fopenmp-targets=spir64 p5.F90
+ export LIBOMPTARGET_PROFILE=T,usec
+ LIBOMPTARGET_PROFILE=T,usec
+ a.out
2.000000 2000.000
4.8048000E+07
LIBOMPTARGET_PROFILE for OMP DEVICE(0) Intel(R) UHD Graphics 630 [0x3e92], Thread 139877046483840
-- Name: Host Time (usec), Device Time (usec)
-- DataAlloc: 11.682510, 11.682510
-- DataRead: 0.000000, 0.000000
-- DataWrite: 0.953674, 0.953674
-- Kernel#__omp_offloading_803_80042b_MAIN___l15: 5090.951920, 4818.980000
-- ModuleBuild: 2641505.956650, 2641505.956650
-- Total: 2646609.544754, 2646337.572834
Just run the code without OpenMP (without -fiopenmp), or without -fopenmp-targets=spir64, and you get:
2.000000 2000.000
1001000.
which is what is expected.
So the result you show is likely wrong, or there is something we are missing.
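For reference, the loop sums 2*i for i = 1..1000, so the expected value is n*(n+1) = 1001000. The 4.8048000E+07 printed above is exactly 48 times that. One possible reading (an assumption, not something the log confirms) is that the runtime launched 48 teams and, since the "target teams" region has no "distribute", every team ran the whole loop before reduction(+:s) added the per-team sums together. The host-only sketch below just mimics that arithmetic; nteams = 48 is a hypothetical value picked to match the output.

```fortran
program teams_without_distribute
  implicit none
  integer, parameter :: n = 1000
  integer, parameter :: nteams = 48   ! hypothetical team count, chosen to match the log
  integer :: i, t
  real :: a(n), team_sum, s

  do i = 1, n
    a(i) = 2*i
  end do

  ! Without "distribute", each team executes the full loop on its own
  ! private copy of the reduction variable ...
  s = 0.
  do t = 1, nteams
    team_sum = 0.
    do i = 1, n
      team_sum = team_sum + a(i)
    end do
    ! ... and the reduction then adds every team's copy together.
    s = s + team_sum
  end do

  print *, 'one team  :', team_sum   ! n*(n+1) = 1001000
  print *, 'all teams :', s          ! nteams * 1001000 = 48048000
end program teams_without_distribute
```

If that reading is right, adding "distribute" so the iterations are split across the teams should bring the sum back to 1001000.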
I realized after I posted my reply that I should just run it serially.
!$omp target data map(a,s)
!$omp target teams distribute map(tofrom:s), map(to:n,a), reduction(+:s)
do i = 1, n
  s = s + a(i)
end do
!$omp end target data
This also works:
!$omp target teams distribute map(tofrom:s), map(to:n,a), reduction(+:s)
do i = 1, n
  s = s + a(i)
end do
I tend to be as specific as I can with OpenMP directives to make the code self-documenting.
It is probably a good idea to make the code self-documenting.
The reason why I split and "nested" the target directives is probably obvious. As a motivation: with OpenACC I can place data on the device for a whole region e.g. using "!$acc data ..." / "!$acc end data" to avoid unnecessary copying between host and device between successive kernels. I can even verify that this works at compile time and also at run time.
With OpenMP I am not so sure when the copying happens, and the Intel compiler seems not to be very chatty... (certainly much less than the Nvidia compiler).
So probably as a recommendation: it would be good to have the compiler assure the programmer whether he/she is doing a good job avoiding unnecessary copying.
Thanks,
Harald
The OpenMP TARGET DATA implements putting the data on the target device for the entire region. I haven't used OpenACC, but the concept is the same.
Check out the OpenMP 4.5.0 Examples document for how to share data that way with OpenMP.
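As a minimal sketch of that idea (not taken from the Examples document; the program and variable names are made up for illustration), the array can be mapped once by an enclosing TARGET DATA region and then reused by two kernels. Each scalar result gets an explicit map(tofrom:...) on its own kernel so that it does not depend on implicit mapping rules.

```fortran
program two_kernels_one_transfer
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: total, largest
  real, allocatable :: a(:)

  allocate (a(n))
  do i = 1, n
    a(i) = 2*i
  end do
  total = 0.
  largest = 0.

  ! a is copied to the device once, on entry to the data region.
  !$omp target data map(to:a)

  ! a is already present here, so map(to:a) should not trigger another copy;
  ! only the scalar travels in both directions.
  !$omp target teams distribute parallel do reduction(+:total) map(tofrom:total) map(to:a)
  do i = 1, n
    total = total + a(i)
  end do
  !$omp end target teams distribute parallel do

  !$omp target teams distribute parallel do reduction(max:largest) map(tofrom:largest) map(to:a)
  do i = 1, n
    largest = max(largest, a(i))
  end do
  !$omp end target teams distribute parallel do

  !$omp end target data

  print *, total     ! expected 1001000
  print *, largest   ! expected 2000
end program two_kernels_one_transfer
```

Running this with LIBOMPTARGET_PROFILE=T,usec, as in the log earlier in the thread, is one way to look at the DataWrite/DataRead lines and judge how much is actually transferred.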
I am starting to understand what I am seeing.
It appears that the _OPENMP macro as well as openmp_version return 201611, which IIRC is the date of TR4, the first preview of OpenMP 5.0 on top of 4.5. So ifort/ifx support OpenMP 4.5 plus preview features, not OpenMP 5.x.
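A small sketch for checking this on any compiler: openmp_version is a named constant in the omp_lib module holding the yyyymm date of the supported specification (201511 is OpenMP 4.5, 201811 is 5.0, 202011 is 5.1). The "!$" conditional-compilation sentinel keeps the program buildable when OpenMP is switched off; the program name is made up.

```fortran
program show_openmp_version
!$ use omp_lib
  implicit none
  integer :: version

  version = 0                  ! stays 0 when OpenMP is not enabled
!$ version = openmp_version    ! yyyymm date of the supported specification

  if (version > 0) then
    print '(a,i0)', 'openmp_version = ', version
  else
    print '(a)', 'compiled without OpenMP support'
  end if
end program show_openmp_version
```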
OpenMP 5.1 has the following in "2.21.7 Data-Mapping Attribute Rules, Clauses, and Directives":
"If a list item appears in a reduction, lastprivate or linear clause on a combined target construct then it is treated as if it also appears in a map clause with a map-type of tofrom."
So once Intel supports OpenMP 5.x, the additional map clause for the reduction variable will no longer be necessary. For the time being, it is...
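Until then, a version of the testcase that should give the same answer under both the 4.5 and the 5.x rules might look like the sketch below. It spells out map(tofrom:s) on a combined construct, so nothing depends on the implicit mapping of the reduction variable. The program name and the final check are additions for illustration.

```fortran
program reduction_with_explicit_map
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: s
  real, allocatable :: a(:)

  allocate (a(n))
  do i = 1, n
    a(i) = 2*i
  end do

  s = 0.
  ! map(tofrom:s) is spelled out so the result does not depend on the
  ! OpenMP 5.x implicit-map rule for reduction variables.
  !$omp target teams distribute parallel do reduction(+:s) map(tofrom:s) map(to:a)
  do i = 1, n
    s = s + a(i)
  end do
  !$omp end target teams distribute parallel do

  print *, s   ! expected n*(n+1) = 1001000
  if (abs(s - real(n)*real(n + 1)) > 0.5) error stop 'unexpected sum'
end program reduction_with_explicit_map
```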
SUBROUTINE SIMPLE(N, A, B)

  INTEGER I, N
  REAL B(N), A(N)

!$OMP PARALLEL DO  !I is private by default
  DO I=2,N
    B(I) = (A(I) + A(I-1)) / 2.0
  ENDDO
!$OMP END PARALLEL DO

END SUBROUTINE SIMPLE
So, taking this sample from the suggested manual: is this really just the use of parallel to evaluate a surjective function, or did I miss something?
Yes, you missed the essential keywords TARGET and REDUCTION. And my question was about how to get the right result with the Intel compiler which seems to fall short of implementing OpenMP 5 here w.r.t. the MAP handling.
My apologies, I was really talking to myself. The sample program
S-1  subroutine vec_mult(N)
S-2    integer :: i,N
S-3    real :: p(N), v1(N), v2(N)
S-4    call init(v1, v2, N)
S-5    !$omp target map(v1,v2,p)
S-6    !$omp parallel do
S-7    do i=1,N
S-8      p(i) = v1(i) * v2(i)
S-9    end do
S-10   !$omp end target
S-11   call output(p, N)
S-12 end subroutine
from the OpenMP examples manual is also a surjective function: v1 and v2 map to the p space. If we take S-8 and make it p(i) = v1(i)*v2(i)*p(i-1), we cannot parallelize it, I think, because each p(i) needs the previous one, though we could write it as a recursion. I am not commenting on your problem; I am just wondering aloud. This is probably obvious to those who do parallel programming all of the time, but in most of my problems I have a p(i) = v1(i)*v2(i)*p(i-1). Thanks for the interesting lesson.
According to this article, the REDUCTION clause on TEAMS is supported.
I helped another developer with converting OpenACC to OpenMP and learned a few things.
This works, including the reduction, and allows you to run different TEAMS regions on the same data, eliminating multiple data transfers.
program p5
  implicit none
  integer :: i, n = 1000
  real :: s, s1
  real, allocatable :: a(:)
  allocate (a(n))
  do i = 1, n
    a(i) = 2*i
  end do
  s = 0.
  s1 = 0.
  !$omp target enter data map(to:a,n,s,s1)
  !$omp target teams
  !$omp distribute parallel do reduction(+:s)
  do i = 1, n
    s = s + a(i)
  end do
  !$omp end target teams
  !$omp target teams
  !$omp distribute parallel do reduction(+:s1)
  do i = 1, n
    s1 = s1 + a(i)
  end do
  !$omp end target teams
  !$omp target exit data map(from:s,s1)
  print *, a(1),a(n)
  print *, s
  print *, s1
end program