I see that the new version of the Intel Fortran compiler supports the reduce clause with do concurrent. I wanted to test it with ifort using both the /Qopenmp and /Qparallel options.
Do concurrent with reduce
The following code uses do concurrent with reduce.
!
! Do concurrent test with reduce
!
program dc_test
  use omp_lib
  implicit none
  ! variables
  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 ) :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )
  !---
  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )
  time_start = omp_get_wtime()
  !--- iteration
  do step = 1, steps
    do concurrent ( integer :: i=1:x, j=1:y )
      f(i,j) = 24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )
      jp = j + 1
      jm = j - 1
      ip = i + 1
      im = i - 1
      if ( im == 0 ) im = x
      if ( ip == ( x + 1) ) ip = 1
      if ( jm == 0 ) jm = y
      if ( jp == ( y + 1) ) jp = 1
      cc(i,j) = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
                  4.0*c(i,j) ) /( dx*dy )
      dc(i,j) = f(i,j) - k*cc(i,j)
      md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
                  + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )
    end do
    ! using reduction
    do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )
      c(i,j) = c(i,j) + dt*m*md(i,j)
    end do
  end do
  time_end = omp_get_wtime()
  print '(" The time is = ", f8.3," seconds." )', time_end - time_start
end program dc_test
I ran the tests and got these results:
1) ifort main.f90 /Qopenmp
Stack overflow error
2) ifort main.f90 /Qparallel /QxHost /O3
The code works fine and there is no error!
Do concurrent without reduce
The same code uses do concurrent without reduce.
!
! Do concurrent test with no reduce
!
program dc_test
  use omp_lib
  implicit none
  ! variables
  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 ) :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )
  !---
  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )
  time_start = omp_get_wtime()
  !--- iteration
  do step = 1, steps
    do concurrent ( integer :: i=1:x, j=1:y )
      f(i,j) = 24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )
      jp = j + 1
      jm = j - 1
      ip = i + 1
      im = i - 1
      if ( im == 0 ) im = x
      if ( ip == ( x + 1) ) ip = 1
      if ( jm == 0 ) jm = y
      if ( jp == ( y + 1) ) jp = 1
      cc(i,j) = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
                  4.0*c(i,j) ) /( dx*dy )
      dc(i,j) = f(i,j) - k*cc(i,j)
      md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
                  + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )
      c(i,j) = c(i,j) + dt*m*md(i,j)
    end do
  end do
  time_end = omp_get_wtime()
  print '(" The time is = ", f8.3," seconds." )', time_end - time_start
end program dc_test
3) ifort main.f90 /Qopenmp
No error!
4) ifort main.f90 /Qparallel /QxHost /O3
No error!
So why does the /Qopenmp option produce an error when reduce is used with do concurrent?
You may need to use a linker option to set the stack size (larger than the default).
Also, with /Qopenmp, the additional OpenMP threads get a default stack size that may be too small for the reduction. If you must use /Qopenmp, include "use omp_lib" in the PROGRAM, then at the beginning of the program insert "call kmp_set_stacksize(YourSizeHere)". The documentation describes the related routine KMP_SET_STACKSIZE_S() as follows:
"Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also be set via the KMP_STACKSIZE environment variable. In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program."
Alternatively, set the environment variable OMP_STACKSIZE=sizeYouWant.
Jim Dempsey
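For anyone following along, the advice above could look roughly like the sketch below. It uses the kmp_set_stacksize_s variant quoted from the documentation; kmp_set_stacksize_s, kmp_get_stacksize_s and kmp_size_t_kind are Intel extensions exported by Intel's omp_lib, so this assumes ifort or ifx with /Qopenmp. The 64 MB figure and the names set_stack_demo and stack_bytes are made up for illustration.

```fortran
program set_stack_demo
  use omp_lib
  implicit none
  ! Hypothetical size: 64 MB for each OpenMP worker thread.
  integer(kind=kmp_size_t_kind) :: stack_bytes

  stack_bytes = 64_kmp_size_t_kind * 1024 * 1024

  ! Must be called before the first parallel region is entered,
  ! otherwise the request has no effect.
  call kmp_set_stacksize_s(stack_bytes)

  print *, 'requested stack size (bytes):', stack_bytes
  print *, 'reported stack size (bytes): ', kmp_get_stacksize_s()

  ! First parallel region: worker threads are created here.
  !$omp parallel
  !$omp single
  print *, 'threads in team:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program set_stack_demo
```

Note that this only concerns the stacks of the OpenMP worker threads. The stack of the main thread is a separate setting, which is what the linker option is for.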
Why are you using /Qparallel? That turns on the auto-parallelizer. I'm not sure what that does, if anything, with DO CONCURRENT.
As I just posted on another thread, the DO CONCURRENT / OpenMP combination uses OMP SIMD. The parallelization team is looking at using OMP PARALLEL DO in a future release.
What happens if you use ifx?
With ifx, the error seems to be the same.
Also, after some modifications to the code:
x = 100
y = 100
1) no reduce
program dc_test
  use omp_lib
  implicit none
  ! variables
  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 ) :: dt = 0.01 , time_start , time_end, time, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  !real ( 8 ), dimension ( x, y ) :: r, c, f, cc, dc, md
  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )
  ! call kmp_set_stacksize_s(1000000)
  !---
  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )
  time_start = omp_get_wtime()
  !--- iteration
  do step = 1, steps
    do concurrent ( integer :: i=1:x, j=1:y )
      f(i,j) = 24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
               &c(i,j)**3*( 1.03 - c(i,j) )
      jp = j + 1
      jm = j - 1
      ip = i + 1
      im = i - 1
      if ( im == 0 ) im = x
      if ( ip == ( x + 1) ) ip = 1
      if ( jm == 0 ) jm = y
      if ( jp == ( y + 1) ) jp = 1
      cc(i,j) = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
                  4.0*c(i,j) ) /( dx*dy )
      dc(i,j) = f(i,j) - k*cc(i,j)
      md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
                  + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )
    end do
    ! using reduction
    do concurrent ( integer :: i=1:x, j=1:y ) !reduce (+ : c )
      c(i,j) = c(i,j) + dt*m*md(i,j)
    end do
  end do
  time_end = omp_get_wtime()
  time = time_end - time_start
  print*, 'x = ', x
  print*, 'y = ', y
  print*, time
end program dc_test
I get the output.
2) ifx with reduce in do concurrent
do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )
No output
Thanks for the new reproducers! That saved me doing that work. I used the new versions for these tests just now.
I cannot reproduce the runtime failures using ifort with the reduce or without the reduce clause. I used Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000 on Windows 11 from the Intel(r) oneAPI Tools command windows. Works ok on Linux, too.
Now for ifx, it works just fine without the reduce clause, but with the reduce clause there is no output on Windows as you mentioned. I do get output for both versions on Linux.
I'll get a bug filed against ifx on Windows.
I filed CMPLRLLVM-52867. I'll let you know when it's fixed.
Compile the reduce version of the reproducer like this and it runs and prints output. As @jimdempseyatthecove pointed out earlier the size of the stack needs to be increased.
Yes, I used ifx. It's the compiler of the future. The -F compiler option also works with ifort.
>ifx -Qopenmp -F20000000 dctest.reduce.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.34.31937.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:dctest.reduce.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
dctest.reduce.obj
>dctest.reduce.exe
x = 100
y = 100
8.11413409999977
It does work perfectly fine with ifx. The speed is almost double compared to ifort.
ifort /Qopenmp /F20000000 main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 100
y = 100
17.5483603999019
ifx /Qopenmp /F20000000 main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 100
y = 100
9.16886619990692
However, calling the routine in the code
call kmp_set_stacksize_s(20000000)
and compiling with the command
ifx /Qopenmp main.f90
gives no output.
ifx /Qopenmp main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
>
But with ifort
ifort /Qopenmp main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 100
y = 100
17.6567553000059
I get the output.
In anticipation of another question, why is there a performance difference between the version with the reduce and without it?
On my laptop using ifx I see:
reduce 8.11413409999977 seconds
no reduce 3.61359100000027 seconds
Strictly speaking
c(i,j) = c(i,j) + dt*m*md(i,j)
is not a reduction.
Reduction using addition often would have a form like s = s + x(i), which looks similar to c(i,j) = c(i,j) + dt*m*md(i,j), but the difference is that in s = s + x(i) the same s value is being updated in each loop iteration.
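To make the distinction concrete, here is a small sketch that puts the two patterns side by side. It is only an illustration, not the reproducer from this thread: the names reduce_demo, n, s, x and c are invented, and it assumes a compiler that accepts the REDUCE locality specifier on DO CONCURRENT.

```fortran
program reduce_demo
  implicit none
  integer, parameter :: n = 1000
  real(8) :: x(n), c(n), s
  integer :: i

  x = 1.0d0
  c = 0.0d0
  s = 0.0d0

  ! True reduction: every iteration updates the same scalar s,
  ! so s is named in the reduce clause.
  do concurrent (i = 1:n) reduce(+ : s)
    s = s + x(i)
  end do

  ! Element-wise update: iteration i touches only c(i),
  ! so there is nothing to reduce and no clause is needed.
  do concurrent (i = 1:n)
    c(i) = c(i) + 0.5d0*x(i)
  end do

  print *, 's      = ', s        ! sum of 1000 ones
  print *, 'sum(c) = ', sum(c)   ! 1000 * 0.5
end program reduce_demo
```

With these inputs s should come out as 1000 and sum(c) as 500. In the second loop each iteration owns its own array element, which is the same situation as c(i,j) = c(i,j) + dt*m*md(i,j).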
Yeah, I see now that it is not really a reduction. Thanks for the clarification. It helps a lot!
Since reduction is a new feature in the latest compiler updates, I was just playing around with it to have a look. It was a random shot, just out of curiosity.
The reproducer is doing more work than necessary with both compilers due to the REDUCE clause that isn't really doing a reduce.
Where did you place kmp_set_stacksize_s()? The Developer Guide and Reference (DGR) states:
"In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program."
Yeah, now I have put the routine at the top, before the parallel region.
But with ifx on Windows, the output shows three different behaviors when using the stack size routine
call kmp_set_stacksize_s( stack_size )
Here is the new modification:
program dc_test
  use omp_lib
  implicit none
  ! variables
  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 1000 , step, i, j, ip, jp, im, jm
  integer ( 8 ) :: stack_size = 20000000
  real ( 8 ) :: dt = 0.01 , time_start , time_end, time, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  call kmp_set_stacksize_s( stack_size )
  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )
  !=== print array size
  print*, ''
  print*, 'x = ', x
  print*, 'y = ', y
  !---
  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )
  time_start = omp_get_wtime()
  !--- iteration
  do step = 1, steps
    do concurrent ( integer :: i=1:x, j=1:y )
      f(i,j) = 24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
               &c(i,j)**3*( 1.03 - c(i,j) )
      jp = j + 1
      jm = j - 1
      ip = i + 1
      im = i - 1
      if ( im == 0 ) im = x
      if ( ip == ( x + 1) ) ip = 1
      if ( jm == 0 ) jm = y
      if ( jp == ( y + 1) ) jp = 1
      cc(i,j) = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
                  4.0*c(i,j) ) /( dx*dy )
      dc(i,j) = f(i,j) - k*cc(i,j)
      md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
                  + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )
    end do
    do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )
      c(i,j) = c(i,j) + dt*m*md(i,j)
    end do
  end do
  time_end = omp_get_wtime()
  time = time_end - time_start
  !=== print the computing time here
  print*, 'Time = ', time
end program dc_test
1) When the array is small, the output after the reduction (i.e. Time) is printed:
ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 100
y = 100
Time = 0.101688399910927
2) When the array is big: stack overflow error. There is no output after the reduction, but there is output before the reduction:
ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 400
y = 400
forrtl: severe (170): Program Exception - stack overflow
Image PC Routine Line Source
main.exe 00007FF7DB711657 Unknown Unknown Unknown
main.exe 00007FF7DB6D2DE2 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C71EEC3 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C667BA7 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C668EE9 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C66EC22 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C669686 Unknown Unknown Unknown
libiomp5md.dll 00007FFA6C62256C Unknown Unknown Unknown
main.exe 00007FF7DB6D1A86 Unknown Unknown Unknown
main.exe 00007FF7DB6D9DDB Unknown Unknown Unknown
main.exe 00007FF7DB711860 Unknown Unknown Unknown
KERNEL32.DLL 00007FFAB0A17344 Unknown Unknown Unknown
ntdll.dll 00007FFAB1F426B1 Unknown Unknown Unknown
3) When the array size is in between: there is no output after the reduction, but there is output before the reduction:
ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 200
y = 200
Now for ifort, with the in-between array size:
ifort main.f90 /Qopenmp
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj
>main
x = 200
y = 200
Time = 1.58534609992057
For the small and big array sizes, the ifort and ifx behaviors are the same.
The stack size needs to change with the size of the problem. One size does not fit all.
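As a rough way to see how the requirement scales, the sketch below estimates the storage for one private copy of the real(8) array c at the three sizes tried above. It rests on an assumption that is not confirmed anywhere in this thread: that the reduce clause gives each thread a full private copy of c, and that this copy lives on the stack. The program and variable names are invented for illustration.

```fortran
program stack_estimate
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: sizes(3) = [100, 200, 400]
  integer(i8) :: bytes
  integer :: n

  do n = 1, size(sizes)
    ! One real(8) element takes 8 bytes, so an x-by-y copy
    ! with x = y needs x*y*8 bytes (assumed to be on the stack).
    bytes = int(sizes(n), i8) * int(sizes(n), i8) * 8_i8
    print '(a,i0,a,i0,a,i0,a)', 'x = y = ', sizes(n), ': ', &
          bytes, ' bytes (', bytes/1024, ' KiB)'
  end do
end program stack_estimate
```

This prints 80000 bytes (78 KiB) for 100, 320000 bytes (312 KiB) for 200 and 1280000 bytes (1250 KiB) for 400. For comparison, the Microsoft linker reserves 1 MB of stack for the main thread by default, so under the assumption above the 400 x 400 case would not fit without a larger linker stack setting. Treat this as a plausible reading of the results, not a confirmed diagnosis.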