Intel® Fortran Compiler

/Qopenmp and /Qparallel with do concurrent reduce

Fortran10
Novice

I see that the new version of the Intel Fortran compiler supports reduction with do concurrent. I wanted to test it with ifort using both the /Qopenmp and /Qparallel options.

 

Do concurrent with reduce

The following code uses do concurrent with reduce. 

!      
!  Do concurrent test with reduce 
!

program dc_test
  use omp_lib
  implicit none


  ! variables 

  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md


  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )


  !---


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()


  !--- iteration


  do step = 1, steps


     do concurrent ( integer :: i=1:x, j=1:y ) 


        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do


     ! using reduction

     do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do


  end do


  time_end = omp_get_wtime()


  print '("  The time is   = ", f8.3," seconds." )', time_end - time_start


end program dc_test

I ran the tests and got these results:

 

1 ) ifort main.f90 /Qopenmp

Stack overflow error.

2 ) ifort main.f90 /Qparallel /QxHost /O3

The code works fine and there is no error!


 

Do concurrent without reduce

The same code uses do concurrent without reduce. 

!      
!  Do concurrent test with no reduce 
!

program dc_test
  use omp_lib
  implicit none


  ! variables 

  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md


  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )


  !---


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()


  !--- iteration


  do step = 1, steps


     do concurrent ( integer :: i=1:x, j=1:y ) 


        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do


  end do


  time_end = omp_get_wtime()


  print '("  The time is   = ", f8.3," seconds." )', time_end - time_start


end program dc_test

 

3 ) ifort main.f90 /Qopenmp

No error!

 

4 ) ifort main.f90 /Qparallel /QxHost /O3

No error!

 

So why does the /Qopenmp option produce an error when reduce is used with do concurrent?

1 Solution
jimdempseyatthecove
Honored Contributor III

You may need a linker option to set the stack size (larger than the default).

Also, with /Qopenmp, the additional OpenMP threads get a default stack size (that may be too small for the reduction). If you must use /Qopenmp, include "use omp_lib" in PROGRAM, then at the beginning of the program insert "call kmp_set_stacksize(YourSizeHere)".

From the documentation for KMP_SET_STACKSIZE_S(size): "Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also be set via the KMP_STACKSIZE environment variable. In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program."

Or set the environment variable OMP_STACKSIZE=sizeYouWant.

Jim Dempsey
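
A minimal sketch of the suggestion above, not taken from the original post. It assumes the Intel compilers, whose omp_lib module (to my knowledge) exports the kmp_* extensions used here, including the kind parameter kmp_size_t_kind and the query function kmp_get_stacksize_s. The program name and the 20000000-byte value are placeholders only.

```fortran
program set_stack_demo
  use omp_lib          ! Intel's omp_lib is assumed to export the kmp_* extensions
  implicit none

  ! Placeholder size in bytes; the right value depends on the problem size.
  integer ( kind = kmp_size_t_kind ) :: stack_bytes = 20000000

  ! Request the per-thread stack size for OpenMP worker threads.
  ! This call has to come before the first parallel region is entered.
  call kmp_set_stacksize_s ( stack_bytes )

  ! Report the value the runtime will use for threads it creates.
  print *, 'worker stack size (bytes) = ', kmp_get_stacksize_s ()

  !$omp parallel
  !$omp single
  print *, 'threads in team           = ', omp_get_num_threads ()
  !$omp end single
  !$omp end parallel

end program set_stack_demo
```

Build it with the OpenMP option (for example, ifx /Qopenmp on Windows). Note that this setting applies to the threads the OpenMP runtime creates; the stack of the initial (main) thread is fixed at link time, so it may need a linker stack option as well.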

 

14 Replies
Barbara_P_Intel
Employee

Why are you using /Qparallel? That turns on the auto-parallelizer. I'm not sure what that does, if anything, with DO CONCURRENT.

As I just posted on another thread, the DO CONCURRENT / OpenMP combination uses OMP SIMD. The parallelization team is looking at using OMP PARALLEL DO in a future release.

What happens if you use ifx?

 

Fortran10
Novice

With ifx, the error seems to be the same.

Fortran10
Novice

Also, after some modifications to the code (x = 100, y = 100):

1) no reduce

program dc_test
  use omp_lib
  implicit none


  ! variables 

  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, time,  m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  !real ( 8 ), dimension ( x, y ) :: r, c, f, cc, dc, md

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )

 ! call kmp_set_stacksize_s(1000000)

  !---


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()

  !--- iteration

  do step = 1, steps

     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
             &c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do

     ! using reduction

     do concurrent ( integer :: i=1:x, j=1:y ) !reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do

  end do


  time_end = omp_get_wtime()

  time = time_end - time_start

  print*, 'x = ', x
  print*, 'y = ', y
  print*, time


end program dc_test


I get the output.

 

2) ifx with reduce in do concurrent

    

 do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )

      

No output

 

Barbara_P_Intel
Employee

Thanks for the new reproducers! That saved me doing that work. I used the new versions for these tests just now.

I cannot reproduce the runtime failures using ifort with the reduce or without the reduce clause. I used Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000 on Windows 11 from the Intel(r) oneAPI Tools command windows. Works ok on Linux, too.

Now for ifx, it works just fine without the reduce clause, but with the reduce clause there is no output on Windows as you mentioned. I do get output for both versions on Linux.

I'll get a bug filed against ifx on Windows.

 

Barbara_P_Intel
Employee

I filed CMPLRLLVM-52867. I'll let you know when it's fixed.



Barbara_P_Intel
Employee

Compile the reduce version of the reproducer like this, and it runs and prints output. As @jimdempseyatthecove pointed out earlier, the size of the stack needs to be increased.

Yes, I used ifx. It's the compiler of the future. The -F compiler option also works with ifort.

>ifx -Qopenmp -F20000000 dctest.reduce.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.34.31937.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:dctest.reduce.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
dctest.reduce.obj

>dctest.reduce.exe
 x =          100
 y =          100
   8.11413409999977

 

Fortran10
Novice

It does work perfectly fine with ifx. The speed is almost double compared to ifort. 

 

ifort /Qopenmp /F20000000 main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main
 x =          100
 y =          100
   17.5483603999019
ifx /Qopenmp /F20000000 main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-stack:20000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main
 x =          100
 y =          100
   9.16886619990692

Fortran10
Novice

However using the routine in the code

 

call kmp_set_stacksize_s(20000000)

 

with the command

 

ifx /Qopenmp main.f90

gives no output.

 

ifx /Qopenmp main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main

>

 But with ifort

 

ifort /Qopenmp main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main
 x =          100
 y =          100
   17.6567553000059

  I get the output.

Barbara_P_Intel
Employee

In anticipation of another question, why is there a performance difference between the version with the reduce and without it?

 

On my laptop using ifx I see:

reduce   8.11413409999977 seconds

no reduce   3.61359100000027 seconds

 

Strictly speaking

    c(i,j) = c(i,j) + dt*m*md(i,j)

is not a reduction.

A reduction using addition usually has a form like s = s + x(i), which looks similar to c(i,j) = c(i,j) + dt*m*md(i,j). The difference is that in s = s + x(i) the same variable s is updated in every loop iteration, whereas each iteration of the c(i,j) update writes to a different array element.
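
To make the distinction concrete, here is a small sketch that is not from the original reproducer. The names reduce_demo, n, x, c, s, and dt are illustrative only, and it assumes a compiler that accepts the REDUCE locality specifier on DO CONCURRENT.

```fortran
program reduce_demo
  implicit none

  integer, parameter :: n = 1000
  integer            :: i
  real ( 8 )         :: x(n), c(n), s
  real ( 8 )         :: dt = 0.01d0

  x = 1.0d0
  c = 0.0d0
  s = 0.0d0

  ! A true reduction: every iteration accumulates into the same scalar s,
  ! so the iterations must be combined, hence the reduce clause.
  do concurrent ( i = 1:n ) reduce ( + : s )
     s = s + x(i)
  end do

  ! Not a reduction: each iteration updates only its own element c(i),
  ! so the iterations are independent and no reduce clause is needed.
  do concurrent ( i = 1:n )
     c(i) = c(i) + dt*x(i)
  end do

  print *, 'sum of x   = ', s
  print *, 'max of c   = ', maxval ( c )

end program reduce_demo
```

In the second loop, adding reduce ( + : c ) would ask the compiler to treat the whole array as a reduction variable, which is extra work with no benefit.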

 

Fortran10
Novice

Yeah, I see now that it is not really reduction. Thanks for the clarification. It helps a lot!

 

Since reduction is a new feature in the latest compiler updates, I was just playing around with it to have a look. Just out of curiosity it was a random shot. 

 

 

Barbara_P_Intel
Employee

The reproducer is doing more work than necessary with both compilers due to the REDUCE clause that isn't really doing a reduction.

Where did you place the call to kmp_set_stacksize_s()? The Developer Guide and Reference (DGR) states:

"In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program."

 

Fortran10
Novice

Yes, I have now put the routine at the top, before the parallel region.

But with ifx on Windows, the output shows three different behaviors when the stack size routine is used:

call kmp_set_stacksize_s( stack_size )

Here is the new modification:

 

program dc_test
  use omp_lib
  implicit none


  ! variables 

  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 1000 , step, i, j, ip, jp, im, jm
  integer ( 8 ) :: stack_size = 20000000
  real ( 8 )    :: dt = 0.01 , time_start , time_end, time,  m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md


  call kmp_set_stacksize_s( stack_size )

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )


  !=== print array size 

  print*, ''
  print*, 'x = ', x
  print*, 'y = ', y


  !---


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()


  !--- iteration


  do step = 1, steps


     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
             &c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do


     do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do


  end do


  time_end = omp_get_wtime()

  time = time_end - time_start


  !=== print the computing time here

  print*, 'Time = ', time


end program dc_test

 

1) When the array is small, the output after the reduction (i.e. Time) is printed:

 

ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main

 x =          100
 y =          100
 Time =   0.101688399910927

2) When the array is big, there is a stack overflow error: output appears before the reduction, but none after it.

 

ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main

 x =          400
 y =          400
forrtl: severe (170): Program Exception - stack overflow
Image              PC                Routine            Line        Source
main.exe           00007FF7DB711657  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D2DE2  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C71EEC3  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C667BA7  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C668EE9  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C66EC22  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C669686  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C62256C  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D1A86  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D9DDB  Unknown               Unknown  Unknown
main.exe           00007FF7DB711860  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFAB0A17344  Unknown               Unknown  Unknown
ntdll.dll          00007FFAB1F426B1  Unknown               Unknown  Unknown

 

3) When the array size is in between, there is output before the reduction but none after it:

ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main

 x =          200
 y =          200

 

Now for ifort, with the in-between array size:

 

ifort main.f90 /Qopenmp
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main.obj

>main

 x =          200
 y =          200
 Time =    1.58534609992057

 

For the small and big array sizes, ifort and ifx behave the same.

Barbara_P_Intel
Employee

The stack size needs to change with the size of the problem. One size does not fit all.
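
As a rough illustration of why one size does not fit all, the sketch below computes the storage for a single real(8) copy of an x-by-y array at the three sizes tried in this thread. It rests on an assumption that is not confirmed here: that an array reduction gives each thread at least one private copy of the array, and that the copy may be placed on the thread's stack. The program and variable names are illustrative only.

```fortran
program private_copy_size
  implicit none

  ! Array edge lengths tried in the thread (x = y in each case).
  integer, parameter :: sizes(3) = [ 100, 200, 400 ]
  integer            :: n
  integer ( 8 )      :: bytes
  real ( 8 )         :: elem      ! only used to query the element size

  do n = 1, size ( sizes )
     ! elements in the array times bytes per element (storage_size is in bits)
     bytes = int ( sizes(n), 8 )**2 * ( storage_size ( elem ) / 8 )
     print '( "x = y = ", i4, "   one real(8) copy = ", i8, " bytes" )', &
          sizes(n), bytes
  end do

end program private_copy_size
```

Under that assumption, the requirement grows with x*y, so a stack setting that is adequate for a 100 x 100 array can be too small for a 400 x 400 array. Treat the numbers as a lower bound on what each thread might need, not as the compiler's actual usage.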

