/Qopenmp and /Qparallel with do concurrent reduce


I see that the new version of the Intel Fortran compiler supports reduction with do concurrent. I wanted to test it with ifort using both the /Qopenmp and /Qparallel options.


Do concurrent with reduce

The following code uses do concurrent with reduce. 

!  Do concurrent test with reduce 

program dc_test
  use omp_lib
  implicit none

  ! variables 

  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()

  !--- iteration

  do step = 1, steps

     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do

     ! using reduction

     do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do

  end do

  time_end = omp_get_wtime()

  print '("  The time is   = ", f8.3," seconds." )', time_end - time_start

end program dc_test

I performed the tests and got results:


1 ) ifort main.f90 /Qopenmp

stack overflow error



2 ) ifort main.f90 /Qparallel /QxHost /O3

The code works fine and there is no error!



Do concurrent without reduce

The same code uses do concurrent without reduce. 

!  Do concurrent test with no reduce 

program dc_test
  use omp_lib
  implicit none

  ! variables 

  integer ( 4 ), parameter :: x = 256 , y = 256, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()

  !--- iteration

  do step = 1, steps

     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do

  end do

  time_end = omp_get_wtime()

  print '("  The time is   = ", f8.3," seconds." )', time_end - time_start

end program dc_test


3 ) ifort main.f90 /Qopenmp

No error !




4 ) ifort main.f90 /Qparallel /QxHost /O3

No error !




So why does the /Qopenmp option produce an error when reduce is used with do concurrent?


You may need a linker option to set the stack size (larger than the default).

Also, with /Qopenmp, the additional OpenMP threads get a default stack size (that may be too small for the reduction). If you must use /Qopenmp, include "use omp_lib" in the PROGRAM, then at the beginning of the program insert "call kmp_set_stacksize(YourSizeHere)".

Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also be set via the KMP_STACKSIZE environment variable. In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program.

Or set the environment variable OMP_STACKSIZE=sizeYouWant

Jim Dempsey


Why are you using /Qparallel? That turns on the auto-parallelizer. I'm not sure what that does if anything with DO CONCURRENT.

As I just posted on another thread, the DO CONCURRENT / OpenMP combination uses OMP SIMD. The parallelization team is looking at using OMP PARALLEL DO in a future release.
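
For readers unfamiliar with the two OpenMP constructs named above, here is a minimal sketch of the same elementwise update written with explicit directives. It only illustrates the difference between the constructs and is not what the compiler generates for DO CONCURRENT. The module and routine names (omp_forms_demo, update_simd, update_parallel_do) are hypothetical.

```fortran
! Sketch only: the names below are hypothetical, not from the thread.
module omp_forms_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! OMP SIMD: the inner loop is vectorized, but it still runs on one thread.
  subroutine update_simd(c, md, dt)
    real(dp), intent(inout) :: c(:,:)
    real(dp), intent(in)    :: md(:,:)
    real(dp), intent(in)    :: dt
    integer :: i, j
    do j = 1, size(c, 2)
       !$omp simd
       do i = 1, size(c, 1)
          c(i,j) = c(i,j) + dt*md(i,j)
       end do
    end do
  end subroutine update_simd

  ! OMP PARALLEL DO: the iterations are shared among several threads.
  subroutine update_parallel_do(c, md, dt)
    real(dp), intent(inout) :: c(:,:)
    real(dp), intent(in)    :: md(:,:)
    real(dp), intent(in)    :: dt
    integer :: i, j
    !$omp parallel do collapse(2) private(i, j)
    do j = 1, size(c, 2)
       do i = 1, size(c, 1)
          c(i,j) = c(i,j) + dt*md(i,j)
       end do
    end do
    !$omp end parallel do
  end subroutine update_parallel_do

end module omp_forms_demo
```

Both routines compute the same result. Without an OpenMP option the directives are treated as comments and the loops run serially.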

What happens if you use ifx?


With ifx the error seems to be the same.


Also, after some modifications to the code (x = 100, y = 100):


1)  no reduce

program dc_test
  use omp_lib
  implicit none

  ! variables 

  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 100000 , step, i, j, ip, jp, im, jm
  real ( 8 )    :: dt = 0.01 , time_start , time_end, time,  m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md
  !real ( 8 ), dimension ( x, y ) :: r, c, f, cc, dc, md

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )

 ! call kmp_set_stacksize_s(1000000)


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()

  !--- iteration

  do step = 1, steps

     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
             &c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do

     ! using reduction

     do concurrent ( integer :: i=1:x, j=1:y ) !reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do

  end do

  time_end = omp_get_wtime()

  time = time_end - time_start

  print*, 'x = ', x
  print*, 'y = ', y
  print*, time

end program dc_test


I get the output.


2) ifx with reduce in do concurrent


 do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )


No output





Thanks for the new reproducers! That saved me doing that work. I used the new versions for these tests just now.

I cannot reproduce the runtime failures using ifort with the reduce or without the reduce clause. I used Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000 on Windows 11 from the Intel(r) oneAPI Tools command windows. Works ok on Linux, too.

Now for ifx, it works just fine without the reduce clause, but with the reduce clause there is no output on Windows as you mentioned. I do get output for both versions on Linux.

I'll get a bug filed against ifx on Windows.


I filed CMPLRLLVM-52867. I'll let you know when it's fixed.

Compile the reduce version of the reproducer like this and it runs and prints output. As @jimdempseyatthecove pointed out earlier, the size of the stack needs to be increased.

Yes, I used ifx. It's the compiler of the future. The -F compiler option also works with ifort.

>ifx -Qopenmp -F20000000 dctest.reduce.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.34.31937.0
Copyright (C) Microsoft Corporation.  All rights reserved.


 x =          100
 y =          100


It does work perfectly fine with ifx. The speed is almost double compared to ifort. 


ifort /Qopenmp /F20000000 main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.


 x =          100
 y =          100

ifx /Qopenmp /F20000000 main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.


 x =          100
 y =          100

However, using the routine in the code


call kmp_set_stacksize_s(20000000)


with the command


ifx /Qopenmp main.f90

gives no output.


ifx /Qopenmp main.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.




 But with ifort


ifort /Qopenmp main.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.


 x =          100
 y =          100

  I get the output.

In anticipation of another question, why is there a performance difference between the version with the reduce and without it?


On my laptop using ifx I see:

reduce   8.11413409999977 seconds

no reduce   3.61359100000027 seconds


Strictly speaking

    c(i,j) = c(i,j) + dt*m*md(i,j)

is not a reduction.

Reduction using addition often would have a form like s = s + x(i), which looks similar to c(i,j) = c(i,j) + dt*m*md(i,j), but the difference is that in s = s + x(i) the same s value is being updated in each loop iteration.
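
To make the distinction concrete, here is a minimal sketch that contrasts the elementwise update with a true sum reduction. It assumes a compiler that accepts the REDUCE locality specifier on DO CONCURRENT, as ifort and ifx do in this thread. The module and routine names (dc_reduce_demo, update_field, total) are hypothetical, and the factor m from the reproducer is left out for brevity.

```fortran
! Sketch only: the names below are hypothetical, not from the reproducer.
module dc_reduce_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Elementwise update: iteration (i,j) touches only c(i,j), so the
  ! iterations are independent and no REDUCE clause is needed.
  subroutine update_field(c, md, dt)
    real(dp), intent(inout) :: c(:,:)
    real(dp), intent(in)    :: md(:,:)
    real(dp), intent(in)    :: dt
    integer :: i, j
    do concurrent (i = 1:size(c, 1), j = 1:size(c, 2))
       c(i,j) = c(i,j) + dt*md(i,j)
    end do
  end subroutine update_field

  ! True reduction: every iteration accumulates into the same scalar acc.
  function total(c) result(s)
    real(dp), intent(in) :: c(:,:)
    real(dp) :: s
    real(dp) :: acc
    integer :: i, j
    acc = 0.0_dp
    do concurrent (i = 1:size(c, 1), j = 1:size(c, 2)) reduce(+:acc)
       acc = acc + c(i,j)
    end do
    s = acc
  end function total

end module dc_reduce_demo
```

Only total needs the clause, because acc is shared by all iterations. In update_field each iteration owns its own element of c.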


Yeah, I see now that it is not really reduction. Thanks for the clarification. It helps a lot!


Since reduction is a new feature in the latest compiler updates, I was just playing around with it out of curiosity. It was a random shot.



The reproducer is doing more work than necessary with both compilers due to the REDUCE clause that isn't really doing a reduce.

Where did you place kmp_set_stacksize_s()? The Developer Guide and Reference (DGR) states:

"In order for KMP_SET_STACKSIZE_S() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program."


Yes, I have now put the routine at the top, before the parallel region.


But with ifx on Windows the output shows three different behaviors when using the stack size routine

call kmp_set_stacksize_s( stack_size )


Here is the new modification


program dc_test
  use omp_lib
  implicit none

  ! variables 

  integer ( 4 ), parameter :: x = 100 , y = 100, dx = 1, dy = 1
  integer ( 4 ) :: steps = 1000 , step, i, j, ip, jp, im, jm
  integer ( 8 ) :: stack_size = 20000000
  real ( 8 )    :: dt = 0.01 , time_start , time_end, time,  m = 1.0 , k = 0.5
  real ( 8 ), dimension ( :, : ), allocatable :: r, c, f, cc, dc, md

  call kmp_set_stacksize_s( stack_size )

  allocate ( r(x,y) , c(x,y), f(x,y), cc(x,y), dc(x,y), md(x,y) )

  !=== print array size 

  print*, ''
  print*, 'x = ', x
  print*, 'y = ', y


  call random_number ( r )
  c = 0.4 + 0.02*( 0.5 - r )

  time_start = omp_get_wtime()

  !--- iteration

  do step = 1, steps

     do concurrent ( integer :: i=1:x, j=1:y ) 

        f(i,j) =  24.0*c(i,j)*( 1.50 - c(i,j) )**2 - &
             &c(i,j)**3*( 1.03 - c(i,j) )   

        jp = j + 1
        jm = j - 1

        ip = i + 1
        im = i - 1

        if ( im == 0 ) im = x
        if ( ip == ( x + 1) ) ip = 1
        if ( jm == 0 ) jm = y
        if ( jp == ( y + 1) ) jp = 1

        cc(i,j)   = ( c(ip,j) + c(im,j) + c(i,jm) + c(i,jp) - &
             4.0*c(i,j) ) /( dx*dy )

        dc(i,j) = f(i,j) - k*cc(i,j)

        md(i,j) = ( dc(ip,j) + dc(im,j) + dc(i,jm) &
             + dc(i,jp) - 4.0*dc(i,j) ) / ( dx*dy )

     end do

     do concurrent ( integer :: i=1:x, j=1:y ) reduce (+ : c )

        c(i,j) =  c(i,j) + dt*m*md(i,j)

     end do

  end do

  time_end = omp_get_wtime()

  time = time_end - time_start

  !=== print the computing time here

  print*, 'Time = ', time

end program dc_test


1) When the array is small, the output after the reduction (i.e. Time) is printed:


ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.



 x =          100
 y =          100
 Time =   0.101688399910927

2) When the array is big, there is a stack overflow error. The output before the reduction is printed, but there is no output after it:


ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.



 x =          400
 y =          400
forrtl: severe (170): Program Exception - stack overflow
Image              PC                Routine            Line        Source
main.exe           00007FF7DB711657  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D2DE2  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C71EEC3  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C667BA7  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C668EE9  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C66EC22  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C669686  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFA6C62256C  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D1A86  Unknown               Unknown  Unknown
main.exe           00007FF7DB6D9DDB  Unknown               Unknown  Unknown
main.exe           00007FF7DB711860  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFAB0A17344  Unknown               Unknown  Unknown
ntdll.dll          00007FFAB1F426B1  Unknown               Unknown  Unknown


3) When the array size is in between, the output before the reduction is printed, but there is no output after it:

ifx main.f90 /Qopenmp
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.2.0 Build 20230627
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.



 x =          200
 y =          200


Now for ifort, with the in-between array size:


ifort main.f90 /Qopenmp
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.10.0 Build 20230609_000000
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.



 x =          200
 y =          200
 Time =    1.58534609992057


For the small and big array sizes, ifort and ifx behave the same.

The stack size needs to change with the size of the problem. One size does not fit all.
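
As a rough illustration of how the requirement scales, the sketch below computes the size of one copy of an nx-by-ny real(8) array and adds some headroom. It assumes that a reduction on an array gives each thread its own private copy on the stack, which the thread suggests but does not confirm. The names (stack_estimate, private_copy_bytes, suggested_stack_bytes) and the headroom parameter are hypothetical.

```fortran
! Sketch only: assumes one private copy of the reduction array per thread.
module stack_estimate
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains

  ! Bytes occupied by one nx-by-ny array of real(real64) elements.
  pure function private_copy_bytes(nx, ny) result(nbytes)
    integer, intent(in) :: nx, ny
    integer(int64) :: nbytes
    integer(int64) :: elem_bytes
    elem_bytes = storage_size(1.0_real64, int64) / 8_int64   ! bits to bytes
    nbytes = int(nx, int64) * int(ny, int64) * elem_bytes
  end function private_copy_bytes

  ! One private copy plus a caller-chosen headroom for everything else.
  pure function suggested_stack_bytes(nx, ny, headroom) result(nbytes)
    integer, intent(in)        :: nx, ny
    integer(int64), intent(in) :: headroom
    integer(int64) :: nbytes
    nbytes = private_copy_bytes(nx, ny) + headroom
  end function suggested_stack_bytes

end module stack_estimate
```

For the array sizes tried in this thread, one copy takes 80,000 bytes (100 x 100), 320,000 bytes (200 x 200), and 1,280,000 bytes (400 x 400). A value estimated this way could be passed to kmp_set_stacksize_s or used to choose the /F value, but the right headroom has to be found by experiment.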

