Solved: Using function pointers in the openmp parallel domain results in significant performance degradation

Fortranlover · ‎02-26-2024

Hello everyone! I'm a beginner using Fortran for scientific calculations.

I used openmp to accelerate parallel computing in my code. I called a million computing subroutines in the parallel domain, and the acceleration effect is very good. The example code is as follows, my Intel Fortran Compiler version is 2024.0.2 Build 20231213, using compilation options ifx /Qopenmp on Windows.

MODULE SOLVER

IMPLICIT NONE
real             :: var1,var2,var3,var4
integer          :: i,j,timestep,nx,ny
real,allocatable :: array1(:,:),array2(:,:)

contains

subroutine main_loop

nx = 1000
ny = 1000
allocate( array1(nx,ny), array2(nx,ny) )

!$OMP PARALLEL DEFAULT(FIRSTPRIVATE)shared(array1,array2)
Do timestep=1,100000

!$OMP DO
DO j=1,ny
DO i=1,nx

var1 = ( array1(i+1,j+1) + array1(i+1,j+1) )/ var3
var2 = ( array1(i-1,j+1) + array1(i-1,j+1) )/ var3
call sub_calculate( var1, var2, var4 )
var5 = var4 * (var1+var2)

var1 = ( array1(i-1,j+1) + array1(i+1,j+1) ) / var3
var2 = ( array1(i+1,j+1) + array1(i-1,j+1) ) / var3
call sub_calculate( var1, var2, var4 )
var6 = var4 * (var1+var2)

array2(i,j) = var5+var6

END DO
END DO
!$OMP END DO
END DO
!$OMP END PARALLEL
end subroutine

subroutine sub_calculate( var1, var2, var4 )
real :: var1, var2, var4
var4 = var1**2 + var2**2
end subroutine

END MODULE SOLVER

program main
use solver
implicit none
call main_loop
end program

Later, I wanted to implement code to perform different operations based on the variable mode_num values. And make the call in the openmp parallel domain, the example code is as follows,

MODULE SOLVER

IMPLICIT NONE
!same as above,add 
integer          :: mode_num

contains

subroutine main_loop

mode_num = 4
!same as above

end subroutine

subroutine sub_calculate( var1, var2, var4 )
real :: var1, var2, var4

select case(mode_num)
case(4)
var4 = var1**2 + var2**2
case(6)
!var4 = some other caculations
!...
case(100)
!...
end select

end subroutine

END MODULE SOLVER

program main
use solver
implicit none
call main_loop
end program

I found a significant decrease in computational performance, which I believe is due to the use of selection structures in the sub_calculate subroutine. The variable mode_num only needs to be read once throughout the entire calculation and remains unchanged. Therefore, I adopted the method of function pointers to avoid making judgments on mode_num every time the subroutine is called. Although I can get the correct results, the performance is still not ideal.

MODULE SOLVER

IMPLICIT NONE
real             :: var1,var2,var3,var4
integer          :: i,j,timestep,nx,ny,mode_num
real,allocatable :: array1(:,:),array2(:,:)

abstract interface
subroutine sub_calculate( var1, var2, var4 )
real :: var1, var2, var4
end interface

PROCEDURE(sub_calculate), POINTER :: PTR

contains

subroutine main_loop

nx = 1000
ny = 1000
mode_num = 4
allocate( array1(nx,ny), array2(nx,ny) )

SELECT CASE(mode_num)
CASE(4)
PTR => sub_calculate_4
CASE(6)
PTR => sub_calculate_6
!...
!CASE(100)
!PTR => sub_calculate_100
END SELECT

!$OMP PARALLEL DEFAULT(FIRSTPRIVATE)shared(PTR,array1,array2)
Do timestep=1,100000

!$OMP DO
DO j=1,ny
DO i=1,nx

var1 = ( array1(i+1,j+1) + array1(i+1,j+1) )/ var3
var2 = ( array1(i-1,j+1) + array1(i-1,j+1) )/ var3
call PTR( var1, var2, var4 )
var5 = var4 * (var1+var2)

var1 = ( array1(i-1,j+1) + array1(i+1,j+1) ) / var3
var2 = ( array1(i+1,j+1) + array1(i-1,j+1) ) / var3
call PTR( var1, var2, var4 )
var6 = var4 * (var1+var2)

array2(i,j) = var5+var6

END DO
END DO
!$OMP END DO
END DO
!$OMP END PARALLEL
end subroutine

subroutine sub_calculate_4( var1, var2, var4 )
rael :: var1, var2, var4
var4 = var1**2 + var2**2
end subroutine

subroutine sub_calculate_6( var1, var2, var4 )
rael :: var1, var2, var4
!var4 = some other caculations
end subroutine

!...

subroutine sub_calculate_100( var1, var2, var4 )
real :: var1, var2, var4
!var4 = some other caculations
end subroutine

END MODULE SOLVER

program main
use solver
implicit none
call main_loop
end program

Is it a question of whether the pointer variable PTR is shared or firstprivate in the openmp parallel domain? But after trying both, I found that the performance was very poor. Anyone can help me?

Fortranlover · ‎02-27-2024

Hello, Jim.

Using the compilation options ifx -Qopt-report=2 -c file.f90 , I found that the calls are not inlined. So I added the ATTRIBUTES directive options FORCEINLINE before the subroutine. It works normally right now.

Thank you very much!

View solution in original post

jimdempseyatthecove · ‎02-27-2024

I think the cause is when your code calls the sub_calculate_nnn directly (and your test code subroutines are relatively small) that the compiler optimization inlines the called procedures. Whereas, when the pointer is used, the calls are not inlined.

You can check the inline report to see if this is the case.

What happens to your test case program when you use the SELECT CASE(mode_num) inside the !$omp DO to call the selected sub_calculate_nnn procedure?

You can re-check the inline report to see if the calls are inlined

Also, your posted code has "rael" in place of "real"

Note, when the called procedures are substantive, the compiler optimization may not inline, and therefore the pointer method performance may not significantly differ from the (out-of-line) direct call method.

Jim Dempsey

Fortranlover · ‎02-27-2024

Hello, Jim.

Using the compilation options ifx -Qopt-report=2 -c file.f90 , I found that the calls are not inlined. So I added the ATTRIBUTES directive options FORCEINLINE before the subroutine. It works normally right now.

Thank you very much!

Using function pointers in the openmp parallel domain results in significant performance degradation

Fortran Language

OpenMP

Performance