Intel® Fortran Compiler

Question regarding Compiler Speed compared to Lahey Fortran

Marius13
Beginner

Hello,
First I should mention that I am very new to the Intel Fortran Compiler, so there may be some obvious things that I am not doing or taking into account.


I was developing a simulation code with the Lahey Fujitsu Fortran compiler, but because computation times had gotten quite long, I decided to switch to a compiler that supports OpenMP and chose the Intel Fortran Compiler with Visual Studio 2022.
I managed to translate most of my code, but noticed that the computation time was much longer than with the Lahey compiler (before any parallelization).

I decided to test this with an example code and got a similar result. The test code does some random array multiplications and is attached at the end here.

With the Lahey Fujitsu compiler the test code took 0.5 seconds, and with the Intel Fortran Compiler it took 7.9 seconds. Is there something fundamental I am doing wrong, or does the Lahey compiler just have better array multiplication optimization?

-------------------- Test Code

program test

    !:::: Define main variables
    !-----------------------------------------------------------------------------------------------

    IMPLICIT NONE

    COMPLEX (KIND(0.0)), PARAMETER :: IM = (0.0, 1.0)
    REAL (KIND(0.0)), PARAMETER :: Pi = 3.141592653589793

    INTEGER :: jmax, lmax, nmax, j, l, n, i

    REAL (KIND(0.0)) :: Time1, Time2

    REAL (KIND(0.0)), DIMENSION(:,:,:), ALLOCATABLE :: M1, M2, M3

    CALL cpu_time(Time1)

    jmax = 1000
    lmax = 100
    nmax = 100

    ALLOCATE(M1(jmax,lmax,nmax),M2(jmax,lmax,nmax),M3(jmax,lmax,nmax))

    M1 = 0.
    M2 = 0.
    M3 = 0.
    i = 0

    DO j=1,jmax,1
        DO l=1,lmax,1
            DO n=1,nmax,1
                i = i + 1

                M1(j,l,n) = l +IM*j

            END DO
        END DO
    END DO
    i = 0

    DO j=1,jmax,1
        DO l=1,lmax,1
            DO n=1,nmax,1
                i = i + 1

                M2(j,l,n) = j +IM*l

            END DO
        END DO
    END DO

    i = 0

    DO j=2,jmax-1,1
        DO l=2,lmax-1,1
            DO n=2,nmax-1,1

                i = i + 1

                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)

            END DO
        END DO
    END DO

    DO j=2,jmax-1,1
        DO l=2,lmax-1,1
            DO n=2,nmax-1,1

                i = i + 1

                M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1))
                M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.33)
                M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.47)
                M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.87)

            END DO
        END DO
    END DO

    CALL cpu_time(Time2)

    Print *, "Finished after: ", (Time2-Time1), i

    PAUSE

end program test

Steve_Lionel
Honored Contributor III

Is it possible that you built this using a Debug configuration, which adds many runtime checks and disables optimizations? When I build and run this using default optimization using Intel Fortran on my Intel NUC system (laptop-class processor), your program completes in 0.26 seconds.

Marius13
Beginner

Hey Steve, thanks for the quick reply!

That could very well be the issue. I tried changing the Runtime Library setting (Project -> Fortran -> Libraries) in VS 2022 from Debug Multithread DLL to just Multithreaded, but this did not change the computation time much...


Do you have an idea what I could be doing wrong here, or how I get out of the "Debug configuration"? I basically did a fresh install for VS22 and the oneAPI so everything should be at the default settings.

Marius13
Beginner

OK, I figured it out: I just had to go to Build -> Configuration Manager and change the project from Debug to Release. Now the test code also takes only 0.3 seconds.

 

Thanks again for your help!

mecej4
Honored Contributor III

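In addition to the points discussed, note that your triple DO loops are in the wrong order for accessing memory. Fortran stores arrays in column-major order, so the first index (j here) should vary fastest, in the innermost loop. Changing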

 

   DO j=2,jmax-1,1
      DO l=2,lmax-1,1
         DO n=2,nmax-1,1

 

to

 

   DO n=2,nmax-1,1
      DO l=2,lmax-1,1
         DO j=2,jmax-1,1

 

in your third triple DO, and making corresponding changes to the three other nested loops, changed the run time with the current version of IFort from 0.130 s to 0.069 s on my PC (Ryzen 7 4800U, Windows 11 Pro). With LF7.1, the timings were 0.210 and 0.128 s.

In some situations, a compiler (armed with suitable optimization options) may generate machine code with reordered loops for efficient memory access. For an extended discussion of this topic, see, for example, this thread at Fortran Discourse.
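
The effect of loop order can be checked in isolation with a small program such as the one below. This is only a sketch, not the original test code: the program name, the array `a`, and the use of `system_clock` (wall-clock time) instead of `cpu_time` are choices made for illustration. The array dimensions match the ones used in this thread. As noted above, an optimizing compiler may interchange the loops on its own, in which case the two timings can come out close to each other.

```fortran
program loop_order_demo
    ! Illustrative sketch: sum the same 3-D array with two loop orders.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none

    integer, parameter :: jmax = 1000, lmax = 100, nmax = 100
    real, allocatable :: a(:,:,:)
    integer :: j, l, n
    integer(int64) :: t0, t1, rate
    real(real64) :: total

    allocate(a(jmax, lmax, nmax))
    a = 1.0

    ! Order 1: first index (j) outermost, last index (n) innermost.
    ! Consecutive inner iterations are jmax*lmax elements apart in memory.
    call system_clock(t0, rate)
    total = 0.0_real64
    do j = 1, jmax
        do l = 1, lmax
            do n = 1, nmax
                total = total + a(j, l, n)
            end do
        end do
    end do
    call system_clock(t1)
    print '(a, f10.5, a, f12.1)', 'j outermost: ', &
        real(t1 - t0, real64) / real(rate, real64), ' s, sum = ', total

    ! Order 2: first index (j) innermost, so memory is walked contiguously.
    call system_clock(t0)
    total = 0.0_real64
    do n = 1, nmax
        do l = 1, lmax
            do j = 1, jmax
                total = total + a(j, l, n)
            end do
        end do
    end do
    call system_clock(t1)
    print '(a, f10.5, a, f12.1)', 'j innermost: ', &
        real(t1 - t0, real64) / real(rate, real64), ' s, sum = ', total
end program loop_order_demo
```

Printing the sum alongside each timing keeps the loops from being optimized away, since the result is actually used.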

JohnNichols
Valued Contributor III

It helps if you put the code inside one of the code windows. The first pass through a loop is not necessarily representative; the second, third, and later passes may take different times. I added an outer loop that runs the test 100 times and removed the initialization to zero, which does not appear to do anything for the program.

program test

    !:::: Define main variables
    !-----------------------------------------------------------------------------------------------

    IMPLICIT NONE

    COMPLEX (KIND(0.0)), PARAMETER :: IM = (0.0, 1.0)
    REAL (KIND(0.0)), PARAMETER :: Pi = 3.141592653589793

    INTEGER :: jmax, lmax, nmax, j, l, n, i,k

    REAL (KIND(0.0)) :: Time1, Time2

    REAL (KIND(0.0)), DIMENSION(:,:,:), ALLOCATABLE :: M1, M2, M3
    
    do k = 1,100

        CALL cpu_time(Time1)

        jmax = 1000
        lmax = 100
        nmax = 100

        if(k .le. 1) then
            ALLOCATE(M1(jmax,lmax,nmax),M2(jmax,lmax,nmax),M3(jmax,lmax,nmax))
        endif

       ! M1 = 0.
       ! M2 = 0.
      !  M3 = 0.
        i = 0

        DO j=1,jmax,1
            DO l=1,lmax,1
                DO n=1,nmax,1
                    i = i + 1

                    M1(j,l,n) = l +IM*j

                END DO
            END DO
        END DO
        i = 0

        DO j=1,jmax,1
            DO l=1,lmax,1
                DO n=1,nmax,1
                    i = i + 1

                    M2(j,l,n) = j +IM*l

                END DO
            END DO
        END DO

        i = 0

        DO n=2,nmax-1,1
            DO l=2,lmax-1,1
                DO j=2,jmax-1,1

                    i = i + 1

                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)

                END DO
            END DO
        END DO

        DO j=2,jmax-1,1
            DO l=2,lmax-1,1
                DO n=2,nmax-1,1

                    i = i + 1

                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1))
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.33)
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.47)
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.87)

                END DO
            END DO
        END DO

        CALL cpu_time(Time2)

        write(*,10)(Time2-Time1), i
10      Format( "Finished after: ", f10.5,I10)

    end do

    end program test

This is 32-bit in Debug mode:

[Screenshot 2023-04-20 085847.png]

64-bit Release, ifx:

[Screenshot 2023-04-20 084745.png]

Second time:

[Screenshot 2023-04-20 084440.png]

One single run did 0.3e-2 seconds. 

 

JohnNichols
Valued Contributor III

Whenever the program runs, the Windows security scan flags it, every single time, which then triggers a 10-second check.

Ron_Green
Moderator

As @Steve_Lionel said, you have to make sure your configuration is Release and not Debug. Visual Studio uses Configurations to control things like optimization and debug settings. The Visual Studio default is the Debug configuration, so you will not get any optimization, and it also inserts runtime checks.
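
One way to confirm which settings a given executable was built with is to have the program report them itself. The sketch below uses the standard Fortran 2008 inquiry functions `compiler_version` and `compiler_options` from `iso_fortran_env`; the program name is made up for illustration, and the exact text of the options string is compiler-specific. With Intel Fortran on Windows, an unoptimized Debug build would be expected to list an option such as `/Od`.

```fortran
program show_build_settings
    ! Illustrative sketch: print the compiler and the options used to
    ! build this executable, to check Debug versus Release settings.
    use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
    implicit none

    print '(2a)', 'Compiler: ', compiler_version()
    print '(2a)', 'Options:  ', compiler_options()
end program show_build_settings
```

Adding these two print statements at the top of a timing test makes it harder to compare numbers from mismatched configurations by accident.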

Steve_Lionel
Honored Contributor III

Changing the run-time library type will have no effect on runtime (well, maybe a bit due to some additional checks during memory allocation). You don't need to go into the configuration manager - there's a control right on the toolbar:

[Screenshot: the configuration control on the Visual Studio toolbar]

Change this to Release and you're good to go.

JohnNichols
Valued Contributor III

Why do you think the 100-iteration loop slowed to about 0.25 seconds and then oscillated a bit?

mecej4
Honored Contributor III

There is something seriously amiss with the cut-down example code. The third set of nested DO loops has the single assignment statement repeated 11 times. Similarly, in the fourth set the same array element M3(j,l,n) is assigned to in four sequential statements, the only change being in one constant multiplier. In both cases, an optimizing compiler is going to discard all except the last assignment statement. Because of this, the example code is probably useless for timing purposes.
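
A timing test is more trustworthy when every store is needed and the result is actually consumed. The following is a minimal sketch of that idea, not a replacement for the original simulation: the program name, the repeat count `nrep`, the random input data, and the `+ real(k)` term (added so that each pass computes something different) are all assumptions made for illustration, and `system_clock` (wall-clock time) is used here in place of `cpu_time`.

```fortran
program timing_sketch
    ! Illustrative sketch: one assignment per element, several timed
    ! passes, and a checksum so the optimizer cannot discard the work.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none

    integer, parameter :: jmax = 1000, lmax = 100, nmax = 100
    integer, parameter :: nrep = 5          ! hypothetical repeat count
    real, allocatable :: m1(:,:,:), m2(:,:,:), m3(:,:,:)
    integer :: j, l, n, k
    integer(int64) :: t0, t1, rate
    real :: checksum

    allocate(m1(jmax,lmax,nmax), m2(jmax,lmax,nmax), m3(jmax,lmax,nmax))
    call random_number(m1)
    call random_number(m2)
    m3 = 0.0

    call system_clock(count_rate=rate)

    do k = 1, nrep
        call system_clock(t0)
        ! First index innermost, for contiguous memory access.
        do n = 2, nmax-1
            do l = 2, lmax-1
                do j = 2, jmax-1
                    ! A single store per element: nothing is overwritten
                    ! before it is read, so nothing can be dropped.
                    m3(j,l,n) = m1(j-1,l-1,n+1) * m2(j+1,l+1,n-1) + real(k)
                end do
            end do
        end do
        call system_clock(t1)

        ! Using the result outside the timed region keeps the loop live.
        checksum = sum(m3)
        print '(a, i2, a, f10.5, a, es14.6)', 'pass ', k, ': ', &
            real(t1 - t0, real64) / real(rate, real64), &
            ' s, checksum = ', checksum
    end do
end program timing_sketch
```

Reporting each pass separately also shows whether the first pass differs from later ones, which is the effect raised earlier in the thread.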
