Question regarding Compiler Speed compared to Lahey Fortran

Marius13 · ‎04-20-2023

Hello,
First I should mention that I am very new to the Intel Fortran Compiler, so there may be some obvious things that I am not doing or taking into account.

I was developing a simulation code with the Lahey Fujitsu Fortran Compiler, but because computation times have gotten quite long, I decided to switch to a compiler that supports OpenMP and chose the Intel Fortran Compiler with Visual Studio 2022.
I managed to translate most of my code, but noticed that the computation time was much longer than with the Lahey compiler (before any parallelization).

I decided to test this with an example code and got a similar result. The test code does some arbitrary array multiplications and is attached at the end of this post.

With the Lahey Fujitsu compiler the test code took 0.5 seconds, and with the Intel Fortran compiler it took 7.9 seconds. Is there something fundamental I am doing wrong, or does the Lahey compiler just have better optimization for array multiplication?

-------------------- Test Code

program test

!:::: Define main variables
!-----------------------------------------------------------------------------------------------

IMPLICIT NONE

COMPLEX (KIND(0.0)), PARAMETER :: IM = (0.0, 1.0)
REAL (KIND(0.0)), PARAMETER :: Pi = 3.141592653589793

INTEGER :: jmax, lmax, nmax, j, l, n, i

REAL (KIND(0.0)) :: Time1, Time2

REAL (KIND(0.0)), DIMENSION(:,:,:), ALLOCATABLE :: M1, M2, M3

CALL cpu_time(Time1)

jmax = 1000
lmax = 100
nmax = 100

ALLOCATE(M1(jmax,lmax,nmax),M2(jmax,lmax,nmax),M3(jmax,lmax,nmax))

M1 = 0.
M2 = 0.
M3 = 0.
i = 0

DO j=1,jmax,1
DO l=1,lmax,1
DO n=1,nmax,1
i = i + 1

M1(j,l,n) = l +IM*j

END DO
END DO
END DO
i = 0

DO j=1,jmax,1
DO l=1,lmax,1
DO n=1,nmax,1
i = i + 1

M2(j,l,n) = j +IM*l

END DO
END DO
END DO

i = 0

DO j=2,jmax-1,1
DO l=2,lmax-1,1
DO n=2,nmax-1,1

i = i + 1

M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)

END DO
END DO
END DO

DO j=2,jmax-1,1
DO l=2,lmax-1,1
DO n=2,nmax-1,1

i = i + 1

M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1))
M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.33)
M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.47)
M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.87)

END DO
END DO
END DO

CALL cpu_time(Time2)

Print *, "Finished after: ", (Time2-Time1), i

PAUSE

end program test

Steve_Lionel · ‎04-20-2023

Is it possible that you built this using a Debug configuration, which adds many runtime checks and disables optimizations? When I build and run this using default optimization using Intel Fortran on my Intel NUC system (laptop-class processor), your program completes in 0.26 seconds.

Marius13 · ‎04-20-2023

Hey Steve, thanks for the quick reply!

That could very well be the issue. I tried changing the Runtime Library setting (Project -> Fortran -> Libraries) in VS 2022 from Debug Multithread DLL to just Multithreaded, but this did not change the computation time much...

Do you have an idea what I could be doing wrong here, or how I get out of the "Debug configuration"? I basically did a fresh install of VS 2022 and oneAPI, so everything should be at the default settings.

Marius13 · ‎04-20-2023

OK, I figured it out: I just had to go to Build -> Configuration Manager and change the project from Debug to Release. Now the test code also takes only 0.3 seconds.

Thanks again for your help!

mecej4 · ‎04-20-2023

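In addition to the points discussed, note that your triple DO loops are in the wrong order for accessing memory. Fortran stores arrays in column-major order, so the first index (j here) should vary fastest, i.e. in the innermost loop. Changing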

   DO j=2,jmax-1,1
      DO l=2,lmax-1,1
         DO n=2,nmax-1,1

to

   DO n=2,nmax-1,1
      DO l=2,lmax-1,1
         DO j=2,jmax-1,1

in your third triple DO, and making corresponding changes to the three other nested loops, changed the run time with the current version of IFort from 0.130 s to 0.069 s on my PC (Ryzen 7 4800U, Windows 11 Pro). With LF7.1, the timings were 0.210 and 0.128 s.

In some situations, a compiler (armed with suitable optimization options) may generate machine code with reordered loops for efficient memory access. For an extended discussion of this topic, see for example this thread at Fortran Discourse.
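The effect of loop order can be checked in isolation with a small program. The following is only a sketch, not code from this thread: the program name `loop_order`, the array `a`, and the values assigned are made up for illustration, and the array shape is borrowed from the test code. As noted above, an optimizing compiler may interchange the loops on its own, in which case the two timings can come out nearly equal.

```fortran
program loop_order
    ! Sketch only: times the same assignment with two loop orders.
    ! Names and values are illustrative, not taken from the thread.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none

    integer, parameter :: jmax = 1000, lmax = 100, nmax = 100
    real, allocatable  :: a(:,:,:)
    integer            :: j, l, n
    integer(int64)     :: t0, t1, rate

    allocate(a(jmax, lmax, nmax))
    a = 0.0                            ! touch the memory before timing
    call system_clock(count_rate=rate)

    ! Order used in the original post: j outermost, n innermost.
    ! Successive inner iterations are jmax*lmax elements apart in memory.
    call system_clock(t0)
    do j = 1, jmax
        do l = 1, lmax
            do n = 1, nmax
                a(j, l, n) = real(j + l + n)
            end do
        end do
    end do
    call system_clock(t1)
    ! The checksum is printed only so that the stores cannot be discarded.
    print '(a, f8.4, a, es12.5)', 'j outer, n inner: ', &
        real(t1 - t0, real64) / real(rate, real64), ' s, checksum ', sum(a)

    ! Reordered: n outermost, j innermost.
    ! Successive inner iterations touch adjacent elements (column-major).
    call system_clock(t0)
    do n = 1, nmax
        do l = 1, lmax
            do j = 1, jmax
                a(j, l, n) = real(j + l + n) + 1.0
            end do
        end do
    end do
    call system_clock(t1)
    print '(a, f8.4, a, es12.5)', 'n outer, j inner: ', &
        real(t1 - t0, real64) / real(rate, real64), ' s, checksum ', sum(a)
end program loop_order
```

`system_clock` measures wall-clock time here, whereas the test code uses `cpu_time`; for a single-threaded run either should do.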

JohnNichols · ‎04-20-2023

It helps if you post the code inside one of the forum's code windows. The first pass through a loop is not necessarily representative; the second, third, and later passes may take a different time. I added an outer loop that repeats the test 100 times and removed the initialization to zero, which does not appear to do anything for the program.

program test

    !:::: Define main variables
    !-----------------------------------------------------------------------------------------------

    IMPLICIT NONE

    COMPLEX (KIND(0.0)), PARAMETER :: IM = (0.0, 1.0)
    REAL (KIND(0.0)), PARAMETER :: Pi = 3.141592653589793

    INTEGER :: jmax, lmax, nmax, j, l, n, i,k

    REAL (KIND(0.0)) :: Time1, Time2

    REAL (KIND(0.0)), DIMENSION(:,:,:), ALLOCATABLE :: M1, M2, M3
    
    do k = 1,100

        CALL cpu_time(Time1)

        jmax = 1000
        lmax = 100
        nmax = 100

        if(k .le. 1) then
            ALLOCATE(M1(jmax,lmax,nmax),M2(jmax,lmax,nmax),M3(jmax,lmax,nmax))
        endif

        ! M1 = 0.
        ! M2 = 0.
        ! M3 = 0.
        i = 0

        DO j=1,jmax,1
            DO l=1,lmax,1
                DO n=1,nmax,1
                    i = i + 1

                    M1(j,l,n) = l +IM*j

                END DO
            END DO
        END DO
        i = 0

        DO j=1,jmax,1
            DO l=1,lmax,1
                DO n=1,nmax,1
                    i = i + 1

                    M2(j,l,n) = j +IM*l

                END DO
            END DO
        END DO

        i = 0

        DO n=2,nmax-1,1
            DO l=2,lmax-1,1
                DO j=2,jmax-1,1

                    i = i + 1

                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)
                    M3(j,l,n) = M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)

                END DO
            END DO
        END DO

        DO j=2,jmax-1,1
            DO l=2,lmax-1,1
                DO n=2,nmax-1,1

                    i = i + 1

                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1))
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.33)
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.47)
                    M3(j,l,n) = M3(j,l,n)/(M1(j-1,l-1,n+1)*M2(j+1,l+1,n-1)*0.87)

                END DO
            END DO
        END DO

        CALL cpu_time(Time2)

        write(*,10)(Time2-Time1), i
10      Format( "Finished after: ", f10.5,I10)

    end do

    end program test

This is 32 bit in debug mode.

64 bit release ifx

Second time

A single run took 0.3e-2 seconds.

JohnNichols · ‎04-20-2023

Every time the program runs, the Windows security scan flags it, which triggers a 10-second check.

Ron_Green · ‎04-20-2023

As @Steve_Lionel said, you have to make sure your configuration is Release and not Debug. Visual Studio uses Configurations to control things like optimization and debug settings. The Visual Studio default is the Debug configuration, so you get no optimization, and the build also inserts runtime checks.
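One way to see which settings a given executable was actually built with is to have the program report them itself. This is a minimal sketch, assuming a compiler that supports the Fortran 2008 inquiry functions `compiler_version` and `compiler_options` from `iso_fortran_env`; the program name `show_build` is made up, and the exact text returned is compiler-dependent.

```fortran
program show_build
    ! Sketch only: prints the compiler and the options used for this build.
    ! The returned strings are processor-dependent (Fortran 2008).
    use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
    implicit none

    print '(a)', 'Compiler: ' // compiler_version()
    print '(a)', 'Options:  ' // compiler_options()
end program show_build
```

With Intel Fortran under Visual Studio, a Debug build would be expected to list an option that disables optimization (such as /Od) along with runtime-check options, while a Release build should show an optimization level instead.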

Steve_Lionel · ‎04-20-2023

Changing the run-time library type will have no effect on runtime (well, maybe a bit due to some additional checks during memory allocation). You don't need to go into the configuration manager - there's a control right on the toolbar:

Change this to Release and you're good to go.

JohnNichols · ‎04-21-2023

Why do you think the 100-iteration loop slowed down to about 0.25 seconds per pass and then oscillated a bit?

mecej4 · ‎04-21-2023

There is something seriously amiss with the cut-down example code. The third set of nested DO loops has the single assignment statement repeated 11 times. Similarly, in the fourth set the same array element M3(j,l,n) is assigned to in four sequential statements, the only change being in one constant multiplier. In both cases, an optimizing compiler is going to discard all except the last assignment statement. Because of this, the example code is probably useless for timing purposes.
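A timing loop avoids this problem when every statement depends on the result of the previous one and the final result is actually used. The following is only a sketch under assumptions, not the original simulation: the program name `live_stores`, the update formulas, and the reuse of the constants 0.33 and 0.47 are illustrative. Array sizes follow the test code, and the loops use the memory-friendly order suggested earlier in the thread.

```fortran
program live_stores
    ! Sketch only: a timing kernel in which no assignment is a dead store.
    ! Formulas are illustrative, not taken from the original simulation.
    implicit none

    integer, parameter :: jmax = 1000, lmax = 100, nmax = 100
    real, allocatable  :: m1(:,:,:), m2(:,:,:), m3(:,:,:)
    integer            :: j, l, n
    real               :: time1, time2

    allocate(m1(jmax,lmax,nmax), m2(jmax,lmax,nmax), m3(jmax,lmax,nmax))

    ! In the test code, assigning l + IM*j to a REAL array keeps only the
    ! real part, so M1 effectively holds l and M2 holds j.
    do n = 1, nmax
        do l = 1, lmax
            do j = 1, jmax
                m1(j,l,n) = real(l)
                m2(j,l,n) = real(j)
            end do
        end do
    end do
    m3 = 0.0

    call cpu_time(time1)
    do n = 2, nmax - 1
        do l = 2, lmax - 1
            do j = 2, jmax - 1
                ! Each statement reads the value written by the one before,
                ! so the compiler cannot drop any of them.
                m3(j,l,n) = m1(j-1,l-1,n+1) * m2(j+1,l+1,n-1)
                m3(j,l,n) = m3(j,l,n) + 0.33 * m1(j,l,n)
                m3(j,l,n) = m3(j,l,n) + 0.47 * m2(j,l,n)
            end do
        end do
    end do
    call cpu_time(time2)

    ! Printing a value derived from m3 keeps the whole computation live.
    print *, 'Finished after: ', time2 - time1, ' checksum: ', sum(m3)
end program live_stores
```

The checksum is summed in single precision, so it is only a rough figure; its purpose is to stop the optimizer from removing the loops, not to validate the arithmetic.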
