Solved: possible issue with ifx

prop_design · ‎04-01-2023

hi,

i was creating a simple benchmark program and i ran into something really weird with ifx. i was getting up to 1.5 gflops using ifort. with ifx it drops way down to around .2 gflops. on other codes i wrote, the performance was about the same or a little better with ifx. so this is really odd.

c7.bat and c8.bat are the ifx compiler options i tried. c.bat is for sse2. the rest are avx-512. the a added to the name was to try some compiler options related to memory.

i attached the program. it's Fortran 77. however, i ran into the windows 2gb memory limit. there was odd performance behavior with the size of the matrix as well. 1000x1000 seems to work best. my cpu has a lot of variability. so i have it repeat the tests 5 times.

not sure if it matters, but my cpu is the intel core i5-1135g7. i have windows 10 home edition, which is up to date with patches etc.

there are a number of limitations with the program, due to my lack of programming skills. i have no doubt it could be made a lot better. it does what i need though. it also helps me to find the best compiler options.

anthony

Ron_Green · ‎07-21-2023

bug CMPLRLLVM-46530 is fixed in 2023.2.0


jimdempseyatthecove · ‎04-01-2023

ifx is not as mature as ifort. Many of the optimizations implemented in ifort have yet to be implemented in ifx.

The object files of ifort are compatible with the object files of ifx. This means that by using VTune, you can identify the lesser optimized code (under-performing code) in the ifx code, and then split your solution (or equivalent Linux assembly) into two libraries to be linked with an ifx main program. One of the libraries is to be built by ifx and the other by ifort. Note, the ifort compiled code cannot contain features provided only by ifx (!$omp target ...).

Jim Dempsey

JohnNichols · ‎04-01-2023

I installed ONEAPI including the HPC yesterday on a computer, and then this morning there is a new update?

It might be worth trying?

Seems strange.

prop_design · ‎04-01-2023

oh, i greatly screwed up the math and the calculation of gflops. so don't use the code as is for anything substantial. however, it still works to show the weird performance drop from ifort to ifx.

mecej4 · ‎04-02-2023

Here is a modified version of Anthony's test program, which may make it easier for someone to investigate why the EXE generated by ifort is so much faster than the one generated by ifx. The test program multiplies a lower triangular 2000 X 2000 matrix with a vector 5000 times, varying the nonzero values from one iteration to the next.

On a PC with an i7-10710U CPU, the former runs in 2.8 s, whereas the latter runs in 21.5 s; the ratio is 7.7. I used the current versions of the compilers (ifort 2021.8.0 and ifx 2023.0.0) with command line options /traceback /MD /Qxhost /fast .

I also tested a version (not shown here) in which the innermost DO loops were replaced by array assignment, and for that version both compilers produced EXEs that took about 20 seconds to run.

program fpubench
   implicit none
   integer, parameter :: maxsiz = 2000, nrep = 5000
   real*8 a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
   integer dinfo, i, j, mtype, nsiz, nl, irep, rl
   real*8 cputimesecs, finish, start, gflop, gflops
   character*12 :: stars = "(1x,80('*'))"

   mtype = 1
   dinfo = 1
   nsiz = maxsiz
   print stars
   write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
      ' times with Matrix size ', nsiz

! Begin Computational Kernel
   call cpu_time(start)
   do irep = 1, nrep
      do j = 1, nsiz
         b(j) = dble(j)
         do i = 1, nsiz
            if ( i < j ) then
               a(i,j) = 0.0d0
               c(i,j) = 0.0d0
            else
               a(i,j) = dble(2*i+3*j) + 0.005*irep
               c(i,j) = a(i,j)*b(j)
            endif
         enddo
      enddo
      !call check(nsiz, c)
   enddo
   call cpu_time(finish)
   call check(nsiz, c)
! End Computational Kernel

   cputimesecs = finish - start
   gflop = real(nsiz*nsiz,8)*nrep/1.0d9
   if ( cputimesecs/=0.0d0 ) then
      gflops = gflop/cputimesecs
   else
      gflops = -999.0
   endif
   write (*,*) 'lower triangular matrix'
   write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
   write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
   write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
   write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
   write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
   print stars
   write (*,*) 'fpu_bench has finished running'
end program

subroutine check(n,c)
   implicit none
   integer, intent(in) :: n
   real*8, intent(in) :: c(n,n)
   integer i,j,nrep
   nrep = 5000
   do j = 1, n
      do i = 1, n
         if(j > i)then
            if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
         else
            if(c(i,j) /= dble((2*i+3*j + 0.005*nrep)*j))print '(1x,2i5,es12.4)',i,j,c(i,j)
         endif
      enddo
   enddo
   return
end subroutine

JohnNichols · ‎04-02-2023

Trying the program on my DELL with an additional timing for the overall program.

program fpubench
    implicit none
    integer, parameter :: maxsiz = 2000, nrep = 5000
    real*8 a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
    integer dinfo, i, j, mtype, nsiz, nl, irep, rl
    real*8 cputimesecs, finish, start, gflop, gflops
    character*12 :: stars = "(1x,80('*'))"

    mtype = 1
    dinfo = 1
    nsiz = maxsiz
    print stars
    call DATTIM(0)
    write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
        ' times with Matrix size ', nsiz

    ! Begin Computational Kernel
    call cpu_time(start)
    do irep = 1, nrep

        !if(mod(irep,1000) .eq. 1) then
        !write(*,*)irep
        !endif
        do j = 1, nsiz
            b(j) = dble(j)
            do i = 1, nsiz
                if ( i < j ) then
                    a(i,j) = 0.0d0
                    c(i,j) = 0.0d0
                else
                    a(i,j) = dble(2*i+3*j) + 0.005*irep
                    c(i,j) = a(i,j)*b(j)
                endif
            enddo
        enddo
        !call check(nsiz, c)
    enddo
    call cpu_time(finish)
    call check(nsiz, c)
    ! End Computational Kernel

    cputimesecs = finish - start
    gflop = real(nsiz*nsiz,8)*nrep/1.0d9
    if ( cputimesecs/=0.0d0 ) then
        gflops = gflop/cputimesecs
    else
        gflops = -999.0
    endif
    write (*,*) 'lower triangular matrix'
    write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
    write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
    write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
    write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
    write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
    print stars
    write (*,*) 'fpu_bench has finished running'

    call DATTIM(1)
    end program

    subroutine check(n,c)
    implicit none
    integer, intent(in) :: n
    real*8, intent(in) :: c(n,n)
    integer i,j,nrep
    nrep = 5000
    do j = 1, n
        do i = 1, n
            if(j > i)then
                if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
            else
                if(c(i,j) /= dble((2*i+3*j + 0.005*nrep)*j))print '(1x,2i5,es12.4)',i,j,c(i,j)
            endif
        enddo
    enddo
    return
    end subroutine



    !      ****************************************************************
    !
    SUBROUTINE DATTIM(I)
    !
    !      ***************************************************************

    Implicit none
    CHARACTER CH

    INTEGER*2 IHR,IMIN,ISEC,I100TH,ISEC1,I100TH1
    INTEGER*2 IMIN_OLD,ISEC_OLD,I100TH_OLD
    INTEGER*2 IMIN_TOT,ISEC_TOT,I100TH_TOT
    INTEGER I

    COMMON /TIMER1/ IMIN_OLD,ISEC_OLD,I100TH_OLD

    CH=CHAR(32)

    CALL GETTIM (IHR,IMIN,ISEC,I100TH)
    IF(I .EQ. 0) THEN
        IMIN_OLD = IMIN
        ISEC_OLD = ISEC
        I100TH_OLD = I100TH
    ENDIF

    IF(I .EQ. 1) THEN

        IF(IMIN .LT. IMIN_OLD) THEN
            IMIN = IMIN+60
            IHR = IHR - 1
        ENDIF

        IMIN_TOT = IMIN - IMIN_OLD

        IF(I100TH .LT. I100TH_OLD) THEN
            I100TH1 = I100TH+100
            ISEC = ISEC - 1
        ELSE
            I100TH1 = I100TH
        ENDIF

        I100TH_TOT = I100TH1 - I100TH_OLD

        IF(ISEC .LT. ISEC_OLD) THEN
            ISEC1 = ISEC+60
            IMIN_TOT = IMIN_TOT - 1
        ELSE
            ISEC1 = ISEC
        ENDIF

        ISEC_TOT = ISEC1 - ISEC_OLD

        WRITE(*,'(1X,1A,A15\)')'Elapsed Time : '
        WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN_TOT,ISEC_TOT,I100TH_TOT
        WRITE(*,'(1X,A15\)')'Start   Time : '
        WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN_OLD,ISEC_OLD,I100TH_OLD
        WRITE(*,'(1X,A15\)')'Finish  Time : '
        WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN,ISEC,I100TH
        WRITE(*,'(1X,A15\)')'Ratio   Time : '
    ENDIF

    RETURN
    END

After adding an old timer routine, I ran it on a Dell with an 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz.

The Core i7-10710U has a PassMark score of 9,848.

The Core i7-11850H has a PassMark score of 20,867.

I am running the 64-bit Release build on Windows 11 Preview. I got one run at 3.9 seconds, but it had a high overall time, so I discarded it.

I am running 2023 Fortran

IFX runs in 22+ seconds.

I wonder if W11 is the difference; your machine is a bit faster.

Running your program on IFORT is 4.5 seconds.

mecej4 · ‎04-04-2023

For some reason, the forum will not let me edit my previous post. Please change j++ in line 65 to j+ .

For a couple of days now, I have noticed that the code display uses interspersed fonts of more than one size, and that the line number display has a spacing slightly larger than the spacing between lines of code, all of which make life more exciting than necessary.

JohnNichols · ‎04-04-2023

Seemed appropriate for this group.

mecej4 · ‎04-02-2023

The timings shown by JohnNichols confirm that for this program the IFX compiled EXE takes many times (~5) longer than the EXE compiled with IFORT.

JohnNichols · ‎04-03-2023

Are you running on LINUX? This might explain the difference between your results and mine.

The interesting thing to explain to master's students is that experiments need to be repeated; there is no point doing an "A Bridge Too Far" exercise only to find out the 82nd stuffed it at Nijmegen and you are hanging out in enemy land.

I define enemy land as any reviewer with a chip on their shoulder, i.e., all reviewers.

mecej4 · ‎04-03-2023

No, Windows 11-64 Home.

JohnNichols · ‎04-03-2023

I ran it in the CMD Window for IFORT 64 bit, it ran 0.3 seconds slower on average.

How did you get to 2.8 seconds?

prop_design · ‎04-03-2023

hi,

just a reminder that the code i posted was to highlight an oddity with ifx. I WOULD NOT USE IT TO ACTUALLY BENCHMARK ANYTHING. i spent another few days on it. however, it's a lost cause. i don't know if any of the mods people did made it so you could use it. the code i posted and the edits i made definitely are not making any sense.

i thought it was going to be fairly simple to make something to measure peak GFLOPS. however, that didn't turn out to be the case. so i'm not going to worry about it. it's not worth the trouble.

anthony

JohnNichols · ‎04-04-2023

@prop_design

This forum answers interesting questions, but like any group it can be interested in the elements of the problem that are of no interest to the original question. Your stuff has morphed, it happens, just watch the interest, you can learn a lot, I know I learn more from these guys than anywhere else.

Now I am interested, vaguely, in why @mecej4's computer solves it in 2.8 seconds and I am about double that on average. Based on the theoretical stats it should not be that way, which means an interesting problem of about Winnie the Pooh level of importance.

JMN

mecej4 · ‎04-03-2023

Nothing special. I compile using

ifort /fast /Qxhost /MD fpubench.f90

I run the resulting EXE once or twice so that the runtime DLLs are located and loaded. I time the next run:

S:\FPUBEN>c:\dos\timethis fpubench.exe
 ********************************************************************************
 Running 5000 times with Matrix size 2000
 lower triangular matrix
 a =     30.0     b =     1.00     c =     30.0
 a =    0.403E+04 b =     1.00     c =    0.403E+04
 total number of floating point calcs (gflop):     20.000
 total cpu time (seconds);      2.812
 floating point ops per sec (gflops);      7.111
 ********************************************************************************
 fpu_bench has finished running
NSNUC-i7V10 Elapsed time:    2.883 s
Command : fpubench.exe
  S: Mon Apr  3 16:53:50 2023
  F: Mon Apr  3 16:53:53 2023

Note that the timing measurement inside the program itself (using CPU_TIME) is not very accurate; here it is 2.812 seconds. The last four lines of the output are printed by the "timethis" utility program, which uses the Windows QueryPerformanceCounter API. The time reported is about 0.07 second higher, because it includes program startup and termination times.
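For anyone who wants wall-clock timing from inside the Fortran program rather than from an external utility, here is a minimal sketch using the standard SYSTEM_CLOCK intrinsic with a 64-bit counter, next to CPU_TIME for comparison. The program name and the placeholder workload are made up for illustration; the clock resolution depends on the compiler and OS, so treat the numbers as indicative only.

```fortran
program wallclock_demo
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer(int64) :: t0, t1, rate
   real(real64)   :: cpu0, cpu1, wall, s
   integer        :: i

   ! Ask for the tick rate with a 64-bit integer; most compilers then
   ! select a high-resolution counter (an assumption, check your docs).
   call system_clock(count_rate=rate)
   call system_clock(t0)
   call cpu_time(cpu0)

   ! Placeholder workload (hypothetical): replace with the kernel to be timed.
   s = 0.0_real64
   do i = 1, 50000000
      s = s + sqrt(real(i, real64))
   end do

   call cpu_time(cpu1)
   call system_clock(t1)
   wall = real(t1 - t0, real64) / real(rate, real64)

   ! Print the checksum so the compiler cannot discard the workload.
   write (*,'(1x,a,es12.4)') 'checksum      : ', s
   write (*,'(1x,a,f10.3)')  'cpu time  (s) : ', cpu1 - cpu0
   write (*,'(1x,a,f10.3)')  'wall time (s) : ', wall
end program wallclock_demo
```

In a single-threaded run the two times should be close. With an option such as /Qparallel, CPU_TIME typically reports time summed over threads, so it can exceed the wall time, which is one more reason to report elapsed time for benchmarks.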

I tried with the additional option /Qparallel, and got an elapsed time of 1.2 s. My PC is set to run on the Balanced power plan. I have turned off a few utilities that keep checking the web for updates, such as Adobe, Google, the Intel DSA tool, etc.

prop_design · ‎04-04-2023

@mecej4

yeah, the timing is a huge problem. i made a different thread about that. can you post how to do an accurate timing, either to this thread or the other. i'm not seeing how you did that in the code you posted earlier.

@JohnNichols

understood, i just didn't want anyone thinking that the code I posted worked. it doesn't. i can't speak for anyone's mods though. i've tried everything i can but haven't got the correct results, so far. i think it's a time measurement issue. i had made a different thread about that. if i do ever get it working, i can post an update. right now, the version of the code i currently have is off roughly 5x. i no longer have the original code i posted. but i'm sure it's off too. just not sure by how much. i changed the math to be like what i intended. the intent was to maximize the fpu operation. on my processor that's supposed to be 32 flop per cycle. when i do that, i'm 5x lower than i should be. i'm sure the processor is right. other benchmarks report the correct number.

ps

the intel forum is terrible. as i try to type this reply, it keeps jumping and scrolling to odd locations. it's **bleep** near impossible to use.

mecej4 · ‎04-04-2023

@prop_design wrote:

@mecej4

yeah, the timing is a huge problem. i made a different thread about that. can you post how to do an accurate timing, either to this thread or the other. i'm not seeing how you did that in the code you posted earlier.

Here is the source for a utility program, "timethis", adapted from source code published on MS web pages over a decade ago. Compile and link with one of the following commands:

icl /O2 /MD timethis.c ws2_32.lib kernel32.lib
cl /O2 /MD timethis.c ws2_32.lib kernel32.lib

If the program that you wish to time takes arguments, wrap quotes around the command line, as in:

timethis "MyProg.exe arg1 arg2"
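For readers who would rather stay in Fortran, a rough analogue of such a wrapper can be sketched with the standard SYSTEM_CLOCK and EXECUTE_COMMAND_LINE intrinsics. This is not the timethis source and does not call QueryPerformanceCounter directly; the program name timecmd is hypothetical, and the measured time also includes the cost of starting the command shell.

```fortran
program timecmd
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   character(len=1024) :: cmd
   integer(int64)      :: t0, t1, rate
   integer             :: stat, exitcode, cstat

   ! The whole command line to be timed is passed as one quoted argument.
   call get_command_argument(1, cmd, status=stat)
   if (stat /= 0 .or. len_trim(cmd) == 0) then
      write (*,'(a)') 'usage: timecmd "command with arguments"'
      stop 1
   end if

   call system_clock(t0, rate)
   ! Run the command and wait for it to finish.
   call execute_command_line(trim(cmd), wait=.true., &
                             exitstat=exitcode, cmdstat=cstat)
   call system_clock(t1)

   if (cstat /= 0) then
      write (*,'(1x,a)') 'The command could not be executed.'
      stop 2
   end if
   write (*,'(1x,a,f10.3,a)') 'Elapsed time: ', &
      real(t1 - t0, real64) / real(rate, real64), ' s'
   write (*,'(1x,a,i0)') 'Exit status : ', exitcode
end program timecmd
```

Usage would mirror the example above, e.g. timecmd "MyProg.exe arg1 arg2".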

prop_design · ‎04-04-2023

@mecej4 thanks for sharing what you did. unfortunately, that's way over my head. i tried to study it some and ended up finding an old forum post on here. i was able to use that to get the time thing working. however, it still didn't fix the problem i'm having at the moment. so i'll have to look for other possible problems. in any event, i appreciate the help. i'll post what i did on the other thread, as far as the time hack for fortran to windows.

Ron_Green · ‎04-04-2023

Ran some tests

2023.0.0 package compilers: ifort 2021.8.0, ifx 2023.0.0

Base options for both:

-O2 -xhost -align array64byte

ifort: 3.1sec

ifx: 21.6sec. !woof

Switching to the 2023.1.0 compilers: ifort 2021.9.0, ifx 2023.1.0

ifort: 3.1sec

ifx: 21.0sec !still woof

I agree 7x slower is unacceptable. If I had to guess, and I'll confirm this, IFX is not creating efficient masked vector operations for the IF inside the loop. Off the top of my head, I can't think of anything else that would cause such a discrepancy.

I'll open a bug report
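If the IF inside the inner loop is indeed the culprit, one way to probe that guess is to split the inner loop at the diagonal so that neither half contains a branch. The following is only a sketch of that idea with small, made-up sizes (n = 500, nrep = 10); it has not been timed with either compiler here, so whether it sidesteps the ifx slowdown is an open question.

```fortran
program split_loop_kernel
   implicit none
   integer, parameter :: n = 500, nrep = 10   ! hypothetical small sizes
   double precision, allocatable :: a(:,:), b(:), c(:,:)
   integer :: i, j, irep

   allocate (a(n,n), b(n), c(n,n))

   do irep = 1, nrep
      do j = 1, n
         b(j) = dble(j)
         ! Rows above the diagonal (i < j): plain zero fill, no branch.
         do i = 1, j-1
            a(i,j) = 0.0d0
            c(i,j) = 0.0d0
         end do
         ! Diagonal and below (i >= j): same arithmetic as the original
         ! ELSE branch, again without a branch in the loop body.
         do i = j, n
            a(i,j) = dble(2*i+3*j) + 0.005*irep
            c(i,j) = a(i,j)*b(j)
         end do
      end do
   end do

   ! Print one element so the computation is not optimized away.
   write (*,'(1x,a,g12.5)') 'c(n,1) = ', c(n,1)
end program split_loop_kernel
```

Both inner loops walk down a column with unit stride, so each should be a straightforward candidate for vectorization without masking.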

mecej4 · ‎04-04-2023

Thanks, Ron. As part of investigating this performance bug, it may be worthwhile to investigate a related performance issue, i.e., why replacing the innermost DO loop of the computational kernel with array assignments leads to slowing down by a factor of 5 to 6 (with Ifort as well as with Ifx), as I alluded to earlier. Here is the array-assignment version of the same test code. Thanks.

program fpubench
   implicit none
   integer, parameter :: maxsiz = 2000, nrep = 5000, dp = kind(0.0d0)
   double precision a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
   integer i, j, nsiz, irep
   double precision cputimesecs, finish, start, gflop, gflops
   character*12 :: stars = "(1x,80('*'))"

   nsiz = maxsiz
   print stars
   write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
      ' times with Matrix size ', nsiz

! Begin Computational Kernel
   call cpu_time(start)
   do irep = 1, nrep
      do j = 1, nsiz            ! innermost DO loop replaced by array assignments
         b(j) = dble(j)
         a(1:j-1,j)  = 0.0d0
         c(1:j-1,j)  = 0.0d0
         a(j:nsiz,j) = [(dble(2*i+3*j),i=j,nsiz)] + 0.005*irep
         c(j:nsiz,j) = a(j:nsiz,j)*b(j)
      enddo
      !call check(nsiz, c)
   enddo
   call cpu_time(finish)
   call check(nsiz, c)
! End Computational Kernel

   cputimesecs = finish - start
   gflop = real(nsiz*nsiz,dp)*nrep/1.0d9
   if ( cputimesecs/=0.0d0 ) then
      gflops = gflop/cputimesecs
   else
      gflops = -999.0
   endif
   write (*,*) 'lower triangular matrix'
   write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
   write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
   write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
   write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
   write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
   print stars
   write (*,*) 'fpu_bench has finished running'
end program

subroutine check(n,c)
   implicit none
   integer, intent(in) :: n
   double precision, intent(in) :: c(n,n)
   integer i,j,nrep
   nrep = 5000
   do j = 1, n
      do i = 1, n
         if(j > i)then
            if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
         else
            if(c(i,j) /= dble((2*i+3*j+0.005*nrep)*j))then
               print '(1x,2i5,es12.4)',i,j,c(i,j)
            endif
         endif
      enddo
   enddo
   return
end subroutine

prop_design · ‎04-05-2023

So this thread took a bit of a tangent. In that regard, I'm attaching what I believe is a working version of the fpu benchmarking program that I created. I didn't test it as far as the ifx slowdown issue. I think the original code I posted shows that well. This update is meant to show the correct GFLOPS numbers. You may have to tweak some of the internal variables and spreadsheet calculations, if your cpu is substantially different than mine. However, I think this will work for a lot of processors that support AVX-512.

Update:

I did some quick tests comparing compilers and this was the result. On my actual codes, I don't see this type of slowdown. So I was surprised to see this on such a simple code.