ifort 18.0.2 generates 3x slower code compared to 14.0.1 on a Xeon E3-1240 v3

gn164 · 09-29-2018

Hi,

The following piece of code (extracted from a real application and reduced to reproduce the issue with a minimal amount of code) runs approximately 3x slower when compiled with ifort 18.0.2 than with 14.0.1.

      program cmp

         parameter (NITER = 1000000)
         parameter (NDIM = 100)

         integer skeys(NDIM)
         integer pkeys(NDIM)
         integer i, cmpcnt

         skeys = 0
         pkeys = 1
         cmpcnt = 0

         do i = 1 , NITER
            ikey = 1
            icnt = 1
            do while (ikey .le. NDIM .and. icnt .le. NDIM)
               if (skeys(icnt) .gt. pkeys(ikey)) then
                  ikey = ikey + 1
               else
                  icnt = icnt + 1
               endif
            enddo
            cmpcnt  = cmpcnt + ikey / icnt
         enddo

         print *, cmpcnt

      end program

The same flags were used to compile with both versions:

-qopt-report-file=vecreport -qopt-report=5 -O3

The assembly generated for the inner loop in the two cases:

14.0.1 (fast)

..B1.7:                         # Preds ..B1.8 ..B1.6 ..B1.10
..LN41:
   .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %edx              #25.18
..LN42:
        cmpl      -4+cmp_$PKEYS.0.1(,%rax,4), %edx              #25.30
..LN43:
        jle       ..B1.10       # Prob 50%                      #25.30
..LN44:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
..LN45:
   .loc    1  26  is_stmt 1
        incq      %rax                                          #26.17
..LN46:
   .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN47:
        jle       ..B1.7        # Prob 99%                      #24.27
..LN48:
        jmp       ..B1.12       # Prob 100%                     #24.27
..LN49:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.10:                        # Preds ..B1.7
..LN50:
   .loc    1  28  is_stmt 1
        incq      %rdi                                          #28.16
..LN51:
   .loc    1  24  is_stmt 1
        cmpq      $100, %rdi                                    #24.48
..LN52:
        jle       ..B1.7        # Prob 99%                      #24.48
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.12:                        # Preds ..B1.10 ..B1.8

18.0.2 (slow)

..B1.7:                         # Preds ..B1.6 ..B1.8
                                # Execution count [4.97e+07]
..LN44:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %r8d              #25.14
..LN45:
        .loc    1  26  is_stmt 1
        lea       1(%rax), %rdx                                 #26.17
..LN46:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$PKEYS.0.1(,%rax,4), %r9d              #25.14
..LN47:
        .loc    1  26  is_stmt 1
        cmpl      %r9d, %r8d                                    #26.17
..LN48:
        .loc    1  28  is_stmt 1
        lea       1(%rdi), %r10                                 #28.16
..LN49:
        .loc    1  26  is_stmt 1
        cmovg     %rdx, %rax                                    #26.17
..LN50:
        .loc    1  28  is_stmt 1
        cmovle    %r10, %rdi                                    #28.16
..LN51:
        .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN52:
        jg        ..B1.10       # Prob 1%                       #24.27
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
                                # Execution count [4.93e+07]
..L13:
..LN54:
..LN55:
        cmpq      $100, %rdi                                    #24.48
..LN56:
        jle       ..B1.7        # Prob 99%                      #24.48
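
In short, 14.0.1 turns the if/else into two conditional branches, while 18.0.2 computes both ikey + 1 and icnt + 1 with lea and picks one of them with cmovg/cmovle. With the selects, the addresses of the next two loads depend on the result of the previous compare, which may be what lengthens each iteration. The sketch below is only a source-level analogue of the two shapes, assuming the intrinsic merge as a stand-in for the select. The names scan_shapes, scan_branch and scan_select are made up for illustration, and the compiler remains free to emit branches or cmov for either routine, so check the generated assembly before drawing conclusions from the timings.

```fortran
module scan_shapes
   implicit none
contains

   ! Branching shape: same control flow as the original inner loop.
   subroutine scan_branch(skeys, pkeys, ndim, ikey, icnt)
      integer, intent(in)  :: ndim
      integer, intent(in)  :: skeys(ndim), pkeys(ndim)
      integer, intent(out) :: ikey, icnt
      ikey = 1
      icnt = 1
      do while (ikey .le. ndim .and. icnt .le. ndim)
         if (skeys(icnt) .gt. pkeys(ikey)) then
            ikey = ikey + 1
         else
            icnt = icnt + 1
         endif
      enddo
   end subroutine scan_branch

   ! Select shape: both candidate values are formed and one is chosen,
   ! mirroring the lea/cmov sequence in the 18.0.2 listing.
   subroutine scan_select(skeys, pkeys, ndim, ikey, icnt)
      integer, intent(in)  :: ndim
      integer, intent(in)  :: skeys(ndim), pkeys(ndim)
      integer, intent(out) :: ikey, icnt
      logical :: adv
      ikey = 1
      icnt = 1
      do while (ikey .le. ndim .and. icnt .le. ndim)
         adv  = skeys(icnt) .gt. pkeys(ikey)
         ikey = merge(ikey + 1, ikey, adv)
         icnt = merge(icnt, icnt + 1, adv)
      enddo
   end subroutine scan_select

end module scan_shapes

program compare_shapes
   use scan_shapes
   use iso_fortran_env, only: int64
   implicit none
   integer, parameter :: ndim = 100, niter = 1000000
   integer :: skeys(ndim), pkeys(ndim)
   integer :: i, ikey, icnt, cmpcnt
   integer(int64) :: t0, t1, rate

   ! Same initialization as in the reproducer.
   skeys = 0
   pkeys = 1

   call system_clock(t0, rate)
   cmpcnt = 0
   do i = 1, niter
      call scan_branch(skeys, pkeys, ndim, ikey, icnt)
      cmpcnt = cmpcnt + ikey / icnt
   enddo
   call system_clock(t1)
   print *, 'branch:', cmpcnt, real(t1 - t0) / real(rate), ' s'

   call system_clock(t0)
   cmpcnt = 0
   do i = 1, niter
      call scan_select(skeys, pkeys, ndim, ikey, icnt)
      cmpcnt = cmpcnt + ikey / icnt
   enddo
   call system_clock(t1)
   print *, 'select:', cmpcnt, real(t1 - t0) / real(rate), ' s'
end program compare_shapes
```

Both loops should report the same cmpcnt; only the elapsed times are of interest, and they will vary with compiler version, flags and CPU.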

gn164 · 10-16-2018

Hi Jim,

Thanks, I will try this out. What would be the most likely reason for the overhead caused by the extra branch in #17?

Is it misprediction? Given the way pkeys and skeys are initialized, this branch is never taken, so I would not expect the performance to be too bad in this case.

jimdempseyatthecove · 10-16-2018

Re: why: use VTune with a sufficiently large NITER to produce the counts (note: use hardware sampling as opposed to statistical sampling).

As I am known to say, "Almost every algorithm can be improved upon"...

program cmp

   parameter (NITER = 1000000)
   parameter (NDIM = 100)

   integer skeys(NDIM+1)
   integer pkeys(NDIM+1)
   integer i, j, cmpcnt
   integer skeys_icnt, pkeys_ikey
   skeys = 0
   pkeys = 1
   cmpcnt = 0

   do i = 1 , NITER
      ikey = 1
      icnt = 1
      skeys_icnt = skeys(icnt)
      pkeys_ikey = pkeys(ikey)
      do
         do j = 0,NDIM - max(icnt,ikey)
            if (skeys_icnt .gt. pkeys_ikey) then
               ikey = ikey + 1
               pkeys_ikey = pkeys(ikey)
               cycle
            endif
            icnt = icnt + 1
            skeys_icnt = skeys(icnt)
         end do
         if(ikey .gt. NDIM) exit
         if(icnt .gt. NDIM) exit
      end do
      cmpcnt  = cmpcnt + ikey / icnt
   enddo

   print *, cmpcnt

end program
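
Before timing a restructured loop like this one, it is worth confirming that it stops at the same ikey and icnt as the original. A minimal sketch of such a check follows. The names scan_variants, scan_orig and scan_blocked and the test keys are assumptions made for illustration; both routines take arrays padded to NDIM+1 elements, because the restructured loop reads one element past NDIM before it tests the exit conditions.

```fortran
module scan_variants
   implicit none
contains

   ! Original do-while loop from the reproducer.
   subroutine scan_orig(skeys, pkeys, ndim, ikey, icnt)
      integer, intent(in)  :: ndim
      integer, intent(in)  :: skeys(ndim+1), pkeys(ndim+1)
      integer, intent(out) :: ikey, icnt
      ikey = 1
      icnt = 1
      do while (ikey .le. ndim .and. icnt .le. ndim)
         if (skeys(icnt) .gt. pkeys(ikey)) then
            ikey = ikey + 1
         else
            icnt = icnt + 1
         endif
      enddo
   end subroutine scan_orig

   ! Restructured loop: the inner do runs a fixed number of steps during
   ! which neither index can pass NDIM before the last step, so the
   ! bounds are only tested after each block.
   subroutine scan_blocked(skeys, pkeys, ndim, ikey, icnt)
      integer, intent(in)  :: ndim
      integer, intent(in)  :: skeys(ndim+1), pkeys(ndim+1)
      integer, intent(out) :: ikey, icnt
      integer :: j, skeys_icnt, pkeys_ikey
      ikey = 1
      icnt = 1
      skeys_icnt = skeys(icnt)
      pkeys_ikey = pkeys(ikey)
      do
         do j = 0, ndim - max(icnt, ikey)
            if (skeys_icnt .gt. pkeys_ikey) then
               ikey = ikey + 1
               pkeys_ikey = pkeys(ikey)
               cycle
            endif
            icnt = icnt + 1
            skeys_icnt = skeys(icnt)
         end do
         if (ikey .gt. ndim) exit
         if (icnt .gt. ndim) exit
      end do
   end subroutine scan_blocked

end module scan_variants

program check_blocked
   use scan_variants
   implicit none
   integer, parameter :: ndim = 100
   integer :: skeys(ndim+1), pkeys(ndim+1)
   integer :: a, k, ik1, ic1, ik2, ic2
   logical :: ok

   ! Case from the thread: constant keys.
   skeys = 0
   pkeys = 1
   call scan_orig(skeys, pkeys, ndim, ik1, ic1)
   call scan_blocked(skeys, pkeys, ndim, ik2, ic2)
   ok = (ik1 .eq. ik2) .and. (ic1 .eq. ic2)

   ! Hypothetical mixed keys so that both branches of the if are taken.
   do a = 1, 6
      do k = 1, ndim
         skeys(k) = mod(k * a, 13)
         pkeys(k) = mod(k * 5, 11)
      enddo
      skeys(ndim+1) = 0
      pkeys(ndim+1) = 0
      call scan_orig(skeys, pkeys, ndim, ik1, ic1)
      call scan_blocked(skeys, pkeys, ndim, ik2, ic2)
      if (ik1 .ne. ik2 .or. ic1 .ne. ic2) ok = .false.
   enddo

   if (ok) then
      print *, 'scan_orig and scan_blocked agree'
   else
      print *, 'MISMATCH between scan_orig and scan_blocked'
   endif
end program check_blocked
```

The check relies on the Fortran rule that the trip count of a counted do loop is fixed on entry, so changing icnt and ikey inside the inner loop does not change how many steps that block runs.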