ifort 18.0.2 generates 3x slower code compared to 14.0.1 on a Xeon E3-1240 v3

gn164

 

Hi,

The following piece of code (extracted from a real application and reduced to the minimum needed to reproduce the issue) runs approximately 3x slower when compiled with ifort 18.0.2 than with 14.0.1.

 

      program cmp

         parameter (NITER = 1000000)
         parameter (NDIM = 100)

         integer skeys(NDIM)
         integer pkeys(NDIM)
         integer i, cmpcnt

         skeys = 0
         pkeys = 1
         cmpcnt = 0

         do i = 1 , NITER
            ikey = 1
            icnt = 1
            do while (ikey .le. NDIM .and. icnt .le. NDIM)
               if (skeys(icnt) .gt. pkeys(ikey)) then
                  ikey = ikey + 1
               else
                  icnt = icnt + 1
               endif
            enddo
            cmpcnt  = cmpcnt + ikey / icnt
         enddo

         print *, cmpcnt

      end program

The same flags were used to compile both versions:

-qopt-report-file=vecreport -qopt-report=5 -O3

The assembly generated for the inner loop in the two cases:

14.0.1 (fast)

..B1.7:                         # Preds ..B1.8 ..B1.6 ..B1.10
..LN41:
   .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %edx              #25.18
..LN42:
        cmpl      -4+cmp_$PKEYS.0.1(,%rax,4), %edx              #25.30
..LN43:
        jle       ..B1.10       # Prob 50%                      #25.30
..LN44:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
..LN45:
   .loc    1  26  is_stmt 1
        incq      %rax                                          #26.17
..LN46:
   .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN47:
        jle       ..B1.7        # Prob 99%                      #24.27
..LN48:
        jmp       ..B1.12       # Prob 100%                     #24.27
..LN49:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.10:                        # Preds ..B1.7
..LN50:
   .loc    1  28  is_stmt 1
        incq      %rdi                                          #28.16
..LN51:
   .loc    1  24  is_stmt 1
        cmpq      $100, %rdi                                    #24.48
..LN52:
        jle       ..B1.7        # Prob 99%                      #24.48
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.12:                        # Preds ..B1.10 ..B1.8

 

18.0.2 (slow)

 

..B1.7:                         # Preds ..B1.6 ..B1.8
                                # Execution count [4.97e+07]
..LN44:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %r8d              #25.14
..LN45:
        .loc    1  26  is_stmt 1
        lea       1(%rax), %rdx                                 #26.17
..LN46:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$PKEYS.0.1(,%rax,4), %r9d              #25.14
..LN47:
        .loc    1  26  is_stmt 1
        cmpl      %r9d, %r8d                                    #26.17
..LN48:
        .loc    1  28  is_stmt 1
        lea       1(%rdi), %r10                                 #28.16
..LN49:
        .loc    1  26  is_stmt 1
        cmovg     %rdx, %rax                                    #26.17
..LN50:
        .loc    1  28  is_stmt 1
        cmovle    %r10, %rdi                                    #28.16
..LN51:
        .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN52:
        jg        ..B1.10       # Prob 1%                       #24.27
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
                                # Execution count [4.93e+07]
..L13:
..LN54:
..LN55:
        cmpq      $100, %rdi                                    #24.48
..LN56:
        jle       ..B1.7        # Prob 99%                      #24.48

 

 

gn164

 

Hi Jim,

Thanks, I will try this out. What would be the most likely reason for the overhead caused by the extra branch in #17?

Is it misprediction? The way pkeys and skeys are initialized, this is always a non-taken branch, so I would expect the performance not to be too bad in this case.
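
The claim about the branch direction can be checked with a small instrumented copy of the inner loop. This is only a sketch: the program name branch_count and the counters ntrue and nfalse are introduced here for illustration and are not part of the original program.

```fortran
      program branch_count
         implicit none
         integer, parameter :: NDIM = 100
         integer :: skeys(NDIM), pkeys(NDIM)
         integer :: ikey, icnt, ntrue, nfalse

         ! Same initialization as in the reproducer
         skeys = 0
         pkeys = 1

         ikey = 1
         icnt = 1
         ntrue = 0
         nfalse = 0
         do while (ikey .le. NDIM .and. icnt .le. NDIM)
            if (skeys(icnt) .gt. pkeys(ikey)) then
               ! comparison true: advance ikey
               ntrue = ntrue + 1
               ikey = ikey + 1
            else
               ! comparison false: advance icnt
               nfalse = nfalse + 1
               icnt = icnt + 1
            endif
         enddo

         ! Counters first, then the final index values
         print *, ntrue, nfalse, ikey, icnt

      end program
```

With skeys = 0 and pkeys = 1 the two counters come out as 0 and 100, and the loop ends with ikey = 1 and icnt = 101, so the comparison is never true for this input and the direction of the data-dependent branch never changes.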

jimdempseyatthecove

Re: why: run VTune with a sufficiently large NITER to collect the counts (note: use hardware sampling as opposed to statistical sampling).

As I am known to say, "Almost every algorithm can be improved upon"...

program cmp

   parameter (NITER = 1000000)
   parameter (NDIM = 100)

   integer skeys(NDIM+1)
   integer pkeys(NDIM+1)
   integer i, j, cmpcnt
   integer skeys_icnt, pkeys_ikey
   skeys = 0
   pkeys = 1
   cmpcnt = 0

   do i = 1 , NITER
      ikey = 1
      icnt = 1
      skeys_icnt = skeys(icnt)
      pkeys_ikey = pkeys(ikey)
      do
         do j = 0,NDIM - max(icnt,ikey)
            if (skeys_icnt .gt. pkeys_ikey) then
               ikey = ikey + 1
               pkeys_ikey = pkeys(ikey)
               cycle
            endif
            icnt = icnt + 1
            skeys_icnt = skeys(icnt)
         end do
         if(ikey .gt. NDIM) exit
         if(icnt .gt. NDIM) exit
      end do
      cmpcnt  = cmpcnt + ikey / icnt
   enddo

   print *, cmpcnt

end program
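
A minimal sketch for checking that the restructured loop above ends with the same ikey and icnt as the original do while loop. The subroutine names walk_original and walk_blocked and the three test inputs are hypothetical and chosen only for illustration. Both arrays are sized NDIM+1, as in the restructured program, because that version reads one element past NDIM before it exits.

```fortran
program check_equiv
   implicit none
   integer, parameter :: NDIM = 100
   integer :: skeys(NDIM+1), pkeys(NDIM+1)
   integer :: k, t, ik1, ic1, ik2, ic2

   do t = 1, 3
      select case (t)
      case (1)   ! initialization used in the thread
         skeys = 0
         pkeys = 1
      case (2)   ! opposite extreme: comparison always true
         skeys = 1
         pkeys = 0
      case (3)   ! arbitrary mixed pattern (made-up test data)
         do k = 1, NDIM + 1
            skeys(k) = mod(7*k, 13)
            pkeys(k) = mod(5*k, 11)
         end do
      end select
      call walk_original(skeys, pkeys, ik1, ic1)
      call walk_blocked(skeys, pkeys, ik2, ic2)
      if (ik1 /= ik2 .or. ic1 /= ic2) error stop 'walks disagree'
      print *, 'case', t, ': ikey =', ik1, ' icnt =', ic1
   end do

contains

   ! Inner loop exactly as in the original reproducer
   subroutine walk_original(s, p, ikey, icnt)
      integer, intent(in)  :: s(:), p(:)
      integer, intent(out) :: ikey, icnt
      ikey = 1
      icnt = 1
      do while (ikey .le. NDIM .and. icnt .le. NDIM)
         if (s(icnt) .gt. p(ikey)) then
            ikey = ikey + 1
         else
            icnt = icnt + 1
         end if
      end do
   end subroutine

   ! Restructured loop: counted inner loop, bounds re-checked outside it
   subroutine walk_blocked(s, p, ikey, icnt)
      integer, intent(in)  :: s(:), p(:)
      integer, intent(out) :: ikey, icnt
      integer :: j, s_icnt, p_ikey
      ikey = 1
      icnt = 1
      s_icnt = s(icnt)
      p_ikey = p(ikey)
      do
         ! Each pass advances exactly one index, so neither index can
         ! exceed NDIM+1 within this trip count
         do j = 0, NDIM - max(icnt, ikey)
            if (s_icnt .gt. p_ikey) then
               ikey = ikey + 1
               p_ikey = p(ikey)
               cycle
            end if
            icnt = icnt + 1
            s_icnt = s(icnt)
         end do
         if (ikey .gt. NDIM) exit
         if (icnt .gt. NDIM) exit
      end do
   end subroutine

end program check_equiv
```

The program stops with an error if the two walks ever disagree. For the initialization used in the thread (case 1) both end with ikey = 1 and icnt = 101, so ikey / icnt is 0 in integer arithmetic and cmpcnt stays 0.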

 
