Hi,
The following piece of code (extracted from a real application and reduced to the minimum needed to reproduce the issue)
runs approximately 3x slower when compiled with ifort 18.0.2 than with 14.0.1.
      program cmp
      parameter (NITER = 1000000)
      parameter (NDIM = 100)
      integer skeys(NDIM)
      integer pkeys(NDIM)
      integer i, cmpcnt

      skeys = 0
      pkeys = 1
      cmpcnt = 0
      do i = 1, NITER
        ikey = 1
        icnt = 1
        do while (ikey .le. NDIM .and. icnt .le. NDIM)
          if (skeys(icnt) .gt. pkeys(ikey)) then
            ikey = ikey + 1
          else
            icnt = icnt + 1
          endif
        enddo
        cmpcnt = cmpcnt + ikey / icnt
      enddo
      print *, cmpcnt
      end program
The same flags were used to compile both versions:
-qopt-report-file=vecreport -qopt-report=5 -O3
The assembly generated for the inner loop in the two cases:
14.0.1 (fast)
..B1.7:                         # Preds ..B1.8 ..B1.6 ..B1.10
..LN41:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %edx              #25.18
..LN42:
        cmpl      -4+cmp_$PKEYS.0.1(,%rax,4), %edx              #25.30
..LN43:
        jle       ..B1.10       # Prob 50%                      #25.30
..LN44:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
..LN45:
        .loc    1  26  is_stmt 1
        incq      %rax                                          #26.17
..LN46:
        .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN47:
        jle       ..B1.7        # Prob 99%                      #24.27
..LN48:
        jmp       ..B1.12       # Prob 100%                     #24.27
..LN49:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.10:                        # Preds ..B1.7
..LN50:
        .loc    1  28  is_stmt 1
        incq      %rdi                                          #28.16
..LN51:
        .loc    1  24  is_stmt 1
        cmpq      $100, %rdi                                    #24.48
..LN52:
        jle       ..B1.7        # Prob 99%                      #24.48
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.12:                        # Preds ..B1.10 ..B1.8
18.0.2 (slow)
..B1.7:                         # Preds ..B1.6 ..B1.8
                                # Execution count [4.97e+07]
..LN44:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$SKEYS.0.1(,%rdi,4), %r8d              #25.14
..LN45:
        .loc    1  26  is_stmt 1
        lea       1(%rax), %rdx                                 #26.17
..LN46:
        .loc    1  25  is_stmt 1
        movl      -4+cmp_$PKEYS.0.1(,%rax,4), %r9d              #25.14
..LN47:
        .loc    1  26  is_stmt 1
        cmpl      %r9d, %r8d                                    #26.17
..LN48:
        .loc    1  28  is_stmt 1
        lea       1(%rdi), %r10                                 #28.16
..LN49:
        .loc    1  26  is_stmt 1
        cmovg     %rdx, %rax                                    #26.17
..LN50:
        .loc    1  28  is_stmt 1
        cmovle    %r10, %rdi                                    #28.16
..LN51:
        .loc    1  24  is_stmt 1
        cmpq      $100, %rax                                    #24.27
..LN52:
        jg        ..B1.10       # Prob 1%                       #24.27
..LN53:
                                # LOE rax rcx rbx rdi r13 r14 r15 esi r12d
..B1.8:                         # Preds ..B1.7
                                # Execution count [4.93e+07]
..L13:
..LN54:
..LN55:
        cmpq      $100, %rdi                                    #24.48
..LN56:
        jle       ..B1.7        # Prob 99%                      #24.48
Hi Jim,
Thanks, I will try this out. What would be the most likely reason for the overhead caused by the extra branch in #17?
Is it misprediction? Given the way pkeys and skeys are initialized, this branch is never taken, so I would not expect the performance to be
too bad in this case.
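One rough way to probe the misprediction question is to time the same inner loop with the constant keys from the original post and again with random keys, building with each compiler version. The sketch below is only an illustration under assumptions: the names `merge_scan`, `scan_keys`, and `time_scan` are made up here, the source is assumed to be free-form, and the random 0/1 keys are an arbitrary choice for making the comparison data-dependent.

```fortran
module merge_scan
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: NDIM = 100
contains
  ! Inner loop from the original post, applied once to the two key arrays.
  ! Returns ikey / icnt, the value the original program accumulates.
  integer function scan_keys(skeys, pkeys) result(ratio)
    integer, intent(in) :: skeys(NDIM), pkeys(NDIM)
    integer :: ikey, icnt
    ikey = 1
    icnt = 1
    do while (ikey .le. NDIM .and. icnt .le. NDIM)
      if (skeys(icnt) .gt. pkeys(ikey)) then
        ikey = ikey + 1
      else
        icnt = icnt + 1
      endif
    enddo
    ratio = ikey / icnt
  end function scan_keys

  ! Runs scan_keys niter times and reports wall-clock seconds.
  ! cmpcnt is returned so the work has a visible result; an optimizer may
  ! still simplify the repeated calls, so check the generated assembly.
  subroutine time_scan(skeys, pkeys, niter, seconds, cmpcnt)
    integer, intent(in)  :: skeys(NDIM), pkeys(NDIM), niter
    real,    intent(out) :: seconds
    integer, intent(out) :: cmpcnt
    integer(int64) :: t0, t1, rate
    integer :: i
    cmpcnt = 0
    call system_clock(t0, rate)
    do i = 1, niter
      cmpcnt = cmpcnt + scan_keys(skeys, pkeys)
    enddo
    call system_clock(t1)
    seconds = real(t1 - t0) / real(rate)
  end subroutine time_scan
end module merge_scan

program branch_test
  use merge_scan
  implicit none
  integer, parameter :: NITER = 1000000
  integer :: skeys(NDIM), pkeys(NDIM), cmpcnt
  real :: r(NDIM), seconds

  ! Case 1: constant keys as in the original post (comparison always false).
  skeys = 0
  pkeys = 1
  call time_scan(skeys, pkeys, NITER, seconds, cmpcnt)
  print *, 'constant keys: ', seconds, ' s, cmpcnt = ', cmpcnt

  ! Case 2: random 0/1 keys (assumption: chosen only to vary the outcome).
  call random_number(r)
  skeys = int(2.0 * r)
  call random_number(r)
  pkeys = int(2.0 * r)
  call time_scan(skeys, pkeys, NITER, seconds, cmpcnt)
  print *, 'random keys:   ', seconds, ' s, cmpcnt = ', cmpcnt
end program branch_test
```

Because the same two arrays are reused on every pass, a modern branch predictor may still learn the pattern in the random case, so regenerating the keys between passes would be a stricter test. Timings like these only hint at a cause; hardware counters are the more direct evidence.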
Re: why: run VTune with a sufficiently large NITER to produce the counts (note: use hardware sampling as opposed to statistical sampling).
As I am known to say, "Almost every algorithm can be improved upon"...
      program cmp
      parameter (NITER = 1000000)
      parameter (NDIM = 100)
! One extra element so that reading index NDIM+1 stays in bounds
      integer skeys(NDIM+1)
      integer pkeys(NDIM+1)
      integer i, j, cmpcnt
      integer skeys_icnt, pkeys_ikey

      skeys = 0
      pkeys = 1
      cmpcnt = 0
      do i = 1, NITER
        ikey = 1
        icnt = 1
! Keep the current keys in scalars instead of re-reading the arrays
        skeys_icnt = skeys(icnt)
        pkeys_ikey = pkeys(ikey)
        do
! Neither index can pass NDIM+1 within this trip count,
! so the bounds test moves out of the inner loop
          do j = 0, NDIM - max(icnt, ikey)
            if (skeys_icnt .gt. pkeys_ikey) then
              ikey = ikey + 1
              pkeys_ikey = pkeys(ikey)
              cycle
            endif
            icnt = icnt + 1
            skeys_icnt = skeys(icnt)
          end do
          if (ikey .gt. NDIM) exit
          if (icnt .gt. NDIM) exit
        end do
        cmpcnt = cmpcnt + ikey / icnt
      enddo
      print *, cmpcnt
      end program
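Before timing a restructured loop like this, it may be worth confirming that it ends with the same ikey and icnt as the original `do while` version. The following is a minimal sketch of such a check, not part of the code posted above: the module and routine names (`key_scans`, `scan_original`, `scan_rewritten`) are hypothetical, the source is assumed to be free-form, and the random keys in the range 0..9 are an arbitrary test input.

```fortran
module key_scans
  implicit none
  integer, parameter :: NDIM = 100
contains
  ! Original formulation: bounds tested on every step.
  subroutine scan_original(skeys, pkeys, ikey, icnt)
    integer, intent(in)  :: skeys(NDIM+1), pkeys(NDIM+1)
    integer, intent(out) :: ikey, icnt
    ikey = 1
    icnt = 1
    do while (ikey .le. NDIM .and. icnt .le. NDIM)
      if (skeys(icnt) .gt. pkeys(ikey)) then
        ikey = ikey + 1
      else
        icnt = icnt + 1
      endif
    enddo
  end subroutine scan_original

  ! Restructured formulation: counted inner loop, bounds tested afterwards.
  ! The arrays carry one padding element because index NDIM+1 is read.
  subroutine scan_rewritten(skeys, pkeys, ikey, icnt)
    integer, intent(in)  :: skeys(NDIM+1), pkeys(NDIM+1)
    integer, intent(out) :: ikey, icnt
    integer :: j, skeys_icnt, pkeys_ikey
    ikey = 1
    icnt = 1
    skeys_icnt = skeys(icnt)
    pkeys_ikey = pkeys(ikey)
    do
      do j = 0, NDIM - max(icnt, ikey)
        if (skeys_icnt .gt. pkeys_ikey) then
          ikey = ikey + 1
          pkeys_ikey = pkeys(ikey)
          cycle
        endif
        icnt = icnt + 1
        skeys_icnt = skeys(icnt)
      end do
      if (ikey .gt. NDIM) exit
      if (icnt .gt. NDIM) exit
    end do
  end subroutine scan_rewritten
end module key_scans

program check_scans
  use key_scans
  implicit none
  integer :: skeys(NDIM+1), pkeys(NDIM+1)
  integer :: ik1, ic1, ik2, ic2, trial, mismatches
  real :: r(NDIM+1)

  mismatches = 0
  do trial = 1, 1000
    ! Assumption: random keys in 0..9, used only as test data.
    call random_number(r)
    skeys = int(10.0 * r)
    call random_number(r)
    pkeys = int(10.0 * r)
    call scan_original(skeys, pkeys, ik1, ic1)
    call scan_rewritten(skeys, pkeys, ik2, ic2)
    if (ik1 .ne. ik2 .or. ic1 .ne. ic2) mismatches = mismatches + 1
  end do
  print *, 'mismatches: ', mismatches
end program check_scans
```

If the two formulations are equivalent, the mismatch count stays at zero. Compiling the check with array bounds checking enabled would also confirm that the padded element is the only extra index touched.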