Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Skylake Instruction Latencies

JJoha8
New Contributor I

With the Skylake-EP release still in the distant future, I did some benchmarking on a Core i5-6500 to explore some of the new Skylake features. During my instruction latency benchmarking I came across something odd: the AVX instructions (vaddps/vaddpd) seem to have a latency of four clock cycles, as specified in Table C-8 of the Optimization Manual. The scalar and SSE instructions (addss/addsd/addps/addpd), however, seem to have a latency of five clock cycles, contradicting the values found in Tables C-15 and C-16 of the Optimization Manual. I can (almost certainly) rule out problems with my benchmarking code, because it includes verification tests and the results check out. But maybe I made a mistake:

#define INSTR vaddpd
#define NINST 6
#define N edi
#define i r8d


.intel_syntax noprefix
.globl ninst
.data
ninst:
.long NINST
.text
.globl latency
.type latency, @function
.align 32
latency:
        push      rbp
        mov       rbp, rsp
        xor       i, i
        test      N, N
        jle       done
        # create DP 1.0 (0x3FF0000000000000) for the vaddpd chain
        vpcmpeqw xmm0, xmm0, xmm0       # all ones
        vpsllq xmm0, xmm0, 54           # logical left shift: ten ones at the top (54 = 64 - 10)
        vpsrlq xmm0, xmm0, 2            # logical right shift: sign and leading exponent bit zero -> exponent 0x3FF, mantissa 0
        # copy DP 1.0
        vmovaps xmm1, xmm0
loop:
        inc       i
        INSTR     xmm0, xmm0, xmm1
        INSTR     xmm0, xmm0, xmm1
        INSTR     xmm0, xmm0, xmm1
        cmp       i, N
        INSTR     xmm0, xmm0, xmm1
        INSTR     xmm0, xmm0, xmm1
        INSTR     xmm0, xmm0, xmm1
        jl        loop
done:
        mov  rsp, rbp
        pop rbp
        ret
.size latency, .-latency
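
For reference, a minimal C driver of the following form can sit around such a kernel (a sketch, not the actual harness; the iteration count is illustrative, and it assumes the TSC ticks at the core frequency, e.g. with Turbo and SpeedStep disabled -- otherwise the rdtsc counts would have to be rescaled to core cycles):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc */

extern void latency(int n);            /* the assembly kernel above */
extern int ninst;                      /* exported by the kernel: instructions per loop iteration */

int main(void)
{
    const int n = 100000000;           /* large enough that call overhead is negligible */

    latency(n);                        /* untimed warm-up pass */

    uint64_t start  = __rdtsc();
    latency(n);
    uint64_t cycles = __rdtsc() - start;

    printf("%.2f cycles per instruction\n",
           (double)cycles / ((double)n * ninst));
    return 0;
}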

What really strikes me as odd is that this implies separate hardware paths for AVX and scalar/SSE instructions. Does it have to do with the AVX Turbo frequency being lower than the regular Turbo frequency? Giving the scalar/SSE units a longer pipeline (5 instead of 4 stages) would allow them to be clocked higher. The same seems to be true for other instructions (FMA, see the table below). By the way, FMA latency and throughput are not documented in Appendix C of the Optimization Manual. Is there another resource where they are documented?

[Attached screenshot: table of measured latencies (Screen Shot 2016-02-25 at 12.36.56.png)]
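
For what it's worth, the FMA latency can be probed the same way: either redefine INSTR to one of the vfmadd flavors in the kernel above, or use a dependent intrinsics chain like this sketch (function name and iteration count are made up; compile with something like -O2 -mfma and check the generated assembly to confirm the loop really is a single dependent FMA per iteration):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>                 /* _mm_fmadd_pd */
#include <x86intrin.h>                 /* __rdtsc */

/* Dependent FMA chain: each FMA needs the previous result through both the
   multiplicand and the addend, so cycles/iters approximates the FMA latency.
   acc = acc*1.0 + acc stays 0.0, so neither overflow nor denormals occur. */
static double fma_chain_cycles(long iters)
{
    volatile double start = 0.0;               /* opaque to the optimizer */
    __m128d acc = _mm_set1_pd(start);
    __m128d one = _mm_set1_pd(1.0);

    uint64_t t0 = __rdtsc();
    for (long k = 0; k < iters; ++k)
        acc = _mm_fmadd_pd(acc, one, acc);
    uint64_t cyc = __rdtsc() - t0;

    volatile double sink = _mm_cvtsd_f64(acc); /* consume the result */
    (void)sink;
    return (double)cyc / (double)iters;
}

int main(void)
{
    printf("FMA chain: %.2f cycles per FMA\n", fma_chain_cycles(100000000));
    return 0;
}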

McCalpinJohn
Honored Contributor III

I have not looked at Agner Fog's test harness in detail, but I generally find his results more comprehensive than Intel's -- especially when I am looking at changes in performance over time.  (I sometimes have trouble parsing the compressed format he uses for documentation of instruction types for the newer processors, especially for instructions that have very similar mnemonics, but that is not a problem here.)

The Skylake results in the 2016-01-09 revision of his instruction_tables document (http://www.agner.org/optimize/instruction_tables.pdf) don't show any deviations from the reported 4-cycle latency for the ADD, MUL, and FMA operations at the various SIMD widths, or from the 11 cycles for the single-precision divides, but he reports "13-14" cycles for the double-precision divides.

For the FP divides there are quite a few possible timings, depending on the specific values involved, due to the iterative nature of the process.  With a SIMD implementation, an "early out" can probably only be taken if all of the lanes satisfy the "early out" criteria (e.g., dividing a non-denorm number by 2 and obtaining a non-denorm result), but it is easy to imagine the timing varying by a cycle or so for different combinations of special cases in the different SIMD lanes.
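
That value dependence is easy to probe with a dependent divide chain in which x = d / x just oscillates between two normal values, so neither overflow nor denormals can creep into the measurement. A sketch (the operand pairs are guesses at an "easy" and a "hard" case; rdtsc again assumes a fixed core clock):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>                 /* _mm_div_pd */
#include <x86intrin.h>                 /* __rdtsc */

/* The chain runs through the divisor operand: each divide must wait for the
   previous quotient, while x bounces between x0 and d/x0. */
static double div_chain_cycles(double d_val, double x0, long iters)
{
    __m128d d = _mm_set1_pd(d_val);
    __m128d x = _mm_set1_pd(x0);

    uint64_t t0 = __rdtsc();
    for (long k = 0; k < iters; ++k)
        x = _mm_div_pd(d, x);
    uint64_t cyc = __rdtsc() - t0;

    volatile double sink = _mm_cvtsd_f64(x);   /* consume the result */
    (void)sink;
    return (double)cyc / (double)iters;
}

int main(void)
{
    long n = 10000000;
    /* power-of-two operands: a candidate for an "early out" */
    printf("2.0 / x:  %.2f cycles\n", div_chain_cycles(2.0, 1.0, n));
    /* full-significand operands: should need the full iteration count */
    printf("pi / x:   %.2f cycles\n", div_chain_cycles(3.141592653589793, 1.4142135623730951, n));
    return 0;
}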

You are probably already compensating for these, but two issues that cause me trouble when I am trying to get very detailed timings are the "start-up" overhead of the 256-bit pipelines and the transition of the branch predictor from correctly predicting to incorrectly predicting the loop exit somewhere in the range of 32-38 iterations (on Haswell -- I am not sure if Skylake is the same).
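
A sketch of a guard against the first effect -- run 256-bit work for a while before anything is timed, so the upper lanes are fully powered up (the loop bound is illustrative):

#include <immintrin.h>

static void avx_warmup(void)
{
    volatile double start = 1.0;               /* keeps the loop from being folded away */
    __m256d v = _mm256_set1_pd(start);
    for (int k = 0; k < 100000; ++k)           /* roughly tens of microseconds */
        v = _mm256_add_pd(v, v);               /* dependent 256-bit adds; saturates at +inf, which stays full speed */
    volatile double sink = _mm_cvtsd_f64(_mm256_castpd256_pd128(v));
    (void)sink;
}

Calling something like this immediately before the timed region, and running the timed loop for far more than ~38 iterations, should push both effects well into the noise.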
