Hi all,
I am writing a Xeon Phi native application, but CPU utilization is very low (about 25%) and the CPI is a little high (about 3). I used plenty of vector instructions in the application, and I suspect that's the reason for the poor performance. I wonder if there is any official document or manual giving the latency of every Xeon Phi instruction. I have searched but only found empirical measurements like this:
http://arxiv.org/abs/1310.5842
So please let me know if you have any ideas. Thanks.
Easton,
First off, thanks for the article reference.
Secondly, I don't think such a document exists, or is even possible, for current architectures. The pipelines of modern processors, even the Pentium-derived cores the coprocessor is based on, are too heavily optimized for performance for a single latency table to apply. What I mean is that when a resource is not used by one HW thread, it is likely being used by another thread executing on the same core. Branch prediction and HW prefetching are further examples. Each of these makes precise latency measurements impossible. And that's without even getting into the later-generation (non-MIC) out-of-order cores.
Have you used the optimization and vectorization report features of the Intel compilers? Since you quote a CPI, it looks like you used one of the analysis tools (e.g. VTune) to look at the performance counters. Have you checked the vector-instruction execution counts? Cache misses?
Regards
--
Taylor
In general the most common way to get high CPI is with memory stalls -- especially with an in-order processor core such as the one used in Xeon Phi.
Using the STREAM benchmark as an example, the "Triad" result of 176 GB/s on 60 cores corresponds to one cache line every 24 cycles on each core. There are 10 instructions in the inner loop, giving a CPI of about 2.4. Of the 10 instructions in the inner loop: 3 are doing the actual work (load, FMA, store), 4 are doing prefetches (2 vprefetch0 and 2 vprefetch1), and 3 are doing loop control (add, compare, branch). Unrolling the loop would reduce the effective number of instructions from 10 to closer to 7 and move the CPI to about 3.4. For this benchmark running multiple threads per core reduces the performance, bringing the CPI up toward 4.
Most vectorizable codes have few vector instruction dependency stalls -- the same constructs that would lead to dependency stalls usually prevent vectorization. There are counter-examples, of course, but I would start by looking at the memory footprint and trying to estimate which level of the memory hierarchy is providing the inputs to your vector operations.
John,
I wanted to understand your calculation and hope to get clarification on something. Starting from 176 GB/s, I can see
176/60 = 2.93333 (GB/s/core)
2.93333/1.1 GHz = 2.6666 bytes/cycle/core
64 bytes/cacheline / 2.6666 bytes/cycle = 24 cycles/cacheline
for a single thread. But, that thread only executes an instruction every other clock on Phi, doesn't it? Should the frequency be halved?
