Hi all,
I am writing a Xeon Phi native application, but CPU utilization is very low (about 25%) and the CPI is a little high (about 3). I used plenty of vector instructions in the application, and I suspect that's the reason for the poor performance. I wonder if there is any official document or manual giving the latency of every Xeon Phi instruction. I have searched but only found empirical measurements like this:
http://arxiv.org/abs/1310.5842
So please let me know if you have any ideas. Thanks.
Easton,
First off, thanks for the article reference.
Secondly, I don't think such a document exists, or is even possible, for current architectures. The pipelines of modern processors, even the Pentium-derived cores the coprocessor is based on, are too heavily optimized for performance for a single latency table to apply. What I mean is that when a resource is not used by one HW thread, it is likely being used by another thread executing on the same core. Branch prediction and HW prefetching are further examples. Each of these makes precise latency measurements impossible. And that's without even getting into the later-generation (non-MIC) out-of-order cores.
Have you used the optimization and vectorization report features of the Intel compilers? Since you quote a CPI, it looks like you used one of the analysis tools (e.g. VTune) to look at the performance counters. Have you checked the vector-instruction execution counts? Cache misses?
Regards
--
Taylor
In general the most common way to get high CPI is with memory stalls -- especially with an in-order processor core such as the one used in Xeon Phi.
Using the STREAM benchmark as an example, the "Triad" result of 176 GB/s on 60 cores corresponds to one cache line every 24 cycles on each core. There are 10 instructions in the inner loop, giving a CPI of about 2.4. Of the 10 instructions in the inner loop: 3 are doing the actual work (load, FMA, store), 4 are doing prefetches (2 vprefetch0 and 2 vprefetch1), and 3 are doing loop control (add, compare, branch). Unrolling the loop would reduce the effective number of instructions from 10 to closer to 7 and move the CPI to about 3.4. For this benchmark running multiple threads per core reduces the performance, bringing the CPI up toward 4.
Most vectorizable codes have few vector instruction dependency stalls -- the same constructs that would lead to dependency stalls usually prevent vectorization. There are counter-examples, of course, but I would start by looking at the memory footprint and trying to estimate which level of the memory hierarchy is providing the inputs to your vector operations.
John,
I wanted to understand your calculation and hope to get clarification on something. Starting from 176 GB/s, I can see
176/60 = 2.93333 (GB/s/core)
2.93333/1.1 GHz = 2.6666 bytes/cycle/core
64 bytes/cacheline / 2.6666 bytes/cycle = 24 cycles/cacheline
for a single thread. But, that thread only executes an instruction every other clock on Phi, doesn't it? Should the frequency be halved?
