Performance issues with Xeon Phi

Vladimir_Dergachev · ‎08-28-2013

In chasing down performance issues with our code on Xeon Phi, I have isolated a section that illustrates many of the problems we are seeing.

Screenshot from VTune Amplifier is attached.

The left panel shows the source snippet highlighted in grey; the corresponding assembly is highlighted in pale blue in the right panel. VTune was run collecting only the CPU_CLK_UNHALTED and PIPELINE_FLUSHES events.

First, notice that in the assembly listing for addresses 0x4503a8 to 0x4503de only every other line has CPU_CLK_UNHALTED counts, whereas this is not the case for lines 0x4503de to 0x4503f2.

An examination shows that the compiler placed all instructions in that area so that they have dependencies; for example, the instruction at 0x4503c4 uses %ecx, which is computed by the previous instruction at 0x4503c1. This is baffling: why didn't the compiler exchange the instructions at 0x4503c4 and 0x4503c6? They access different registers, and swapping them would have reduced the dependency between adjacent instructions.
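To make the reordering question concrete, here is a minimal C++ sketch. It is a hypothetical stand-in, not the code from the screenshot: the function names, operands, and shift amounts are all assumptions chosen only to show a back-to-back dependency and the same work with an independent operation placed in between.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example only; NOT the code shown in the VTune screenshot.
// Original order: the add consumes c right after the shift produces it.
std::uint32_t chained(std::uint32_t a, std::uint32_t b) {
    std::uint32_t c = a >> 3;  // produces c
    std::uint32_t d = c + 1;   // uses c immediately (back-to-back dependency)
    std::uint32_t e = b << 2;  // independent of c and d
    return d + e;
}

// Reordered: the independent shift sits between producer and consumer.
std::uint32_t interleaved(std::uint32_t a, std::uint32_t b) {
    std::uint32_t c = a >> 3;  // produces c
    std::uint32_t e = b << 2;  // independent work fills the gap
    std::uint32_t d = c + 1;   // consumer of c, now one slot later
    return d + e;
}

// Sanity check that the reordering does not change the result.
void check_equivalence(std::uint32_t a, std::uint32_t b) {
    assert(chained(a, b) == interleaved(a, b));
}
```

Source order does not pin instruction order: an optimizing compiler is free to schedule both functions identically, so the sketch only illustrates the dependency structure, not what icpc would emit.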

Second, take a look at the column PIPELINE_FLUSHES. There are three prominent peaks there, which do not correspond to any branch.

The peaks at 0x4503b6 and 0x4503d0 are the most puzzling. They correspond to shift instructions, and I do not see how these events could have been misattributed from some other branch.

Could someone shed some light on this issue?

Vladimir_Dergachev · ‎08-28-2013

Screenshot attached - somehow it did not make it the first time.

Vladimir_Dergachev · ‎08-28-2013

One more thing - the code was compiled with icpc, using options

-Wall -g -fopenmp -O3 -vec-report6 -mmic -openmp -fma -funroll-loops -fminshared -fno-math-errno -no-vec-guard-write -ipo-jobs32 -finline -inline-forceinline -inline-level=2 -ipo -fp-model fast -opt-prefetch -mP2OPT_hlo_enable_all_mem_refs_prefetch2=T -fimf-domain-exclusion=8 -mcmodel=medium -opt-calloc -opt-mem-layout-trans

icpc version

icpc (ICC) 13.1.2 20130514
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.

Victor_L_2 · ‎08-29-2013

Vladimir,

The attachment is missing. After adding the file, you still need to press the "Start upload" button below.

Vladimir_Dergachev · ‎08-29-2013

The attachment is in the second posting.

robert-reed · ‎09-03-2013

It is also the case that the lines between 0x4503a8 and 0x4503de that have accumulations of CPU_CLK_UNHALTED events are mostly shift instructions. Perhaps these are what are taking longer to execute, and so are statistically more available to accumulate CPU_CLK_UNHALTED events? I'm not sure there's anything to justify concern about the order of instruction emission here. Most of the intervening instructions are register-to-register integer adds, which the Intel Xeon Phi coprocessor should be able to dispatch in one clock and have the results available for the next instruction slot on each HW thread (which may be four clocks later, given the "smart round robin" scheduling that occurs with HW threads in each core). It's not clear to me that the instruction swap you suggest would have any effect on performance--it certainly does nothing about the result dependence that you point out other than possibly giving the core more time to provide the previous value, which it may or may not need.
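The latency argument above can be put into a toy model. This is only a sketch under a strong assumption: it treats the scheduling as strict round-robin, so each hardware thread gets one issue slot every `threads` clocks, which is a simplification of the "smart round robin" behaviour. The function name and parameters are hypothetical, and the model is not a hardware specification.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (assumption: strict round-robin issue among `threads` HW threads).
// A thread's next issue slot comes `threads` clocks after the previous one,
// so a result that takes `latency` clocks is late by max(0, latency - threads).
int exposed_stall_clocks(int latency, int threads) {
    assert(latency >= 0 && threads >= 1);
    return std::max(0, latency - threads);
}
```

Under this model a one-clock integer add never exposes a stall, even with a single thread, so moving an independent instruction between producer and consumer would gain nothing for such adds.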

As for the question about PIPELINE_FLUSHES: it is among the available events for Intel Xeon Phi, but not among the recommended events for performance tuning. I am not able to explain the three peaks you noticed, but I did notice that the assembly lines reporting PIPELINE_FLUSHES are exactly the lines reporting CPU_CLK_UNHALTED. Coincidence? And I'm not even certain what PIPELINE_FLUSH means on this in-order UV-pipe processor: trying to do dual-issue of the adds in the U and V pipes? Don't know.

Finally, I must point out that the loop being optimized here is neither parallelizable nor vectorizable because of the loop constraints, so it is guaranteed to run serial scalar, a worst-case scenario for code on the Intel Xeon Phi coprocessor. The whole idea of the coprocessor is to run parallel vector code.
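As a generic illustration of the distinction (the actual loop is only visible in the screenshot, so both functions below are hypothetical examples), a loop in which each iteration needs the previous iteration's result has to run serially, while a loop with independent iterations is a candidate for vectorization:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: loop-carried dependence.
// Iteration i reads x[i - 1], which iteration i - 1 has just written,
// so the iterations cannot run in parallel or in vector lanes.
void running_state(std::vector<unsigned>& x) {
    for (std::size_t i = 1; i < x.size(); ++i)
        x[i] = (x[i - 1] >> 1) + x[i];
}

// Hypothetical example: independent iterations.
// Each out[i] depends only on in[i], so a compiler may vectorize this loop.
void elementwise(const std::vector<unsigned>& in, std::vector<unsigned>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (in[i] >> 1) + 1u;
}
```

Whether a given compiler actually vectorizes the second loop depends on its options and analysis; the sketch only shows the dependence pattern.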

Vladimir_Dergachev · ‎09-04-2013

The problem is that there are no counts for the ADDs at all. I could imagine a shift being 50% to 100% slower, but not so much that no counts show up at all. For example, address 0x4503a4 has a count more than 15 times smaller than the counts reported for the shifts, but it does show up.

Could it be that CPU_CLK_UNHALTED is only reported for the U pipe?

I used PIPELINE_FLUSHES as an indication of problems with the code due to instruction scheduling, mispredicted branches, etc. Would you know a better counter for the same purpose?

This particular function is used in our setup code and is not normally expected to be a large part of the computation. However, on Xeon Phi it shows up because of the large mismatch in execution speed between the scalar and vector units.

jimdempseyatthecove · ‎09-05-2013

In looking at the disassembly, it appears that the majority of the CPU_CLK_UNHALTED counts are "billed" to the instructions that have a dependency on the result of a prior (close) register load. The instructions that are not "billed" do not reference registers that were (closely) loaded. My guess is that when integer instructions are paired and one has a dependency stall, that thread is the only one getting "billed".

Jim Dempsey