Hi
I want to use half-precision floats on Knights Landing, but I find that they have very poor performance. Does anyone know the reason?
I use the following intrinsics to load and store the data, and then perform vectorized computation. The half-float version shows about a 5x performance degradation compared with the single-float version. According to the article https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats, my results seem odd. Is there something wrong with my code?
// Widen 16 packed FP16 values (256 bits) to 16 FP32 values (512 bits).
#define LOAD_HALF(addr) _mm512_cvtph_ps(_mm256_load_si256((const __m256i *)(addr)))
// Narrow 16 FP32 values back to FP16 and store; 0 selects round-to-nearest-even.
#define STORE_HALF(addr, data) _mm256_store_si256((__m256i *)(addr), _mm512_cvtps_ph((data), 0))
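The compute part looks roughly like this (a sketch with illustrative names and loop body, not my exact kernel):
#include <immintrin.h>

/* Sketch: c[i] += a[i] * b[i] over FP16 arrays, computed in FP32.
   Assumes n is a multiple of 16 and the pointers are 32-byte aligned. */
void fma_half(const unsigned short *a, const unsigned short *b,
              unsigned short *c, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m512 va = LOAD_HALF(a + i);
        __m512 vb = LOAD_HALF(b + i);
        __m512 vc = LOAD_HALF(c + i);
        STORE_HALF(c + i, _mm512_fmadd_ps(va, vb, vc));
    }
}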
The icc version is 17.0.
Thanks!
Zhen
Can you show your code?
KNL instruction latencies and throughputs are not listed in the Intel Intrinsics Guide, but the Haswell __m256 version is listed with a latency of 6 and a throughput of 1. Agner Fog's tables list the KNL PS(512) -> PH(256) conversion at latency 7, throughput 9, and PH(256) -> PS(512) at latency 7, throughput 8. For the conversion cost to be hidden, you would need several floating-point operations between the load/convert and the convert/store. I suggest you experiment with the AVX2 versions of the instructions; I suspect they have lower latency, and if so, you may be able to overlap the conversion time with computation or with loads and stores. It should be easy enough for you to write a test.
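For example, something along these lines (a sketch only; the function name, loop bounds, and alignment assumptions are illustrative, not tested code):
#include <immintrin.h>   /* compile with F16C and FMA enabled, e.g. -mf16c -mfma */

/* 8-wide AVX2/F16C variant: widen FP16 to FP32, FMA, narrow back.
   Assumes n is a multiple of 8 and the pointers are 16-byte aligned. */
void fma_half_avx2(const unsigned short *a, const unsigned short *b,
                   unsigned short *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_cvtph_ps(_mm_load_si128((const __m128i *)(a + i)));
        __m256 vb = _mm256_cvtph_ps(_mm_load_si128((const __m128i *)(b + i)));
        __m256 vc = _mm256_cvtph_ps(_mm_load_si128((const __m128i *)(c + i)));
        _mm_store_si128((__m128i *)(c + i),
                        _mm256_cvtps_ph(_mm256_fmadd_ps(va, vb, vc), 0));
    }
}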
Maybe future versions will support half-float operations natively (as well as quad precision).
Jim Dempsey
Optimization of matrix multiplication on multi-level cache hierarchies is a very big and complex topic....
For mainstream Intel processors, high-performance (real) matrix multiplication is typically implemented with 2-3 levels of blocking:
- Register blocking
- L2 cache blocking
- L3/TLB blocking
Note that L1 blocking is not typically used, because the L1 is too small to hold one block from each of the three arrays. Fortunately, register blocking reduces the bandwidth requirement enough that the much higher bandwidth of the L1 is not required -- the bandwidth of the L2 cache is adequate.
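To make that structure concrete, here is a rough skeleton of the blocking (the block sizes are placeholders, and a real implementation adds a register tile of C, data packing, and software prefetch):
/* Cache-blocked SGEMM skeleton: C += A * B, all n x n, row-major.
   MB/NB/KB are illustrative block sizes chosen to keep the working set
   in the L2 cache; n is assumed to be a multiple of each block size. */
enum { MB = 64, NB = 64, KB = 64 };

void sgemm_blocked(int n, const float *A, const float *B, float *C)
{
    for (int ib = 0; ib < n; ib += MB)
        for (int kb = 0; kb < n; kb += KB)
            for (int jb = 0; jb < n; jb += NB)
                /* one L2-resident block multiply; register blocking would
                   replace this scalar inner kernel with a small tile of C
                   kept in registers */
                for (int i = ib; i < ib + MB; i++)
                    for (int j = jb; j < jb + NB; j++) {
                        float sum = C[i * n + j];
                        for (int k = kb; k < kb + KB; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}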
For KNL the best implementation for double precision looks quite different than the implementation on Haswell processors, and I am still trying to understand some of the details.... I have not looked at the implementation for single precision, but based on Figure 1 at https://software.intel.com/en-us/articles/intel-xeon-phi-delivers-competitive-performance-for-deep-learning-and-getting-better-fast it looks like the MKL performance for SGEMM is good enough that there are very few extra cycles available for data precision conversion....
>> Do you have other suggestions?
Compare your half-float version with MKL single-precision SGEMM. Use various array sizes, including the representative size you intend to use. What is unknown (to me) is whether _mm512_cvtph_ps and _mm512_cvtps_ph, with their latencies of 7 and throughputs of 8-9, interfere with memory loads and stores as well as with fmadd_ps. My unfounded suspicion is that because throughput ~= latency, the VPU will be tied up (at least for this version of KNL). Haswell and Broadwell may be a different story (but with code significantly different from what you show).
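For the MKL baseline, something like this would do (a sketch; square row-major matrices are my assumption):
#include <mkl.h>

/* Single-precision reference: C = A * B for n x n row-major matrices. */
void sgemm_baseline(int n, const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}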
Jim Dempsey
Hi John,
Thanks for the reply. Do you mean that it is hard to hide the penalty of the precision conversion, so it is not easy for half floats to achieve better performance?
Thanks!
Zhen
Mccalpin, John wrote:
Optimization of matrix multiplication on multi-level cache hierarchies is a very big and complex topic....
For mainstream Intel processors, high-performance (real) matrix multiplication is typically implemented with 2-3 levels of blocking:
- Register blocking
- L2 cache blocking
- L3/TLB blocking
Note that L1 blocking is not typically used, because the L1 is too small to hold one block from each of the three arrays. Fortunately, register blocking reduces the bandwidth requirement enough that the much higher bandwidth of the L1 is not required -- the bandwidth of the L2 cache is adequate.
For KNL the best implementation for double precision looks quite different than the implementation on Haswell processors, and I am still trying to understand some of the details.... I have not looked at the implementation for single precision, but based on Figure 1 at https://software.intel.com/en-us/articles/intel-xeon-phi-delivers-competitive-performance-for-deep-learning-and-getting-better-fast it looks like the MKL performance for SGEMM is good enough that there are very few extra cycles available for data precision conversion....
Hi Jim,
I do not quite understand. Do you suspect that cvtph_ps/cvtps_ph interferes with loads and stores, and that this is why the half-float version has low performance? You mentioned that the VPU will be tied up; do you mean the VPU should be fully utilized, but for some reason it is not?
I also ran the same code on Haswell, changing only the intrinsics. On Haswell, the half-float version still performs very poorly compared with the float version.
Thanks!
Zhen
jimdempseyatthecove wrote:
>> Do you have other suggestions?
Compare your half-float version with MKL single-precision SGEMM. Use various array sizes, including the representative size you intend to use. What is unknown (to me) is whether _mm512_cvtph_ps and _mm512_cvtps_ph, with their latencies of 7 and throughputs of 8-9, interfere with memory loads and stores as well as with fmadd_ps. My unfounded suspicion is that because throughput ~= latency, the VPU will be tied up (at least for this version of KNL). Haswell and Broadwell may be a different story (but with code significantly different from what you show).
Jim Dempsey
On the web page I referred to above, KNL SGEMM performance (using MKL) was as high as about 4.5 TFLOPS on the Xeon Phi 7250. The nominal peak single-precision performance on that processor is about 6.1 TFLOPS (1.4 GHz * 68 cores * 64 FLOPS/cycle), so the 4.5 TFLOPS results are at about 74% of the nominal peak performance. (The actual frequency during execution may be lower or higher -- I have not seen much frequency throttling on KNL for the DGEMM kernel, but it throttles a lot when running the HPL benchmark -- another mystery to investigate....)
Obtaining 74% of peak performance means that the processor is executing an average of 1.48 vector FMA instructions every cycle, out of a maximum instruction issue capability of 2 instructions per cycle. The SGEMM code requires some non-FMA instructions as well (loads, pointer increments, prefetches, compare & branch, etc.), each of which effectively displaces the execution of an FMA instruction. I have not looked at the details for SGEMM, but for DGEMM almost 20% of the instructions are not FMAs -- and 20% of all instructions corresponds to one non-FMA per four FMAs, i.e., a 25% overhead on the FMA count. If the code is similar, then 1.48 FMA instructions per cycle plus 25% non-FMA overhead is 1.85 instructions per cycle, or 92.5% of the peak instruction issue capability of the processor. The "lost" 7.5% is due to non-overlapped cache misses, branch mispredictions, and non-overlapped instruction latencies. If you increase the number of instructions by more than 7.5%, the increased time to issue the instructions is guaranteed to outweigh any possible benefits.
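To spell the arithmetic out (all numbers from above):
\[
\frac{4.5\ \mathrm{TFLOPS}}{1.4\ \mathrm{GHz} \times 68\ \mathrm{cores} \times 64\ \mathrm{FLOPS/cycle}} \approx 0.74, \qquad
0.74 \times 2 = 1.48\ \mathrm{FMAs/cycle}, \qquad
1.48 \times 1.25 = 1.85\ \mathrm{instructions/cycle}.
\]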
In practice it will be much worse than this because the float to half-float conversion instructions do not appear to be fully pipelined. Agner Fog's instruction tables show that you can only execute one VCVTPS2PH every 7 cycles in each FPU. There may be partial overlap between these instructions and surrounding arithmetic instructions, but even with perfect overlap you can easily calculate a lower bound on the cycle count for converting the data from and to half precision -- 14 cycles/2 FPUs for every 16 elements. With a blocked SGEMM implementation you can re-use the 32-bit data from only one of the three arrays, while the other two arrays are swapped out after each sub-block, so they have to pay this translation penalty for every block.
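Restating that lower bound with the numbers above:
\[
\frac{7\ \mathrm{cycles}\ (\mathrm{VCVTPH2PS}) + 7\ \mathrm{cycles}\ (\mathrm{VCVTPS2PH})}{2\ \mathrm{FPUs}} = 7\ \mathrm{cycles\ per\ 16\ elements},
\]
and in those 7 cycles the two FPUs could otherwise have issued 14 512-bit FMA instructions (224 single-precision FMA operations).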
The short summary is that the half-precision conversion instructions only exist for compatibility reasons on KNL, and not for performance. Ivy Bridge, Haswell, Broadwell, and Skylake have full-performance implementations of these instructions (i.e., a throughput of one instruction per cycle, rather than one instruction every seven cycles), so those platforms are much more likely to be interesting targets for this sort of experimentation....
Hi John,
Thanks so much for the detailed explanation!
Best,
Zhen
Hi Jim,
Thanks for your reply and the suggestion. I have attached the code; it performs matrix multiplication. I want to accelerate it by using half floats so as to reduce cache misses. Do you have other suggestions?
Thanks!
Zhen
Hi Sergey,
Sorry for only just reading your reply; I should have found it earlier. I do not perform matrix multiplication here; I just want to use half precision to multiply and add. I have attached the code.
Thanks!
Zhen
