I need to repeatedly perform a 4000x2620 single-precision matrix-vector multiplication (40 MiB of matrix data) with a new vector in every iteration.
On a dual Xeon Gold 6154 (18C, 3.0 GHz, 24.75 MiB L3) this takes about 200 us (105 GFLOPS) if I use the entire second CPU of the computer (18 threads) for this task. The same thing takes only about 110 us (190 GFLOPS) on a dual Xeon E5-2698 v4 (20C, 2.2 GHz, 50 MiB L3) if I use that computer's entire second CPU (20 threads). That is, the Broadwell is practically twice as fast as the comparably priced Skylake model that is listed as its successor at e.g. https://www.siliconmechanics.com/i78151/xeonscalable.php.
Why is the Skylake so much slower than its predecessor in this specific task? Is this to be expected? I see that the Broadwell's L3 is large enough to store the full matrix, but the Skylake's non-inclusive L3 + L2 (24.75 MiB + 18 MiB = 42.75 MiB) should also be big enough to keep the matrix in the CPU without having to go to RAM.
What Xeon SP model would you expect to keep up with the Broadwell in this case?
My benchmark program runs with high SCHED_FIFO priority, calls cblas_sgemv() 100,000 times, and outputs the average and peak performance of this call. Results with both OpenBLAS 0.3.4 (compiled for Broadwell and Skylake-X) and Intel MKL are similar. All arrays are 64-byte aligned.
(Only for much smaller matrices does the Skylake reach up to about 400 GFLOPS, whereas the Broadwell maxes out at around 220 GFLOPS.)
This is certainly an interesting subject. Assuming you use MKL, the MKL forum would be more likely to attract an expert response. More data would be interesting. For example, show results of a comparison with both machines running the same code, which may be possible with the CBWR (conditional numerical reproducibility) facility, and with both using the automatic choice of ISA-specific code. MKL experts would want to know specifically which release you use.
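If I understand the suggestion correctly, the CBWR code-path pinning can be done with MKL's MKL_CBWR environment variable, e.g. (the binary name `./bench` below is a placeholder for the benchmark program):

```shell
# Pin both machines to the same MKL code path; AVX2 is the widest
# ISA common to Broadwell and a Skylake-SP run without AVX-512.
export MKL_CBWR=AVX2
# ./bench    # placeholder for the benchmark binary
```

Comparing this against the default automatic dispatch on each machine would separate "different code path" effects from genuine hardware differences.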
VTune or similar analysis might be interesting, but possibly cumbersome without source code.
It's reminiscent of past CPU introductions. For example, Sandy Bridge and Ivy Bridge had insufficient bandwidth between L1 and L2 to take advantage of AVX instructions when data locality to L1 wasn't achieved. Without a change in that data path, there would have been little point in AVX2 introduction.
I'm using the latest versions of both MKL (2019.1) and OpenBLAS (0.3.4). I'm unable to identify any performance difference between the two on those two computers. I attach the source code of my little benchmark program. Please refer to the file for how I compiled it and how I ran it.
What kind of data would you be interested in? Both timings I listed before are the output of this benchmark program.