I need to repeatedly perform a 4000x2620 single-precision matrix-vector multiplication (40 MiB of matrix data) with a new vector in every iteration.
On a dual Xeon Gold 6154 (18C, 3.0 GHz, 24.75 MiB L3) this takes about 200 us (105 GFLOPS) if I use the entire second CPU of the computer (18 threads) for this task. The same thing takes only about 110 us (190 GFLOPS) on a dual Xeon E5-2698 v4 (20C, 2.2 GHz, 50 MiB L3) if I use that computer's entire second CPU (20 threads). That is, the Broadwell is practically twice as fast as the comparably priced Skylake model that is listed as its successor at e.g. https://www.siliconmechanics.com/i78151/xeonscalable.php.
Why is the Skylake so much slower than its predecessor in this specific task? Is this to be expected? I see that the Broadwell's L3 is large enough to store the full matrix, but the Skylake's non-inclusive L3 + L2 (24.75 MiB + 18 MiB = 42.75 MiB) should also be big enough to keep the matrix in the CPU without having to go to RAM.
What Xeon SP model would you expect to keep up with the Broadwell in this case?
My benchmark program runs with high SCHED_FIFO priority, calls cblas_sgemv() 100,000 times, and outputs the average and peak performance of this call. Results with both OpenBLAS 0.3.4 (compiled for Broadwell and Skylake-X) and Intel MKL are similar. All arrays are 64-byte aligned.
(Only for much smaller matrices does the Skylake reach up to about 400 GFLOPS, whereas the Broadwell maxes out at around 220 GFLOPS.)
This is certainly an interesting subject. Assuming you use MKL, the MKL forum would be more likely to attract an expert response. More data would be interesting. For example, show results of a comparison with both machines running the same code, which may be possible with the CBWR (conditional numerical reproducibility) facility, and with both using the automatic choice of ISA-specific code. MKL experts would want to know specifically which release you use.
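If I understand the suggestion correctly, the CBWR code-path pinning can be done with MKL's MKL_CBWR environment variable, e.g. (the binary name `./bench` below is a placeholder for the benchmark program):

```shell
# Pin both machines to the same MKL code path; AVX2 is the widest
# ISA common to Broadwell and a Skylake-SP run without AVX-512.
export MKL_CBWR=AVX2
# ./bench    # placeholder for the benchmark binary
```

Comparing this against the default automatic dispatch on each machine would separate "different code path" effects from genuine hardware differences.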
VTune or similar analysis might be interesting, but possibly cumbersome without source code.
It's reminiscent of past CPU introductions. For example, Sandy Bridge and Ivy Bridge had insufficient bandwidth between L1 and L2 to take advantage of AVX instructions when data locality to L1 wasn't achieved. Without a change in that data path, there would have been little point in AVX2 introduction.
I'm using the latest versions of both MKL (2019.1) and OpenBLAS (0.3.4). I'm unable to identify any performance difference between the two on those two computers. I attach the source code of my little benchmark program. Please refer to the file for how I compiled it and how I ran it.
What kind of data would you be interested in? Both timings I listed before are the output of this benchmark program.