performance numbers MKL 11.0 vs Eigen?

Hello,

I found the results here a bit surprising, especially the MVM one (matrix-vector multiplication, with and without transposition)... how come MKL, which even has AVX and is heavily optimized, gets lower performance than Eigen, which has only implemented SSE2? http://eigen.tuxfamily.org/index.php?title=Benchmark

They also state that the benchmarks correspond to the latest MKL 11.0.

I understand that they outperform MKL for "complex expressions" thanks to expression templates; that part is clear. But how come they still appear to outperform MKL in the plain MVM primitives?
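Just to be concrete, by MVM I mean the plain matrix-vector product, roughly as in the sketch below (only an illustration; the size n and the setup are placeholders, not values from the benchmark):

#include <Eigen/Dense>
#include <mkl.h>

// Sketch of the matrix-vector product each library is measured on.
// The size n is an arbitrary placeholder; the benchmark sweeps a range of sizes.
void mvm_example(int n)
{
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::VectorXd x = Eigen::VectorXd::Random(n);
    Eigen::VectorXd y(n);

    // Eigen: evaluated by Eigen's own vectorized kernels.
    y.noalias() = A * x;

    // MKL: the equivalent BLAS level-2 call (column-major, no transpose).
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                1.0, A.data(), n, x.data(), 1,
                0.0, y.data(), 1);
}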

Thanks in advance,

Best regards,

Giovanni

4 Replies
Gennady_F_Intel
Moderator

What are the problem sizes in that case? It might happen for small inputs.

Indeed, the sizes on the MV chart are 100-1000, which is very small and quite unusual for HPC. As you can see, there is a significant drop near 1000, which means the task no longer fits into the last-level cache. Frankly speaking, it makes sense to assess a memory-limited MV operation starting roughly from that point (not finishing the measurements there).

Another unclear aspect of all those charts is the use of only 1 thread on a machine with 4 cores. I can only guess that the reason is that the majority of Eigen operations are not threaded. Considering only 1-thread MV performance at such small sizes - yes, it might be that Eigen is faster than all other libraries in this particular case. But that is because all the libraries have additional overhead associated with the calling stack and, probably, because this case has the lowest priority for real tasks.

BTW, Eigen provides an easy way to use Intel(R) MKL as a backend: http://eigen.tuxfamily.org/dox-devel/TopicUsingIntelMKL.html
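A minimal sketch of what that looks like (assuming MKL is installed and linked; see the page above for the exact build requirements):

// Route Eigen's dense operations to Intel(R) MKL where supported.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main()
{
    const int n = 1000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::VectorXd x = Eigen::VectorXd::Random(n);

    // With EIGEN_USE_MKL_ALL defined, suitable products such as this one
    // are dispatched to the corresponding MKL routines.
    Eigen::VectorXd y = A * x;
    return 0;
}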

With respect to AVX - please note that the Intel(R) Core(TM)2 Quad CPU Q9400 used in the measurements does not support AVX at all.
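If in doubt, AVX support can be checked at runtime; a small sketch using the GCC/Clang builtin (not part of either library):

#include <cstdio>

int main()
{
    // __builtin_cpu_supports is a GCC/Clang builtin; on a Core 2 Quad Q9400
    // it reports that AVX is not available.
    if (__builtin_cpu_supports("avx"))
        std::printf("AVX is available on this CPU\n");
    else
        std::printf("No AVX on this CPU\n");
    return 0;
}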
Gael_G_
Beginner

Indeed, this benchmark is quite old and was performed on a CPU with no AVX support. Activating multi-threading for a matrix-vector operation makes little sense, since most of the time the application is parallelized at a higher level (e.g., a matrix factorization). The benchmark goes up to matrix sizes of 3000 (not 1000). For larger matrices, all libraries perform poorly, since caching strategies cannot be used for level-2 operations. The good performance of Eigen here is mainly due to a clever trick to completely avoid unaligned memory accesses in all situations: we form one unaligned packet from two aligned loads. More details in the code!
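To illustrate the idea (only a sketch, not Eigen's actual kernel): with SSE2 and double precision, if a pointer p is offset by exactly one double from a 16-byte boundary, the packet {p[0], p[1]} can be assembled from two aligned loads instead of one unaligned load:

#include <emmintrin.h>  // SSE2 intrinsics

// Builds {p[0], p[1]} without an unaligned load. Assumes p is misaligned by
// exactly one double, so p-1 and p+1 are both 16-byte aligned, and that
// reading p[-1] and p[2] is legal (the surrounding buffer covers them).
static inline __m128d load_via_aligned(const double* p)
{
    __m128d lo = _mm_load_pd(p - 1);   // aligned load: { p[-1], p[0] }
    __m128d hi = _mm_load_pd(p + 1);   // aligned load: { p[1],  p[2] }
    return _mm_shuffle_pd(lo, hi, 1);  // take p[0] from lo and p[1] from hi
}

In a real kernel the aligned load of one iteration can be reused as the low half of the next one, so the extra load largely disappears; for the exact scheme, see the Eigen source as suggested above.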