Hello,
I found the results here a bit surprising, especially the MVM ones (matrix-vector multiplication with and without transposition): how can MKL, which even has AVX and is heavily optimized, get lower performance than Eigen, which only implements SSE2? http://eigen.tuxfamily.org/index.php?title=Benchmark
They also state that the benchmarks correspond to the latest MKL 11.0.
I understand that they outperform MKL for "complex expressions" thanks to expression templates, that much is clear, but how can they also appear to outperform MKL in the MVM primitives?
Thanks in advance,
Best regards,
Giovanni
4 Replies
What are the problem sizes in that case?
It might happen for small inputs.
Indeed, the sizes in the MV chart are 100-1000, which is very small and quite unusual for HPC. As you can see, there is a significant drop near 1000, which means the task no longer fits into the last-level cache. Frankly speaking, it makes sense to start assessing the memory-limited MV operation roughly from that point (rather than ending the measurements there). Another unclear aspect of all those charts is the use of only 1 thread on a machine with 4 cores. I can only guess that the reason is that the majority of Eigen operations are not threaded.
Considering only single-threaded MV performance at such small sizes: yes, it may well be that Eigen is faster than all the other libraries in this particular case. But that is because all the libraries have additional overhead associated with the calling stack and, probably, because this case has the lowest priority for real workloads.
BTW, Eigen provides an easy way to use Intel(R) MKL as a backend:
http://eigen.tuxfamily.org/dox-devel/TopicUsingIntelMKL.html
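For what it's worth, a minimal sketch of how that backend is enabled (assuming MKL is installed and on the link line; the matrix size here is only illustrative):

```cpp
// Defining EIGEN_USE_MKL_ALL before including Eigen makes Eigen forward
// supported operations, such as this matrix-vector product, to MKL.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main()
{
  const int n = 2000;  // illustrative size, large enough to be BLAS-friendly
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  Eigen::VectorXd x = Eigen::VectorXd::Random(n);
  Eigen::VectorXd y = A * x;  // should be dispatched to MKL's dgemv under EIGEN_USE_MKL_ALL
  return y.size() == n ? 0 : 1;
}
```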
With respect to AVX: please note that the Intel(R) Core(TM)2 Quad CPU Q9400 used in the measurements does not support AVX.
Indeed, this benchmark is quite old and was performed on a CPU with no AVX support. Activating multi-threading for a matrix-vector operation makes little sense, since most of the time the application is parallelized at a higher level (e.g., a matrix factorization). The benchmark goes up to matrix sizes of 3000 (not 1000). For larger matrices, all libraries perform poorly, since caching strategies cannot be used for level-2 operations. The good performance of Eigen here is mainly due to a clever trick that completely avoids unaligned memory accesses in all situations: we form one unaligned packet from two aligned loads. More details in the code!
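To illustrate the trick being described, here is a minimal standalone SSE2 sketch (not Eigen's actual kernel, where the lane offset is handled generically): an "unaligned" 4-float packet at a 1-float offset is assembled from two aligned loads.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (includes SSE)
#include <cstdio>

// `data` must be 16-byte aligned and hold at least 8 floats.
// Returns the packet data[1..4] without issuing any unaligned load.
static __m128 load_offset1(const float* data)
{
  __m128 lo = _mm_load_ps(data);      // aligned load of data[0..3]
  __m128 hi = _mm_load_ps(data + 4);  // aligned load of data[4..7]
  __m128 t  = _mm_move_ss(lo, hi);    // [hi0, lo1, lo2, lo3]
  // Rotate the lanes so the result is [lo1, lo2, lo3, hi0], i.e. data[1..4].
  return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(t), 0x39));
}

int main()
{
  alignas(16) float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  alignas(16) float out[4];
  _mm_store_ps(out, load_offset1(data));
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // prints: 1 2 3 4
  return 0;
}
```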