I am benchmarking DGEQRF and DPOTRF from MKL 2023.2.0 on a two-socket Intel(R) Xeon(R) Platinum 8480CL system with hyperthreading enabled. I prefix my benchmark executable with `KMP_AFFINITY=granularity=fine,compact,1,0 MKL_NUM_THREADS=56 numactl -N0 -m0 ` and, as expected, htop shows that only the first 56 cores are busy during the benchmark run.
Now, I measure:
DPOTRF(uplo='L', n=32768,lda=32768) -> 5.55s
DPOTRF(uplo='L', n=32768,lda=32832) -> 4.74s
DGEQRF(m=32768,n=32768,lda=32768) -> 31.18s
DGEQRF(m=32768,n=32768,lda=32832) -> 18.55s
1. Is this performance expected? It would be very helpful if you could share the maximum performance of these routines from your benchmarks on the Intel(R) Xeon(R) Platinum 8480CL.
2. Why is there such a big performance improvement when I set `lda=m+64`?
3. What is the expected maximum parallel and sequential DGEMM GFLOPS on this chip?
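For context, the timings above can be converted into approximate GFLOP/s. The sketch below assumes the usual leading-order flop counts (n^3/3 for Cholesky, 2n^2(m - n/3) for Householder QR with m >= n); these are textbook approximations rather than figures from this thread or from Intel, and the helper names are hypothetical.

```python
# Rough GFLOP/s estimates from the measured wall-clock times.
# Assumption: leading-order flop counts only; lower-order terms are ignored.

def potrf_flops(n):
    """Approximate flop count of a Cholesky factorization of an n x n matrix."""
    return n**3 / 3.0

def geqrf_flops(m, n):
    """Approximate flop count of a Householder QR of an m x n matrix (m >= n)."""
    return 2.0 * n**2 * (m - n / 3.0)

def gflops(flops, seconds):
    """Convert a flop count and a runtime in seconds into GFLOP/s."""
    return flops / seconds / 1e9

n = 32768
# (routine, leading dimension, measured seconds, estimated flops)
measurements = [
    ("DPOTRF", 32768, 5.55, potrf_flops(n)),
    ("DPOTRF", 32832, 4.74, potrf_flops(n)),
    ("DGEQRF", 32768, 31.18, geqrf_flops(n, n)),
    ("DGEQRF", 32832, 18.55, geqrf_flops(n, n)),
]

for routine, lda, seconds, flops in measurements:
    print(f"{routine} lda={lda}: {gflops(flops, seconds):.0f} GFLOP/s")
```

Comparing the two rows of each routine shows how much of the achievable rate is lost with `lda=32768`.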
Hi,
Thanks for posting in Intel Communities.
Could you please provide a sample reproducer and your OS details so that we can reproduce and investigate the issue on our end?
Regards,
Jilani
Hi,
We have not heard back from you. Could you please provide us with an update?
Regards,
Jilani
- Yes, this performance is expected.
- 32768 is a “bad” leading dimension because it is a large power of two, hence the poor performance. Please check the notes about padding matrices in the documentation about offloading computations, especially Rule 2: "For best performance, leading dimensions should not be a multiple of a large power of 2 (e.g. 4096 bytes). Increasing the leading dimension slightly (e.g. from 4096 bytes to 4096+64 bytes) can improve performance in some cases."
- Please check the official MKL product page, https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html#gs.qw2c2p, for sgemm performance results on the same CPU. Double-precision results would be roughly 2x lower.
--Gennady
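The padding rule quoted above can be sketched as a small helper that picks a leading dimension before the matrix is allocated. This is only an illustration: the function names, the 4096-byte threshold (taken from the rule's example), and the 64-element pad (mirroring the `lda=m+64` used in the question) are assumptions, not MKL API or an official recommendation.

```python
# Hypothetical helper: choose a leading dimension that avoids strides that are
# multiples of a large power of two in bytes.

def padded_ld(n, elem_bytes=8, bad_stride_bytes=4096, pad_elems=64):
    """Return a leading dimension >= n for a column-major matrix with n rows.

    If the column stride n * elem_bytes is a multiple of bad_stride_bytes,
    pad it by pad_elems elements; otherwise keep n unchanged.
    """
    if (n * elem_bytes) % bad_stride_bytes == 0:
        return n + pad_elems
    return n

def col_major_index(i, j, lda):
    """Flat index of element (i, j) in a column-major array with stride lda."""
    return i + j * lda

n = 32768
lda = padded_ld(n)                 # doubles: 8 bytes per element
extra = (lda - n) / n              # fraction of extra memory spent on padding
print(f"n={n} lda={lda} padding overhead={extra:.4%}")
```

The matrix is then allocated with `lda * n` elements and `lda` is passed to the LAPACK routine in place of `n`; the padding rows are never referenced by the factorization.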
Hi,
A gentle reminder:
We have not heard back from you. Could you please provide us with an update?
Regards,
Jilani
