Hi,
I ran a larger linear algebra workflow that calls various BLAS and LAPACK functions, including potrf, trsm, potri, syrk, symm and gemm. The arrays involved can be up to 150,000 x 150,000.
I have observed that the workflow's processing time increases substantially when it is executed on an AMD CPU, to the point that the workflow had to be interrupted. Investigating the issue, I found that potri is the culprit: the MKL call does not even return. While potri is running, core usage is multi-core but fluctuates substantially between all cores and only one.
Here is the setup:
- compiled with Intel Clang++ with arguments:
  - -march=x86-64-v4
  - -std=c++20
  - -fPIE
  - -std=gnu++20
  - -ferror-limit=4 -O2
  - -qopenmp
  - -fp-model=precise
- all libraries are statically linked (including MKL, pthread and libiomp5)
System libraries are linked dynamically:
linux-vdso.so.1 (0x00000673cc8ff000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00000673cc807000)
libmvec.so.1 => /lib/x86_64-linux-gnu/libmvec.so.1 (0x00000673c0707000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00000673c06d9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00000673c0400000)
/lib64/ld-linux-x86-64.so.2 (0x00000673cc901000)
The software is executed on an Azure instance running Ubuntu 24.04 with the following environment settings:
ulimit -s unlimited
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export OMP_DYNAMIC=FALSE
export OMP_MAX_ACTIVE_LEVELS=2147483647
export OMP_PLACES=cores
export OMP_PROC_BIND=true
On an E96s v6 instance with an INTEL(R) XEON(R) PLATINUM 8573C processor the software behaves normally.
On an E96ads_v6 instance with an AMD EPYC™ 9004 processor the software hangs in the MKL potri routine.
Note that none of the other MKL routines (potrf, trsm, etc.) have shown this problem.
Any idea?
The processing time for dpotri on a 1146878 x 1146878 matrix:
| Processor | dpotri time |
| --- | --- |
| INTEL(R) XEON(R) PLATINUM 8573C, 48 cores | 3198 seconds |
| AMD EPYC™ 9004, 48 cores | 21141 seconds |
This is an increase by a factor of about 6.6 (21141 / 3198).
The same numbers apply when compiling with clang++ and linking against LLVM OpenMP.
The MKL version used is 2026.0.0.198.