Intel® oneAPI Math Kernel Library

DPOTRI problems on AMD

may_ka

Hi,

 

I ran a larger linear algebra workflow that calls various BLAS and LAPACK functions, including potrf, trsm, potri, syrk, symm and gemm. The arrays involved can be up to 150,000 x 150,000.
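To give a sense of scale, here is a rough sketch of the memory footprint of one such array. It assumes double-precision elements (as the "d" in dpotri suggests) and full, non-packed storage; the helper names are made up for illustration. Under those assumptions a 150,000 x 150,000 matrix takes 180 GB (about 167.6 GiB).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes needed for a dense n x n matrix in full (non-packed) storage.
// Assumption: 8-byte double-precision elements unless elem_size says otherwise.
inline std::uint64_t dense_matrix_bytes(std::uint64_t n,
                                        std::uint64_t elem_size = sizeof(double)) {
    assert(elem_size > 0);
    // Guard against 64-bit overflow for very large n.
    assert(n == 0 ||
           n <= std::numeric_limits<std::uint64_t>::max() / n / elem_size);
    return n * n * elem_size;
}

// Convert a byte count to GiB (2^30 bytes) for comparison with instance memory.
inline double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

For example, `to_gib(dense_matrix_bytes(150000))` evaluates to roughly 167.6.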

I have observed that the workflow's processing time increases substantially when it is executed on an AMD CPU, to the point that the workflow had to be interrupted. Investigating the issue, I found that potri is the culprit: the MKL call does not even return. While potri is running, the load is spread over multiple cores but fluctuates substantially between using all cores and using only one.
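For anyone wanting to narrow this down in their own code, one way to see which routine dominates is to bracket each call with a wall-clock timer. This is only a minimal sketch: `seconds_elapsed` is a made-up helper, not part of MKL, and the potri call appears only as a comment because the original code is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Run a callable once and return the elapsed wall-clock time in seconds.
template <class F>
double seconds_elapsed(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    // steady_clock is monotonic, so the elapsed time cannot be negative.
    assert(stop >= start);
    return std::chrono::duration<double>(stop - start).count();
}

// Usage idea (the body is a placeholder for the real routine call):
//   double t_potri = seconds_elapsed([&] { /* call potri here */ });
```

Timing potrf, trsm, potri and the rest separately on both instances would show whether the difference is confined to a single routine.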

 

Here is the setup:

  1. compiled with Intel Clang++ with arguments:
    1. -march=x86-64-v4
    2. -std=c++20
    3. -fPIE
    4. -std=gnu++20
    5. -ferror-limit=4 -O2
    6. -qopenmp
    7. -fp-model=precise
  2. all libraries are statically linked (including MKL, pthread and libiomp5)

System libraries are linked dynamically:

linux-vdso.so.1 (0x00000673cc8ff000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00000673cc807000)
libmvec.so.1 => /lib/x86_64-linux-gnu/libmvec.so.1 (0x00000673c0707000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00000673c06d9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00000673c0400000)
/lib64/ld-linux-x86-64.so.2 (0x00000673cc901000)

 

The software is executed on an Azure instance running Ubuntu 24.04 with the following shell and environment settings:

ulimit -s unlimited
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export OMP_DYNAMIC=FALSE
export OMP_MAX_ACTIVE_LEVELS=2147483647
export OMP_PLACES=cores
export OMP_PROC_BIND=true
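To confirm that the process really sees these settings on both instances, the program could log them at startup. The sketch below uses only `std::getenv`; the helper names `env_or` and `threading_env_summary` are made up for illustration and are not part of MKL or OpenMP.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Return the value of an environment variable, or `fallback` if it is unset.
inline std::string env_or(const char* name, const std::string& fallback) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : fallback;
}

// Collect the threading-related variables listed above into one log line.
inline std::string threading_env_summary() {
    static const char* const names[] = {
        "OMP_NUM_THREADS", "MKL_NUM_THREADS", "OMP_DYNAMIC",
        "OMP_MAX_ACTIVE_LEVELS", "OMP_PLACES", "OMP_PROC_BIND"};
    std::string line;
    for (const char* name : names) {
        if (!line.empty()) line += ' ';
        line += name;
        line += '=';
        line += env_or(name, "<unset>");
    }
    return line;
}
```

Printing `threading_env_summary()` once on each machine makes it easy to rule out a difference in the environment.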

On an E96s v6 instance with an INTEL(R) XEON(R) PLATINUM 8573C processor the software behaves normally.

On an E96ads_v6 instance with an AMD EPYC™ 9004 processor the software hangs in the MKL potri routine.

Note that none of the other MKL routines (potrf, trsm, etc.) have shown this problem.

 

Any idea?

may_ka

The processing time for dpotri on a 1146878 x 1146878 matrix:

 

INTEL(R) XEON(R) PLATINUM 8573C, 48 cores: 3198 seconds
AMD EPYC™ 9004, 48 cores: 21141 seconds

 

This is an increase by a factor of 6.6.
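The factor is simply the ratio of the two timings, 21141 / 3198. As a small sketch (the function name is made up for illustration):

```cpp
#include <cassert>

// Ratio of the slower run time to the baseline run time.
inline double slowdown_factor(double baseline_seconds, double slower_seconds) {
    assert(baseline_seconds > 0.0);
    return slower_seconds / baseline_seconds;
}
```

With the numbers above, `slowdown_factor(3198.0, 21141.0)` gives about 6.61.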

The above numbers also apply when compiling with clang++ and linking against LLVM OpenMP.

The MKL version used is 2026.0.0.198.
