DPOTRI problems on AMD

may_ka · ‎06-26-2026

Hi,

I ran a large linear algebra workflow that calls various BLAS and LAPACK functions, including potrf, trsm, potri, syrk, symm and gemm. The arrays involved can be up to 150,000 x 150,000.

I have observed that the workflow's processing time increases substantially when executed on an AMD CPU, to the point that the workflow had to be interrupted. Investigating the issue, I found that potri is the culprit: the MKL call does not even return. While potri is running, core usage is multi-core but fluctuates substantially between using all cores and only one.

Here is the setup:

Compiled with Intel Clang++ with the following arguments:
1. -march=x86-64-v4
2. -std=c++20
3. -fPIE
4. -std=gnu++20
5. -ferror-limit=4 -O2
6. -qopenmp
7. -fp-model=precise
All libraries are statically linked (including MKL, pthread and libiomp5).

System libraries are linked dynamically:

linux-vdso.so.1 (0x00000673cc8ff000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00000673cc807000)
libmvec.so.1 => /lib/x86_64-linux-gnu/libmvec.so.1 (0x00000673c0707000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00000673c06d9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00000673c0400000)
/lib64/ld-linux-x86-64.so.2 (0x00000673cc901000)

The software is executed on an Azure instance running Ubuntu 24.04 with the following environment settings:

ulimit -s unlimited
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export OMP_DYNAMIC=FALSE
export OMP_MAX_ACTIVE_LEVELS=2147483647
export OMP_PLACES=cores
export OMP_PROC_BIND=true

On an E96s_v6 instance with an INTEL(R) XEON(R) PLATINUM 8573C processor the software behaves normally.

On an E96ads_v6 instance with an AMD EPYC™ 9004 processor the software hangs in the MKL potri routine.

Note that all other MKL routines (potrf, trsm etc) have not shown the above problems.

Any idea?

may_ka · ‎06-26-2026

The processing time for dpotri on a 1146878 x 1146878 matrix:

INTEL(R) XEON(R) PLATINUM 8573C, 48 cores	3198 seconds
AMD EPYC™ 9004, 48 cores	21141 seconds

This is an increase by a factor of 6.6.

The same timings are observed when compiling with clang++ and linking against LLVM OpenMP.

The MKL version used is 2026.0.0.198.

ivanp · ‎07-02-2026

A quick calculation shows the size of this matrix is almost 10 TB (the figure below is in GiB):

>>> 1146878**2 * 8 / 1024**3
9799.965820342302

According to this page, the Intel Xeon can address 4 TB of memory. And according to this sheet, the AMD EPYC could have up to 6 TB.

Could swapping to disk be the issue?

ivanp · ‎07-02-2026

I mistakenly linked the spec for the Intel® Xeon® Platinum 8570 Processor, not the 8573C.

But the Azure pages (links given below) show the following:

Size Name	vCPUs (Qty.)	Memory (GB)
Standard_E96ads_v6	96	672
Standard_E96s_v6	96	768

The machines differ in the local and remote storage they have available.

Sources:
- https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/memory-optimized/esv6-series?tabs=sizebasic
- https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/memory-optimized/eadsv6-series?tabs=sizebasic

may_ka · ‎07-03-2026

Hi @ivanp

the actual matrix dimension is 146878, not 1146878. Sorry for the confusion.

I was able to solve the problem. A large range in the diagonal elements, resulting in borderline floating-point numbers, causes the issue. For some reason Intel handles this without overhead, unlike AMD.
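
One way to check an input for this condition is sketched below in plain Python. This is only an illustration: the helper name `diagonal_report` and its returned fields are hypothetical and not part of MKL or LAPACK. It assumes the diagonal has been extracted as a list of doubles, and it reports the spread of magnitudes and how many entries are subnormal.

```python
import sys

def diagonal_report(diag):
    """Summarize the magnitude spread of a matrix diagonal (hypothetical helper)."""
    smallest_normal = sys.float_info.min  # smallest positive normal double, about 2.2e-308
    magnitudes = [abs(d) for d in diag if d != 0.0]
    if not magnitudes:
        return {"subnormal": 0, "min_abs": 0.0, "max_abs": 0.0, "ratio": None}
    lo, hi = min(magnitudes), max(magnitudes)
    return {
        # Nonzero values below the smallest normal double are subnormal.
        "subnormal": sum(m < smallest_normal for m in magnitudes),
        "min_abs": lo,
        "max_abs": hi,
        "ratio": hi / lo,
    }

# Made-up example values, not data from the actual workflow.
report = diagonal_report([4.0e-300, 1e-305, 1e-310, 0.0])
print(report["subnormal"], report["ratio"])
```

A large `ratio` or a nonzero `subnormal` count would hint that intermediate products in the inversion may end up in the subnormal range as well.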

ivanp · ‎07-03-2026

No worries. At roughly 161 GiB, that should fit in memory.
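
The same back-of-the-envelope estimate as before, redone for the corrected dimension. This is a minimal sketch: the function name `dense_matrix_gib` is only for illustration, and it assumes a full dense double-precision matrix held in memory once.

```python
def dense_matrix_gib(n, bytes_per_element=8):
    """Size in GiB of a dense n x n matrix (8 bytes per element for doubles)."""
    return n * n * bytes_per_element / 1024**3

print(round(dense_matrix_gib(146878), 1))   # corrected dimension
print(round(dense_matrix_gib(1146878), 1))  # dimension from the earlier typo
```

Both instance sizes listed above (672 GB and 768 GB) are well above the first figure, so swapping is unlikely to explain the slowdown.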

I have read elsewhere that denormalized (subnormal) numbers and gradual underflow can cause significant processing slowdowns.