Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

NumPy with MKL 2023.1.0 pollutes the PySCF RI-K integral (evaluated by dgemm_)

ajz34
Beginner
2,693 Views
Hi oneMKL board,
 
I encountered a very tricky problem with MKL 2023.1.0 as shipped with conda, which also causes PySCF to break down on this specific task of RI-K integral evaluation.

My analysis suggests a potential bug in MKL 2023.1.0, though I have no decisive evidence to support this conjecture. I believe the code in PySCF is correct. I have also posted this problem at https://github.com/pyscf/pyscf/issues/2004 , where the details are described.
 
Changing the MKL version, or simply changing the way `dgemm_` is called, resolves the problem. But MKL 2023.1.0 is the default version in the conda defaults channel, and this problem has already affected the credibility of my molecular property computation results. So I'm curious why this code gives problematic results, and I look forward to hearing your thoughts on this problem.
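To illustrate the class of failure involved (a generic sketch only, not the reproducer from the GitHub issue; the shapes, the thread count, and the use of `scipy.linalg.blas.dgemm` are my own arbitrary choices for illustration): running the same `dgemm_` product from several Python threads and comparing every result against a serially computed reference is one way to expose threading-related corruption of this kind.

```python
# Generic sketch, NOT the reproducer from the GitHub issue: call the same
# DGEMM (C = alpha * A @ B.T) from several Python threads and compare each
# result against a single-threaded reference; a correct BLAS must match.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
a = rng.standard_normal((41, 36))
b = rng.standard_normal((41, 36))

reference = dgemm(1.0, a, b, trans_b=True)  # computed once, serially

def run_once(_):
    return dgemm(1.0, a, b, trans_b=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(run_once, range(64)):
        assert np.allclose(result, reference), "threaded dgemm diverged"
print("all threaded results match the reference")
```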
0 Kudos
1 Solution
Mark_L_Intel
Moderator
2,271 Views

Alternatively, you could rely on the Intel conda channel instead of the conda main channel. Intel's channel is up to date with the latest MKL version.

View solution in original post

0 Kudos
10 Replies
ajz34
Beginner
2,624 Views
# Docker workflow to reproduce this issue

 

Also refer to the file docker_issue_2004.zip for instructions on building the docker image. The same zip file is attached to this reply.

 

## Use image that is already on hub.docker.com

 

Before entering the docker interactive shell,

 

```bash
docker pull ajz34/issue2004
docker run -it --cpus=8 ajz34/issue2004:latest
```

 

After entering the docker interactive shell,

 

```bash
conda activate numpy-mkl23
python problem.py
```

 

The correct output matrix is all zeros. On Intel CPUs with more than 8 physical cores, the result can instead contain random non-zero values.

 

To reproduce this issue, please make sure the CPU is an Intel CPU with more than 8 physical cores.
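As a pre-flight check (a hypothetical helper of my own, not part of the docker image), the two conditions can be verified from Python:

```python
# Hypothetical pre-flight check, not part of the docker image: verify the
# host CPU is Intel and has more than 8 physical cores. Linux only, since
# the vendor string is read from /proc/cpuinfo.
import psutil  # third-party package: pip install psutil

physical_cores = psutil.cpu_count(logical=False)
with open("/proc/cpuinfo") as f:
    is_intel = "GenuineIntel" in f.read()

print(f"Intel CPU: {is_intel}, physical cores: {physical_cores}")
if not (is_intel and physical_cores > 8):
    print("This machine may not reproduce the issue.")
```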

 

## Build this docker image

 

To build this docker image,

 

```bash
docker build -t issue2004 .
```

 

This involves pulls from Docker Hub and GitHub, so make sure network access is available.

 

Also make sure the `problem.py` file is placed alongside the `Dockerfile`.
0 Kudos
ShanmukhS_Intel
Moderator
2,608 Views

Hi Andrew,

Thanks for posting in Intel Communities.

Changing the MKL version, or simply changing the way `dgemm_` is called, resolves the problem.

>> Could you please confirm whether you mean that it works fine in the 2024.0 version of oneMKL?

 

simply changing the way `dgemm_` is called resolves the problem.
>> Could you please elaborate a bit on this?

 

Thanks for sharing the sample reproducer. We will get back to you soon with our findings regarding the same.

 

Best Regards,

Shanmukh.SS

 

0 Kudos
ajz34
Beginner
2,555 Views

Hi Shanmukh.SS,

 

I can confirm that for this case, MKL 2024.0, 2023.2, and 2023.0 work fine; 2023.1 does not work properly.

All of these MKL libraries are the ones shipped by conda (defaults or conda-forge channels), not the ones shipped with oneAPI itself.

 

There is some new discussion in https://github.com/pyscf/pyscf/issues/2004 . In that issue, we currently believe this is a threading problem.

 

I also tried `MKL_VERBOSE=1`. The verbose output shows that for this case, if the number of threads is confined to 8, it prints something like

 

```
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f473bdb0,41,0x7f42f4427e30,41,0x7f42f870cbd0,0x7f42f005f6c0,41) 5.36us CNR:OFF Dyn:1 FastMM:1 TID:1 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f4738f90,41,0x7f42f4425010,41,0x7ffc87452f90,0x55e64adaf440,41) 8.13us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f47419f0,41,0x7f42f442da70,41,0x7f42f7708c50,0x7f42ec05f6c0,41) 6.10us CNR:OFF Dyn:1 FastMM:1 TID:3 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f4747630,41,0x7f42f44336b0,41,0x7f42f6704cd0,0x7f42e405dac0,41) 5.95us CNR:OFF Dyn:1 FastMM:1 TID:5 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f473ebd0,41,0x7f42f442ac50,41,0x7f42f7f0ac50,0x7f42e805f6c0,41) 5.48us CNR:OFF Dyn:1 FastMM:1 TID:2 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f474a450,41,0x7f42f44364d0,41,0x7f42f5f02d50,0x7f42d805f6c0,41) 6.42us CNR:OFF Dyn:1 FastMM:1 TID:6 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,36,0x7ffc87453080,0x7f42f4744810,41,0x7f42f4430890,41,0x7f42f6f06cd0,0x7f42e005dac0,41) 6.16us CNR:OFF Dyn:1 FastMM:1 TID:4 NThr:1
MKL_VERBOSE DGEMM(N,T,41,41,35,0x7ffc87453080,0x7f42f474d270,41,0x7f42f44392f0,41,0x7f42f5700d50,0x7f42dc060ac0,41) 6.97us CNR:OFF Dyn:1 FastMM:1 TID:7 NThr:1
```

 

The second line shows that the call on thread id (TID) 0 uses all 8 threads (NThr:8), racing with the other parallel regions.
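If the race indeed comes from that one call grabbing all 8 threads, a possible mitigation (a sketch under that assumption; `threadpoolctl` is a third-party package, and whether this helps on the actual PySCF code path is untested here) is to pin the BLAS thread pool to a single thread around the affected region:

```python
# Sketch of a mitigation, assuming the corruption comes from one DGEMM
# grabbing all 8 threads inside an already-parallel region: limit the
# BLAS thread pool to 1 around the affected calls with threadpoolctl
# (third-party package: pip install threadpoolctl).
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(0).standard_normal((41, 36))
b = np.random.default_rng(1).standard_normal((41, 36))

with threadpool_limits(limits=1, user_api="blas"):
    # Every BLAS call in this block runs single-threaded (NThr:1),
    # matching the well-behaved lines in the MKL_VERBOSE output above.
    c = a @ b.T
```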

 

I also found that although AMD CPUs generally give correct results for this case (meaning the output matrix is zero, as desired), the problem does affect AMD CPUs as well, in the sense that the threading issue also exists there.

0 Kudos
ShanmukhS_Intel
Moderator
2,474 Views

Hi Andrew,


Could you please try upgrading your conda version and get back to us if the issue persists?


Best Regards,

Shanmukh.SS


0 Kudos
ajz34
Beginner
2,465 Views

The conda version in the docker image is 23.10.0, which is quite new. The newest is currently 23.11.0, and the issue persists with it, so I don't think it is related to the version of conda itself.

 

As for MKL versions, MKL 2023.1.0 is still the newest in conda's defaults channel. When installing numpy by conda without any additional arguments, `mkl=2023.1.0=h213fc3f_46344` is also installed. The issue still shows after upgrading all conda packages in the defaults channel.

I've also tried other build numbers of MKL 2023.1.0 in the defaults channel (46343, 46342) and in the conda-forge channel (48680, 46349). The issue persists for all of them (see the package file listings on Anaconda.org). So I believe the problem lies in the MKL 2023.1.0 version itself.

 

Neither MKL 2023.0.0 nor 2023.2.0 with nearby build numbers (2023.0: 26648; 2023.2: 49572) shows this issue.

 

If I switch to the conda-forge channel and create an environment with `conda create -n tmp numpy libblas=*=*mkl -c conda-forge`, it installs `mkl=2023.2.0=h84fe81f_50496` by default. In this case, the issue does not persist after fully upgrading the conda packages.
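To double-check which MKL a given environment actually resolves to at runtime, here is a small sketch (the `mkl` module comes from the separate `mkl-service` package, which may not be installed in every environment):

```python
# Sketch: report which MKL the current environment is actually using.
# numpy.show_config() prints the BLAS/LAPACK configuration; mkl-service
# (conda package "mkl-service") reports the runtime MKL version string.
import numpy as np

np.show_config()  # look for "mkl" in the blas/lapack sections

try:
    import mkl  # provided by the mkl-service package
    print(mkl.get_version_string())
except ImportError:
    print("mkl-service not installed; runtime version string unavailable")
```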

0 Kudos
Mark_L_Intel
Moderator
2,323 Views

The issue you are reporting has been reproduced. Further investigation showed that the issue is LAPACK (dgesvd) related. Indeed, at some point the bug was inadvertently introduced, but it was later fixed. Our advice is to switch to the latest MKL version if you can. Thank you for reporting the issue. As for the MKL versions available in the different distribution channels, we will continue to look into this.

 

0 Kudos
Mark_L_Intel
Moderator
2,272 Views

Alternatively, you could rely on the Intel conda channel instead of the conda main channel. Intel's channel is up to date with the latest MKL version.

0 Kudos
Mark_L_Intel
Moderator
2,213 Views

Hello,

We have not heard back from you. Could you please provide us with an update on your issue? Did you find the post from 01-09-2024 useful?

0 Kudos
Mark_L_Intel
Moderator
2,071 Views

Hello,

@ajz34, we have not heard back from you, so this issue will no longer be monitored by Intel. Hopefully you are satisfied that the issue was confirmed on our side and is being worked on. Thank you for posting on the oneMKL Forum!

0 Kudos
ajz34
Beginner
2,048 Views

I had been detached from this thread for a while and missed the notifications due to some filter rules in my Outlook mail; sorry about that.

And thanks for the replies and suggestions! I hope your work goes smoothly.

0 Kudos
Reply