Incorrect eigenvectors from ZHEEV in MKL using multi-threading

Michel_Peters · 04-18-2011

Hi all,

I have run into orthonormality problems with the eigenvectors of a complex Hermitian matrix computed by ZHEEV in MKL when running in parallel via OpenMP (although I am not quite sure the OpenMP part actually makes sense, see below).

Short description:
The routine fails to return correctly orthonormalized eigenvectors when the environment variable $OMP_NUM_THREADS is set to a value larger than one, i.e. when more than one core of a single node is used.

I was able to reproduce this bug with the Intel Fortran compiler (64 bit), versions 11.1.056 and 11.1.059 (on two different machines), but for some reason could not reproduce it with version 11.1.073 on a third machine.

Is this a well-known bug in MKL that was fixed in later versions, in which case I should recommend that the admins of my cluster upgrade asap, or am I doing something wrong?

I have provided a zipped archive containing a simple program that reproduces the bug. It performs these steps:
1) Creation of an arbitrary complex Hermitian matrix.
2) A check to make sure the matrix is actually Hermitian.
3) Computation of the eigenvectors and eigenvalues with ZHEEV, through a call to diagonalize, a home-made routine that only serves as an interface to MKL.
4) A check to make sure that the calculated vectors are orthonormal to within a tolerance of 1.d-10.
5) Any deviation from orthonormality prints a warning to the screen.
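
For readers without the archive, here is a minimal sketch of checks 2), 4) and 5) in plain Python. It is not the Fortran code from the archive: all function names are hypothetical, and a 2x2 Hermitian matrix with hand-derived eigenvectors stands in for the output of ZHEEV. Only the 1e-10 tolerance is taken from the description above.

```python
import math

TOL = 1e-10  # same tolerance as step 4 of the description


def is_hermitian(a, tol=TOL):
    """Step 2: check that a[i][j] equals conj(a[j][i]) for every entry."""
    n = len(a)
    return all(abs(a[i][j] - a[j][i].conjugate()) <= tol
               for i in range(n) for j in range(n))


def overlap_matrix(vecs):
    """S[i][j] = <v_i|v_j>, with the vectors stored as columns of vecs."""
    rows, cols = len(vecs), len(vecs[0])
    return [[sum(vecs[k][i].conjugate() * vecs[k][j] for k in range(rows))
             for j in range(cols)]
            for i in range(cols)]


def orthonormality_violations(vecs, tol=TOL):
    """Step 4: list every (i, j, S[i][j]) that deviates from the identity."""
    s = overlap_matrix(vecs)
    bad = []
    for i in range(len(s)):
        for j in range(len(s)):
            expected = 1.0 if i == j else 0.0
            if abs(s[i][j] - expected) > tol:
                bad.append((i, j, s[i][j]))
    return bad


# Assumed stand-in for the ZHEEV result: A = [[2, i], [-i, 2]] has the
# eigenvectors (1, i)/sqrt(2) and (1, -i)/sqrt(2), stored here as columns.
r = 1.0 / math.sqrt(2.0)
a = [[2 + 0j, 1j], [-1j, 2 + 0j]]
good_vecs = [[r + 0j, r + 0j],
             [1j * r, -1j * r]]

# A deliberately broken set: the second column is no longer orthogonal
# to the first, mimicking the symptom reported in this post.
bad_vecs = [[r + 0j, 1 + 0j],
            [1j * r, 0j]]

for label, vecs in (("good", good_vecs), ("bad", bad_vecs)):
    for i, j, value in orthonormality_violations(vecs):
        # Step 5: any deviation prints a warning.
        print(f"WARNING ({label}): <v{i}|v{j}> = {value}")
```

With the correct eigenvectors nothing is printed; with the broken set the two off-diagonal overlaps are reported, which is the kind of output the test program produces in the multi-threaded runs.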

There is also a logical variable that enables printing the content of every relevant matrix to the screen at any point of the execution, just in case. With $OMP_NUM_THREADS=1 everything runs normally and hardly anything is displayed on the screen. Any other value generates several warnings, and the eigenvectors seem to be completely wrong.

I hope I was able to express the problem in a clear manner. Do not hesitate to ask for any clarifications. Thank you very much for your time!
Michel

Description of important files in the archive:
main.f90 - the main program that reproduces the issue. It includes a single subroutine, diagonalize, which in turn calls ZHEEV.
makefile - what I use to compile the code and its modules. The command "make" produces the executable (if the environment variables are correctly set). "make clean" removes all binaries before a fresh compilation.
basics.f90 - a home-made module to facilitate the declaration of variables. I don't think it is in any way related to this issue, but since we use it very often, I wanted to make sure there was no interference, so I included it in the program.

(edit) Please disregard the submit.sh file included in the archive. It was only added for discussions with the admins of my cluster, and I completely forgot to remove it.

Gennady_F_Intel · 04-18-2011

Hello Michel,

Did you check how it works when the serial versions of the MKL libraries are linked?

-Gennady

mecej4 · 04-18-2011

I do not have the specific IFort and MKL versions that you listed, but the example code ran with no errors with several versions (Itanium IFort 10.1, and also 11.0.069 and 11.1.073 on SUSE Linux X64).

On Windows using 11.1, I found that when compiling with the openmp option I needed to specify a stack size sufficiently larger than the default value to avoid seg-faults.

Aleksandr_Z_Intel · 04-19-2011

Hi Michel. I also can't reproduce the issue with Intel Compiler 11.1.073 and MKL 10.2 under Linux. But while playing with your code I found an issue in your link line (it does not look quite correct). The correct version can be obtained with the help of the Intel Math Kernel Library Link Line Advisor ( http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ ).

W.B.R.
Alex Zotkevich

Michel_Peters · 04-19-2011

Thank you all for this precious input.

So according to your answers, there is no such bug in these versions of MKL, at least as far as what I observed is concerned.

There might be something wrong with the way the linking is performed; you know, we are merely simple users of this somewhat complicated suite.

That link line came from a coworker, was written for a different machine than the one I am using right now, and dates from some time ago. Something might have changed since then, with dramatic consequences...

I will experiment with the Link Line Advisor you suggested; it seems like a promising option.

I did not try the serial versions of MKL, but if everything else fails, I might give them a try. I will update with some info as soon as I can.
Michel

Michel_Peters · 04-19-2011

Hi Gennady,

I tried the sequential versions of MKL, and that seems to do the trick. None of the configurations using the multi-threaded libraries got rid of the "bug". Does that imply that threading is related to this issue?

I am not sure whether this can be a solution to this "annoyance", as we really would like to make the most of these powerful multi-core nodes, even if the execution time is pretty short for this simple example.

Still, I wanted to share my observations on this, as they may help to find out what went wrong.
Michel

Michel_Peters · 04-19-2011

Quoting mecej4

I do not have the specific IFort and MKL versions that you listed, but the example code ran with no errors with several versions (Itanium IFort 10.1, and also 11.0.069 and 11.1.073 on SUSE Linux X64).

On Windows using 11.1, I found that when compiling with the openmp option I needed to specify a stack size sufficiently larger than the default value to avoid seg-faults.

Hi mecej4,

Seg-faults were not a problem for me, at least under Linux (I cannot say for Windows), unless I increased the matrix size to dimension 1024.

In that case, I could identify that the seg-faults occurred at line 105 of the main program (that is, right after the diagonalization was performed), when the overlap matrix is calculated.

Replacing the matmul line by an equivalent do-loop operation got rid of those seg-faults.

For some reason, increasing the stack size to the order of gigabytes did not seem to help in that respect.

The code I provided used a dimension of 256, which was the smallest value for which I was able to reproduce the previously reported behavior.