Intel® Fortran Compiler

eigensolver code crashes on AMD processor

Brian_Murphy
New Contributor II

I am using PARDISO with ARPACK's Arnoldi eigensolver.  The code has been in use by over 100 users for several years.  I'm getting reports that the code crashes on AMD Ryzen systems.  Is there anything in particular that might be causing this?  I sent a debug build to a user, but it revealed nothing: it crashed in the same way, with no messages.

I'm using Visual Studio 2019.  My ifort compiler command line is as follows:

/nologo
/O2
/I"C:\Users\Me\Documents\Visual Studio 2019\Projects\Xlrotor\ARPACK\LAPACK\x64\Debug"
/I"C:\Users\Me\Documents\Visual Studio 2019\Projects\Xlrotor\Umfpack\x64\Debug"
/I"C:\Users\Me\Documents\Visual Studio 2019\Projects\Xlrotor\ARPACK\BLAS\x64\Debug"
/I"C:\Users\Me\Documents\Visual Studio 2019\Projects\Xlrotor\ARPACK\UTIL\x64\Debug"
/I"C:\Users\Me\Documents\Visual Studio 2019\Projects\Xlrotor\ARPACK\SRC\x64\Debug"
/extend_source:132
/module:"x64\Release\\"
/object:"x64\Release\\"
/Fd"x64\Release\\vc160.pdb"
/libs:static
/threads
/c

I've read about a compiler option /Qimf-arch-consistency in this thread.  If I try this option, should I set it to true or false?

Thanks,

Brian Murphy

Steve_Lionel
Honored Contributor III

The error reported was an illegal instruction, not access violation.

mecej4
Honored Contributor III

Thanks, I have corrected the previous post.

Brian_Murphy
New Contributor II

Can you please give an example of an "illegal instruction"?

On a separate note, I tested a build of my program with today's IVF and MKL (i.e. not my custom MKL DLL), and the Arnoldi eigensolver still does not behave as it did with the older MKL (i.e. it does not produce the same eigenvalues with and without the inclusion of eigenvectors).  Bummer.

andrew_4619
Honored Contributor III

Just to be clear, are you saying there is a bug in the current MKL routines? Is it a known bug? If not, you should post a reproducer in the MKL forum.

mecej4
Honored Contributor III

If you get an "Illegal Instruction" abort when running a program that was compiled from a high-level programming language, it is, as explained in this old forum thread, the equivalent of "the barbarians have crossed the gates and are in". The causes can be many; the compiler, the RTL and the OS try hard to catch the error earlier, which is why we rarely see this message. As explained in this old thread in this forum, the cause is probably stack corruption.

There can be many causes of stack corruption, and it is not useful to use a magnifying glass to look at the actual illegal instruction at the machine code level. You have to create a small program that demonstrates the error, and provide it here.

Similarly,

[Forum managers:  Problems related to old forum posts that need fixing:

1. Many links in old forum posts no longer work, and may take one to an irrelevant Intel web page. The link to Dr. Fortran's "Don't blow your stack" article, given in Kevin Davis's post, for example, is bad!

2. In a thread with many posts such as the current thread (over 40 here), it is almost necessary to have post reference numbers. The older version of this forum allowed one to see "#14" as the sequence number of the fourteenth post in the thread, and these sequence numbers made navigation painless. Many old posts in this Forum still contain such sequence numbers in the bodies of posts, but no such reference number is available in the initial lines of each post. Try to find post #13 in the current thread, for example, and note that different readers may read different time stamps for one and the same post, depending on their profile timezone setting.]

Steve_Lionel
Honored Contributor III

Re: "illegal instruction"

It is common for new generations of processors to introduce new sets of instructions, where it has been found that common operations formerly done with sequences of instructions can be done faster with a new one. Both Intel and AMD have done this over the years, but more recently AMD has simply adopted the new instruction sets Intel introduces. On the Intel side, there were SSE2, SSE3, SSE4, AVX, AVX2, and AVX-512. Intel has more recently added smaller subsets of instructions, not enabled on all processors. If the CPU comes across an instruction it doesn't support, you get this error.

This case is a bit weird, though, in that an older processor runs the code and some newer ones don't.  It may be 1) the newer processors don't support a particular instruction correctly, or 2) the program is jumping into something that isn't a valid instruction. If the error is reliably reproducible, it should be possible to first identify the particular instruction it is complaining about, see if it is supposed to be valid, and if not, figure out how it got there. 

I'm leaning towards choice 2 here, as I am fairly confident that MKL doesn't do CPU dispatch for AMD processors.
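To make the "CPU dispatch" idea concrete, here is a minimal Python sketch of how a library might choose a code path from the features a CPU reports. It is only an illustration under stated assumptions: the feature labels are simplified names taken from the list above, and `pick_code_path` and `can_run` are hypothetical helpers, not anything MKL or the compiler runtime actually exposes.

```python
# Hypothetical sketch of CPU dispatch. The labels and function names are
# illustrative only and do not correspond to any real MKL or compiler API.

# Instruction-set levels named in the post, oldest first.
LEVELS = ["sse2", "sse3", "sse4", "avx", "avx2", "avx512"]


def pick_code_path(cpu_features):
    """Return the newest level in LEVELS that the CPU reports, else 'baseline'."""
    best = "baseline"
    for level in LEVELS:
        if level in cpu_features:
            best = level
    return best


def can_run(required, cpu_features):
    """A code path built for `required` is safe only if the CPU reports it."""
    return required == "baseline" or required in cpu_features


# Assumed feature set for an imaginary CPU without AVX-512.
features = {"sse2", "sse3", "sse4", "avx", "avx2"}
chosen = pick_code_path(features)
```

If a program skips this kind of check and executes a path the CPU does not support (here, `can_run("avx512", features)` is false), the processor raises an illegal-instruction fault instead of running the code.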

Brian_Murphy
New Contributor II

In response to Andrew's question about whether this is a bug in MKL: whether the change in behavior counts as a bug could be considered a matter of opinion (or a case of splitting hairs).

When Arnoldi finishes its iteration for finding eigenvalues, it calls a wrap-up routine (dneupd) to prepare the eigenvalues for return to the calling program.  If eigenvectors have not been requested, the eigenvalues are simply copied from work arrays to the calling arguments.  If eigenvectors are requested, dneupd does additional work to prepare them, and for this it calls a LAPACK routine named DLAHQR.  The behavior of DLAHQR is where the old and new versions differ.  If I'm right about that, this is really about LAPACK rather than MKL, but I'm not totally sure.  DLAHQR recomputes the eigenvalues from a Hessenberg matrix.  In the old version of DLAHQR, the recomputed eigenvalues exactly match those determined by the Arnoldi iteration, but not so with the new version.  In the big picture, the differences in the eigenvalues are small, but they foul up other logic used elsewhere in my program.

The source codes of the old and new DLAHQR have way too many differences for me to tell what happened.
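One way to make downstream logic less sensitive to such small differences is to compare the two sets of eigenvalues with a tolerance instead of expecting an exact match. Below is a minimal Python sketch of that idea. The function name `match_eigenvalues` and the tolerance values are assumptions chosen for illustration; nothing here is part of ARPACK, LAPACK, or MKL.

```python
import cmath


def match_eigenvalues(reference, recomputed, rel_tol=1e-9, abs_tol=1e-12):
    """Pair each reference eigenvalue with the closest unused recomputed one.

    Returns a list of (reference, recomputed) pairs, or None if some
    reference value has no partner within the given tolerance.
    The tolerances are illustrative defaults, not recommended values.
    """
    remaining = list(recomputed)
    pairs = []
    for ref in reference:
        if not remaining:
            return None
        # Nearest candidate in the complex plane.
        closest = min(remaining, key=lambda z: abs(z - ref))
        if not cmath.isclose(ref, closest, rel_tol=rel_tol, abs_tol=abs_tol):
            return None
        pairs.append((ref, closest))
        remaining.remove(closest)
    return pairs


# Made-up values: the same spectrum, reordered and perturbed slightly.
from_iteration = [complex(-1.0, 2.0), complex(-1.0, -2.0), complex(3.0, 0.0)]
from_wrapup = [
    complex(3.0 + 1e-13, 0.0),
    complex(-1.0, -2.0 + 1e-12),
    complex(-1.0 + 1e-12, 2.0),
]
pairs = match_eigenvalues(from_iteration, from_wrapup)
```

With this approach the order of the values and last-digit differences no longer matter, while a genuinely different eigenvalue still makes the function return None. A suitable tolerance depends on the problem's scaling and conditioning.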

Brian_Murphy
New Contributor II

A possible solution to my AMD crash problem may be simply to build myMKL_x64.DLL with a newer version of IVF and MKL.  I've done this with Intel® Visual Fortran Compiler – extension version 19.1.0057.16, Package ID: w_comp_lib_2020.2.254, and sent the DLL to a user for testing.

I have IVF 2023.2 on another development system, but I need help figuring out how to build myMKL_x64.DLL on that system.  I will use an earlier thread for that in the MKL forum. 

Brian_Murphy
New Contributor II

I am happy to report that my user with a Ryzen 9 7950X says the crash was eliminated with the myMKL_x64.DLL built with IVF 19.1.

In addition, my user with a Ryzen 7 PRO 5875U has reported the same success.
