mecej4
Black Belt
92 Views

Tough and evasive bug in MKL DSS solver

Here is a very small program that solves two linear equations using the MKL DSS interface to Pardiso. First, the test program:

program ptmkl
use mkl_dss
implicit none
TYPE (MKL_DSS_HANDLE) :: handle
INTEGER opt,dss_err
INTEGER, PARAMETER :: NEQ=2, NNZM=NEQ*NEQ
INTEGER :: rowIDX(NEQ+1) = [1,3,5]
INTEGER :: COL(NNZM) = [1,2, 1,2]
INTEGER :: i,j,k, n = NEQ, nnz = NNZM, perm(NEQ)
DOUBLE PRECISION :: A(NNZM) = [1d0, -1d-2,   -1d-2, 1d0]
DOUBLE PRECISION :: B(NEQ) = [1d0, 2d0], X(NEQ)

opt=MKL_DSS_DEFAULTS
dss_err = dss_create(handle, opt)
write(*,10)'Create ',dss_err
dss_err = dss_define_structure(handle,opt,rowIDX,n,n,COL,nnz)
write(*,10)'Define ',dss_err
dss_err = dss_reorder(handle,opt,perm)
write(*,10)'ReOrder',dss_err

dss_err = dss_factor_real(handle,opt,A)
write(*,10)'Factor ',dss_err
dss_err = dss_solve_real(handle,opt,B,1,X)
write(*,10)'Solve  ',dss_err

10 format(A7,2x,I4)
end program ptmkl

I compile this program with IFort 15.0 IA-32 using the command

ifort /Qmkl /traceback /MD dssbug.f90

When I then run the program repeatedly, it works correctly very often but, once in a while, aborts with a C0000005 or C0000374 error. To track the problem down, I ran the program inside Inspector XE 2015, and the screenshot is attached.

This is a shorter reproducer for the problems reported by another user, see https://software.intel.com/en-us/forums/topic/535430 .

8 Replies
mecej4
Black Belt

Additional note : I compiled the file mkl_dss.f90 (which is installed in the MKL Include directory) all by itself in order to produce the file mkl_dss.mod, which is used in the reproducer.

The problem is encountered with other versions of MKL and IFort, as well. In fact, I have a modified version that I can build with CVF6.6 and the bug is present in the old CXML library.

Update, Nov. 15: I updated my installation to Fortran Composer 15.0 update 1 last night, which updated the MKL version to 11.2.1 Product Build 20141023. The bug is present in this version, too, and here are more details (from a 32-bit run) to help you with a diagnosis and fix.

The access violation, when it occurs, is always at the same location: mkl_intel_thread.dll, routine mkl_pds_invs_perm_mod_pardiso() + 0EC2H. The instruction is mov ecx, dword ptr [ecx+edx*4-4], where ECX is set equal to the base of the permutation index array, which is the 9th argument present at the function entry. The memory to which ECX points contains just two entries, with values 1 and 2 (the test problem has NEQ = 2), followed by lots of '0BADF00D'. All this is fine. However, EDX contains the index (1-based?) into the permutation index array, and when the crash happens it holds different values in different runs, but I have only seen values larger than 0600H. Such large values indicate an array bound error that is responsible for the access violation (remember, the iperm array has only two elements in the test problem, so at the exception point EDX should never have a value greater than 2).

Some of the statements in the last paragraph are speculative, since they are based on inspecting the disassembly listing in the debugger. I apologize for any incorrect speculation.

mecej4
Black Belt

This bug is still present in MKL 11.3. It is not evasive any more.

program dssbug
use mkl_dss
implicit none
TYPE (MKL_DSS_HANDLE) :: handle 
CHARACTER(LEN=198) :: vers
INTEGER opt,dss_err
INTEGER, PARAMETER :: NEQ=2, NNZM=NEQ*NEQ
INTEGER :: rowIDX(NEQ+1) = [1,3,5]
INTEGER :: COL(NNZM) = [1,2, 1,2]
INTEGER :: i,j,k, n = NEQ, nnz = NNZM, perm(NEQ)
DOUBLE PRECISION :: A(NNZM) = [1d0, -1d-2,   -1d-2, 1d0]
DOUBLE PRECISION :: B(NEQ)  = [1d0, 2d0], X(NEQ)
                             
call mkl_get_version_string(vers)
write(*,*)trim(vers)
opt=MKL_DSS_DEFAULTS
dss_err = dss_create(handle, opt)
write(*,10)'Create ',dss_err
dss_err = dss_define_structure(handle,opt,rowIDX,n,n,COL,nnz)
write(*,10)'Define ',dss_err
dss_err = dss_reorder(handle,opt,perm)
write(*,10)'ReOrder',dss_err

dss_err = dss_factor_real(handle,opt,A)
write(*,10)'Factor ',dss_err
dss_err = dss_solve_real(handle,opt,B,1,X)
write(*,10)'Solve  ',dss_err

10 format(A7,2x,I4)
end program dssbug

In the 32-bit build, the traceback fails to report the line number, and the access violation always occurs:

 Intel(R) Math Kernel Library Version 11.3.0 Product Build 20150730 for 32-bit a
 pplications
Create      0
Define      0
ReOrder     0
forrtl: severe (157): Program Exception - access violation
Image              PC        Routine            Line        Source
mkl_intel_thread.  611699C8  Unknown               Unknown  Unknown
libiomp5md.dll     5C6927E5  Unknown               Unknown  Unknown
libiomp5md.dll     5C65FAEC  Unknown               Unknown  Unknown
libiomp5md.dll     5C6313B8  Unknown               Unknown  Unknown
mkl_intel_thread.  611666C2  Unknown               Unknown  Unknown
mkl_core.dll       5D019C8C  Unknown               Unknown  Unknown
mkl_core.dll       5CF2E09C  Unknown               Unknown  Unknown
mkl_core.dll       5CF16F9A  Unknown               Unknown  Unknown
mkl_core.dll       5CECC509  Unknown               Unknown  Unknown
mkl_core.dll       5CEA81EE  Unknown               Unknown  Unknown
mkl_core.dll       5CE676E8  Unknown               Unknown  Unknown
mkl_core.dll       5CE4C428  Unknown               Unknown  Unknown
mkl_core.dll       5CE4C01C  Unknown               Unknown  Unknown
ntdll.dll          777526BB  Unknown               Unknown  Unknown

In the 64-bit build, the line number is reported, but once in a while no traceback is printed and a WER (Windows Error Reporting) action is triggered instead.

 Intel(R) Math Kernel Library Version 11.3.0 Product Build 20150730 for Intel(R)
  64 architecture applications
Create      0
Define      0
forrtl: severe (157): Program Exception - access violation
Image              PC                Routine            Line        Source
mkl_intel_thread.  00007FFE9B3CBD97  Unknown               Unknown  Unknown
mkl_core.dll       00007FFE99FB3FF7  Unknown               Unknown  Unknown
mkl_core.dll       00007FFE99F94CCC  Unknown               Unknown  Unknown
mkl_core.dll       00007FFE99F1C6CD  Unknown               Unknown  Unknown
mkl_core.dll       00007FFE99EED525  Unknown               Unknown  Unknown
dssbug.exe         00007FF74CC811F9  MAIN__                     21  dssbug.f90
dssbug.exe         00007FF74CC829CE  Unknown               Unknown  Unknown
dssbug.exe         00007FF74CC82D83  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFEC0322D92  Unknown               Unknown  Unknown
ntdll.dll          00007FFEC1FB9F64  Unknown               Unknown  Unknown

In both cases, I compiled with /traceback /MD /Qmkl. If the access violation is caused by an error in the arguments passed or by a wrong sequence of calls, one would like to know the specific error so that it can be avoided. In fact, I suspect that passing opt=MKL_DSS_DEFAULTS in all the DSS calls is probably not correct, and the documentation should make it clear if that is the case. If there is no linear-equations case for which MKL_DSS_DEFAULTS is consistently correct, perhaps "DEFAULTS" is not a good choice of label.
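One way to test that suspicion is to state the matrix structure explicitly in dss_define_structure instead of relying on MKL_DSS_DEFAULTS. The following is only a sketch of such a variant of the reproducer, not a confirmed fix. It assumes that MKL_DSS_NON_SYMMETRIC (declared in mkl_dss.f90) is the appropriate option when both triangles are stored in rowIDX/COL. The program name dss_nonsym and the internal subroutine check are hypothetical additions that stop at the first call returning something other than MKL_DSS_SUCCESS.

```fortran
program dss_nonsym
use mkl_dss
implicit none
TYPE (MKL_DSS_HANDLE) :: handle
INTEGER :: opt, dss_err
INTEGER, PARAMETER :: NEQ=2, NNZM=NEQ*NEQ
INTEGER :: rowIDX(NEQ+1) = [1,3,5]
INTEGER :: COL(NNZM) = [1,2, 1,2]
INTEGER :: n = NEQ, nnz = NNZM, perm(NEQ) = 0
DOUBLE PRECISION :: A(NNZM) = [1d0, -1d-2,   -1d-2, 1d0]
DOUBLE PRECISION :: B(NEQ) = [1d0, 2d0], X(NEQ)

opt = MKL_DSS_DEFAULTS
dss_err = dss_create(handle, opt)
call check('Create ', dss_err)

! Assumption: a pattern that stores both triangles is declared non-symmetric.
opt = MKL_DSS_NON_SYMMETRIC
dss_err = dss_define_structure(handle, opt, rowIDX, n, n, COL, nnz)
call check('Define ', dss_err)

! The remaining calls keep the default options, as in the reproducer.
opt = MKL_DSS_DEFAULTS
dss_err = dss_reorder(handle, opt, perm)
call check('ReOrder', dss_err)
dss_err = dss_factor_real(handle, opt, A)
call check('Factor ', dss_err)
dss_err = dss_solve_real(handle, opt, B, 1, X)
call check('Solve  ', dss_err)
write(*,*) 'X = ', X

! Release the solver's internal storage.
dss_err = dss_delete(handle, opt)
call check('Delete ', dss_err)

contains

! Hypothetical helper: print the return code and stop on the first failure.
subroutine check(label, ierr)
  character(len=*), intent(in) :: label
  integer, intent(in) :: ierr
  write(*,10) label, ierr
  if (ierr /= MKL_DSS_SUCCESS) stop 1
10 format(A7,2x,I4)
end subroutine check

end program dss_nonsym
```

Checking each return code does not prevent a crash inside MKL, but it does rule out the case where a later call runs on a handle left in a bad state by an earlier failed call.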

Gennady_F_Intel
Moderator

Thanks, mecej4. We missed this problem, and I see a similar issue on my side too. Escalated. --Gennady

mecej4
Black Belt

Thanks for looking at this bug report (first made eleven months ago). I just tested the program of #3 on Linux, and I do not see the error, whether I use 32- or 64-bits with IFort 16.0. The final solution in X(:) has the correct values, as well.

simon932
Beginner

Any updates on this issue? I'm having a similar issue with Pardiso (with either the DSS interface or the standard Pardiso interface) where I will get an access violation error when executing numerical factorization (calling dss_factor_real in the dss interface). What is odd is that the exact same code will run just fine on another machine and yield the expected result. A hint to where the problem comes from or a workaround would be very nice.

mecej4
Black Belt

The bug is still present in Parallel Studio 16 Update 1 with MKL 11.3.1 (64-bit, Windows and Linux).

Gennady_F_Intel
Moderator

Yes, the fix will be available in the upcoming Update 2 of MKL 11.3.

92 Views

Hi all,

The default DSS parameter means a symmetric matrix, for which only the upper triangle needs to be supplied. After changing the full matrix pattern in the presented reproducer to the upper triangle only, it ran correctly.
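For illustration, the change described here could look roughly like the sketch below. It is a reading of this reply, not code posted in the thread. The program name dss_upper and the parameter NNZU are hypothetical. It keeps MKL_DSS_DEFAULTS in every call and stores only the upper triangle of the same 2x2 matrix, so rowIDX, COL and A shrink from four entries to three.

```fortran
program dss_upper
use mkl_dss
implicit none
TYPE (MKL_DSS_HANDLE) :: handle
INTEGER :: opt, dss_err
! Only the upper triangle (diagonal included) is stored: 3 nonzeros.
INTEGER, PARAMETER :: NEQ=2, NNZU=3
INTEGER :: rowIDX(NEQ+1) = [1,3,4]        ! row 1 has 2 entries, row 2 has 1
INTEGER :: COL(NNZU) = [1,2, 2]
INTEGER :: n = NEQ, nnz = NNZU, perm(NEQ) = 0
DOUBLE PRECISION :: A(NNZU) = [1d0, -1d-2,   1d0]
DOUBLE PRECISION :: B(NEQ) = [1d0, 2d0], X(NEQ)

opt = MKL_DSS_DEFAULTS
dss_err = dss_create(handle, opt)
write(*,10)'Create ',dss_err
dss_err = dss_define_structure(handle,opt,rowIDX,n,n,COL,nnz)
write(*,10)'Define ',dss_err
dss_err = dss_reorder(handle,opt,perm)
write(*,10)'ReOrder',dss_err

dss_err = dss_factor_real(handle,opt,A)
write(*,10)'Factor ',dss_err
dss_err = dss_solve_real(handle,opt,B,1,X)
write(*,10)'Solve  ',dss_err
write(*,*) 'X = ', X

! Release the solver's internal storage.
dss_err = dss_delete(handle,opt)

10 format(A7,2x,I4)
end program dss_upper
```

The exact solution of this system is X = [1.02/0.9999, 2.01/0.9999], which gives a quick way to check the values printed after the solve step.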

Thanks,

Alex 
