Error mkl_avx2.2.dll when calculating large array

nvh10 · ‎08-21-2022

I need to compute A*P*A' with the MKL function mkl_sparse_d_syprd.

In the example below, A is the identity matrix and P is a matrix whose elements are all one. For small arrays (nstate=100 or 1000) the code runs fine. However, for nstate=50000 or larger, the following error appears: "Exception thrown at 0x00007FFDFB353C57 (mkl_avx2.2.dll) in Console22.exe: 0xC0000005: Access violation reading location 0x0000025986EEA108."

Please help me solve this problem. Thank you very much!

    ! A = identity matrix
    ! P = matrix with all elements equal to one
    
    program test_spblas
    use mkl_spblas
    implicit none
    integer, parameter :: nstate = 50000
    double precision,allocatable, dimension (:,:):: P,APAT
    integer, allocatable, dimension (:):: c_A,pB_A,pE_A
    double precision, allocatable, dimension (:):: v_A
    integer stat,nnz_A,i
    type(sparse_matrix_t) :: A_s
    nnz_A=nstate
    allocate(v_A(nnz_A),c_A(nnz_A),pB_A(nstate),pE_A(nstate))
    allocate(P(nstate,nstate),APAT(nstate,nstate))
    do i=1,nstate
        pB_A(i)=i
        pE_A(i)=i+1
        c_A(i)=i
    enddo
    v_A=1d0
    P=1d0
    APAT=0d0
    stat = mkl_sparse_d_create_csr(a_s,sparse_index_base_one,nstate,nstate,pb_a,pe_a,c_a,v_a)
    stat = mkl_sparse_d_syprd(sparse_operation_non_transpose, a_s, p, sparse_layout_column_major, nstate, &
                              1d0, 0d0, apat, sparse_layout_column_major, nstate)
    end program test_spblas

VidyalathaB_Intel · ‎08-22-2022

Hi,

Thanks for reaching out to us.

Could you please try running the code from Intel oneAPI command prompt and see if it is working there?

Please do let us know the MKL version with which you are working.

Regards,

Vidya.

nvh10 · ‎08-22-2022

This is my mkl version:

"Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications"

nvh10 · ‎08-22-2022

This is what I got from the oneAPI command prompt:

ifort test_spblas.f90 /Qiopenmp /Qopenmp-targets:spir64 /module:"D:\Fortran\oneAPI\mkl\2021.2.0\include\intel64\ilp64" /DMKL_ILP64 /4I8 -I"D:\Fortran\oneAPI\mkl\2021.2.0\include" /MD /fpp

ifort: command line warning #10148: option '/Qiopenmp' not supported
ifort: command line warning #10148: option '/Qopenmp-targets:spir64' not supported
test.f90(30): error #6633: The type of the actual argument differs from the type of the dummy argument. [PB_A]
stat = mkl_sparse_d_create_csr(a_s,sparse_index_base_one,nstate,nstate,pb_a,pe_a,c_a,v_a)
---------------------------------------------------------------------------^
test.f90(30): error #6633: The type of the actual argument differs from the type of the dummy argument. [PE_A]
stat = mkl_sparse_d_create_csr(a_s,sparse_index_base_one,nstate,nstate,pb_a,pe_a,c_a,v_a)
--------------------------------------------------------------------------------^
test.f90(30): error #6633: The type of the actual argument differs from the type of the dummy argument. [C_A]
stat = mkl_sparse_d_create_csr(a_s,sparse_index_base_one,nstate,nstate,pb_a,pe_a,c_a,v_a)
-------------------------------------------------------------------------------------^

nvh10 · ‎08-22-2022

This is my last try. I used this command:

ifort /DMKL_DIRECT_CALL /fpp test.f90 mkl_intel_lp64.lib mkl_core.lib mkl_intel_thread.lib /Qopenmp -I"D:\Fortran\oneAPI\mkl\2021.2.0\include"/include

For nstate=10000, it works.

But for nstate=50000, it reports:

forrtl: severe (157): Program Exception - access violation
Image              PC                Routine            Line        Source
test.exe           00007FF7361CC128  Unknown               Unknown  Unknown
test.exe           00007FF736169AE2  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFAE04B65D3  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFAE0409877  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFAE040B54C  Unknown               Unknown  Unknown
libiomp5md.dll     00007FFAE03C4CE1  Unknown               Unknown  Unknown
test.exe           00007FF736169477  Unknown               Unknown  Unknown
test.exe           00007FF736168012  Unknown               Unknown  Unknown
test.exe           00007FF736151818  Unknown               Unknown  Unknown
test.exe           00007FF7361D21BE  Unknown               Unknown  Unknown
test.exe           00007FF7361D2584  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFB50CB7034  Unknown               Unknown  Unknown
ntdll.dll          00007FFB51C62651  Unknown               Unknown  Unknown

Spencer_P_Intel · ‎02-17-2023

Hi nvh10,

It looks like there are a few things going on here, but the main reason for the overflow at nstate=50000 and not at nstate=10000 is the use of lp64. With lp64, pointers are 64-bit addresses, but integers are only 32 bits wide. Additionally, the internal implementation is written in C, so any 2D Fortran arrays are collapsed and treated like 1D C arrays (you can see this in the module file, where they are declared with DIMENSION(*)). This normally works out, but it is to our disadvantage here when dealing with integers and offsets...

It turns out that the range of int32 is [-2147483648, 2147483647], and the largest N such that N*N is still in this range is N=46340. So for 50000, if we compute something like C[row * ldc + col] and row, ldc and col are 32-bit integers, then it is possible (likely) that the result overflows and ends up negative, then gets upcast (still negative) in some way in the address offset computation, which results in a seg fault. There are things we can do internally to make sure these addresses are computed using 64-bit integers, which are compatible with the 64-bit addresses, and we will do this more carefully in the product. Otherwise, we need to be careful about this.
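The limit described above can be checked with a small standalone program. This is only a sketch of the arithmetic, not MKL's internal code: the program name, the variable names, and the extra test size 46341 are chosen here for illustration. It computes N*N in 64-bit arithmetic and compares the result with the largest int32 value.

```fortran
program offset_overflow_check
    use iso_fortran_env, only: int32, int64
    implicit none
    ! Sizes from the thread, plus 46340 and 46341 to bracket the int32 limit.
    integer(int32), parameter :: sizes(4) = [integer(int32) :: 10000, 46340, 46341, 50000]
    integer(int64), parameter :: int32_max = int(huge(1_int32), int64)
    integer(int64) :: n, square
    integer :: k

    do k = 1, size(sizes)
        n = int(sizes(k), int64)
        ! Widen to 64 bits before multiplying so the product cannot wrap.
        square = n * n
        if (square <= int32_max) then
            print '(i0,a,i0,a)', sizes(k), '**2 = ', square, ' fits in int32'
        else
            print '(i0,a,i0,a)', sizes(k), '**2 = ', square, ' exceeds int32'
        end if
    end do
end program offset_overflow_check
```

Of the four sizes, 10000 and 46340 stay within the int32 range, while 46341 and 50000 do not, which matches the observation that nstate=10000 runs and nstate=50000 crashes.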

We are still looking into some other aspects of the ilp64 solution, where it appears that ldb and ldc are sometimes incorrect when they reach our internal kernels. We will post an update once more is understood.

Hope this helps a bit so far.

Spencer

Spencer_P_Intel · ‎02-22-2023

OK, here are the rest of the details. It turns out that there was an additional issue in the mkl_spblas.f90 module file for mkl_sparse_x_syprd which prevented the ilp64 version from working properly. We use ISO_C_BINDING to map the Fortran API you are calling to a C function implemented internally. In the case of mkl_sparse_x_syprd, two input arguments, ldb and ldc, were being mapped incorrectly.

If you change

    INTEGER(C_INT) , INTENT(IN) :: ldb
    INTEGER(C_INT) , INTENT(IN) :: ldc

to

    INTEGER, INTENT(IN) :: ldb
    INTEGER, INTENT(IN) :: ldc

then everything will work as desired. The C_INT kind always maps to a 4-byte integer, but ldb and ldc should be 4-byte or 8-byte integers depending on whether the compiler option that makes the default INTEGER 8 bytes in Fortran is used ( -i8 on Linux/macOS, /4I8 on Windows ).
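To see the mismatch on your own system, a minimal sketch like the following prints the size of both integer kinds. The program and variable names are made up for illustration. Compile it once without and once with -i8 (or /4I8) and compare the output.

```fortran
program integer_kind_check
    use iso_c_binding, only: c_int
    implicit none
    integer        :: default_int   ! size follows -i8 or /4I8
    integer(c_int) :: c_style_int   ! interoperable with C int, unaffected by -i8

    ! bit_size is an inquiry function, so the variables need no value.
    print '(a,i0)', 'bytes in default INTEGER: ', bit_size(default_int) / 8
    print '(a,i0)', 'bytes in INTEGER(C_INT):  ', bit_size(c_style_int) / 8
end program integer_kind_check
```

If the two numbers differ, an interface that declares ldb and ldc as INTEGER(C_INT) no longer matches the 8-byte integers an ilp64 build passes for them.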

You can make these changes to the module file yourself if you need them immediately for another project. They will also be fixed in the next oneMKL release (likely oneMKL 2023.1). Thank you for sharing this issue so that we could fix it.

VidyalathaB_Intel · ‎08-29-2022

Hi,

Thanks for sharing the details.

The issue is reproducible from our end as well.

We are working on this issue and will get back to you soon.

Regards,

Vidya.

nvh10 · ‎09-02-2022

Thank you very much for helping me!

Ruqiu_C_Intel · ‎02-28-2023

Thank you again for raising the issue. The fix will be available in the next oneMKL release.