Solved: SuperLU 5.3 with Intel oneAPI icl and ifort: uppercase linking, OpenMP problem, header problem in TESTING.

tat0x · ‎11-26-2022

Hello

I am testing Intel oneAPI Fortran 2021.7.0, the C compiler, and MKL 2022 on Windows 11 on my new laptop (i5-12450H). I am not very expert and am still learning parallel programming. About SuperLU 5.3.0 from GitHub (xiaoyeli): before jumping to SuperLU_MT, I have some questions:

1. Fortran calls routines by uppercase names, while the C side defaults to lowercase names with a trailing underscore (lowercase_). slu_Cnames.h contains options for several platforms (ifdef ADD_, ADD__, UPCASE, CRAY). Somehow I can't get the C compiler to use the existing header file to produce uppercase names.

I tried

-DUPCASE and -Dlowercasename_=UPPERCASENAME

Neither works. I have to edit slu_Cnames.h to define only UPCASE, then it works.

So there is no way to do this with a compiler switch, without editing the header file, right?

2. There are several compilers now: ICX, ICL, IFORT, IFX.

I think icx is the clang-based one and icl is the classic one. icx somehow doesn't know /Qpar.

What is the difference between ifort and ifx?

3. I am trying to build SuperLU and its Fortran wrapper first, so I copied c_fortran_*.c into the SuperLU SRC folder,

since they use the same includes and the same C files. I ran icl *.c /c with the modified slu_Cnames.h that contains only the uppercase names, then xilib *.obj /out:superlu.lib. This lib was then copied to the FORTRAN folder.

ifort f77_main.f hbcode1.f superlu.lib /Qmkl:sequential /O3 works

f77_main < ..\EXAMPLE\g20.rua looks good

The OpenMP test behaves strangely. ifort test_openmp.f hbcode1.f superlu.lib /Qmkl:sequential /Qopenmp /O3 /Heap-arrays:0 /F:200000000 created test_omp.exe

I ran:

set OMP_THREAD_NUMS=2

test_omp.exe < ..\EXAMPLE\g20.rua

It seems to stall. It printed "before hbcode1" and "after hbcode1", then just stopped.

/Heap-arrays:0 does not help; the stack option /F:200000000 is needed for it to run, otherwise there is a stack overflow.

4. If I use a Ninja build, it stops when building the TESTING lib.

In the SuperLU root: mkdir build, cd build,

cmake -GNinja -DBLAS_VENDOR=Intel_lp64 -D"enable_blaslib_DEFAULT=OFF" -D"CMAKE_BUILD_TYPE=RELEASE" -D"BUILD_SHARED_LIBS=OFF" -D"enable_internal_blaslib=FALSE" -D"XSDK_ENABLE_Fortran=TRUE" ..

Running ninja then fails to find unistd.h. I can't find a proper unistd.h replacement.

jimdempseyatthecove · ‎11-27-2022

Please consult interoperability and BIND(C).

The heap arrays option is /heap-arrays:0 (not /Heap-arrays:0)

The /F<n> option sets the stack size for the main thread, and not necessarily that of the spawned threads.

You may need to combine /F20000000 with environment variable OMP_STACKSIZE=200M

*** However, it is not advisable to have 200MB stack size per thread. Either use heap arrays or, preferably, explicitly allocate the (very) large arrays.

While running with 2 threads may not present a problem now, consider running with 32, 64, 128, ... threads later.
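To put rough numbers on this advice, here is a small sketch of the arithmetic. It assumes OMP_STACKSIZE=200M means 200 MiB reserved for each worker thread, and that each thread gets one private copy of values(maxnz) and b(maxn) (both real*8) from the test program posted later in this thread. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed by one thread's private copies of values(maxnz) and
   b(maxn), assuming 8-byte reals (real*8). */
int64_t private_bytes_per_thread(int64_t maxnz, int64_t maxn)
{
    assert(maxnz >= 0 && maxn >= 0);
    return 8 * (maxnz + maxn);
}

/* Total stack reserved, in MiB, when each of nthreads threads is given
   stack_mib MiB of stack. */
int64_t total_stack_mib(int64_t stack_mib, int64_t nthreads)
{
    assert(stack_mib >= 0 && nthreads > 0);
    return stack_mib * nthreads;
}
```

With maxn = 10000 and maxnz = 100*maxn, private_bytes_per_thread(1000000, 10000) gives 8,080,000 bytes (about 7.7 MiB) per thread, while total_stack_mib(200, 128) gives 25,600 MiB of reserved stack for a 128-thread run.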

Jim Dempsey

Steve_Lionel · ‎11-27-2022

Without BIND(C), the Fortran standard makes no representations whatsoever about the case or decoration of global symbols. Conventions for these vary across compilers and platforms. Intel Fortran on Windows upcases names, on Linux and Mac downcases. Regularizing this was one of the major goals of the C interoperability features added in Fortran 2003 - with BIND(C), the name is downcased (assuming NAME= not used) and the decoration is the same as the "companion C processor".

Please do NOT use command line switches to change these behaviors - use the standard language features designed for that purpose.
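As an illustration of what this looks like from the C side, here is a minimal sketch. The routine scale_rhs is hypothetical (it is not part of SuperLU); the point is only that the C function keeps one plain name, and the Fortran interface in the comment names that symbol explicitly, so no upper-case alias or trailing underscore is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C routine meant to be called from Fortran through BIND(C).

   Matching Fortran interface (illustrative):

     interface
        subroutine scale_rhs(n, factor, b, info) bind(C, name="scale_rhs")
           use iso_c_binding, only: c_int, c_double
           integer(c_int), intent(in)    :: n
           real(c_double), intent(in)    :: factor
           real(c_double), intent(inout) :: b(*)
           integer(c_int), intent(out)   :: info
        end subroutine scale_rhs
     end interface

   Dummy arguments without the VALUE attribute arrive as pointers. */
void scale_rhs(const int *n, const double *factor, double *b, int *info)
{
    assert(n != NULL && factor != NULL && info != NULL);
    if (*n < 0) {            /* reject a negative length */
        *info = -1;
        return;
    }
    for (int i = 0; i < *n; ++i)
        b[i] *= *factor;     /* scale the right-hand side in place */
    *info = 0;
}
```

The same C object file can then be linked on Windows and Linux without recompiling it with a different naming macro, provided the Fortran caller uses the interface above.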

tat0x · ‎11-28-2022

Thank you

Jim and Steve

I will try using BIND(C) later in the Fortran subroutine.

Jim, I tried using the modified header and it ran successfully. However, the OpenMP test program stalls with more than 3 threads.

The gcc build ran successfully with 16 threads. OMP_STACKSIZE=200MB helps a little, but it gets stuck at 4 threads:

before hbcode1
after hbcode1
nthreads = 8
nsys = 6
n, nnz 400 1920

... stalled.

/heap-arrays:0 did not help. It seems to need a larger stack size; with /F:20000000 I still got a stack overflow. Maybe you can replicate the problem.

jimdempseyatthecove · ‎11-29-2022

>>however for the omp program test is stalled in threads > 3.

This is possibly due to the application consuming more virtual memory than the process can (or is permitted) use in physical memory. In this situation, the application enters paging mode (page-fault, swap some pages out, swap other pages in, continue).

i5-12450H has 12 threads

You mention a "new laptop" but do not say how much RAM is available (4GB, 8GB, 16GB).

>>/heap-arrays:0 not helped... still get stack overflow

I suggest you use /heap-arrays:0 /traceback with OMP_STACKSIZE undefined (IOW the default stack size). Then run until the stack overflows (adjust the thread count to force the error).

Then in the console window, you should see the source file and line number last executed prior to the error. This should give you an idea of where the error is located.

Second thing to consider: is your code, either intentionally or inadvertently, using nested parallelism?

If so, then with 4 first-level threads, each calling a procedure that enters a nested parallel region, each first-level thread creates (or re-uses the second time) a 4-thread team. IOW your program now requires 4*4 (16) threads. And if the cause of this nesting happens to be in a recursive procedure, then the next recursion will require 4*4*4 threads, the next recursion 4*4*4*4 threads, ...

gcc may default with OMP_NESTED=false, whereas ifort/ifx may default with OMP_NESTED=true.
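The growth described above is a plain power of the team size. A tiny sketch (the function name is made up for illustration):

```c
#include <assert.h>

/* Threads in use when every thread of a team of size `team` opens another
   parallel region of the same size, `depth` levels deep:
   depth 1 -> team, depth 2 -> team*team, and so on. */
long threads_at_depth(long team, int depth)
{
    assert(team > 0 && depth >= 0);
    long total = 1;
    for (int level = 0; level < depth; ++level)
        total *= team;
    return total;
}
```

For a team of 4 this gives 4, 16, 64 and 256 threads at depths 1 to 4. If nesting is not intended, the standard OpenMP environment variable OMP_MAX_ACTIVE_LEVELS=1 is one way to limit active parallel regions to a single level.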

**** MKL 2022

Note, MKL internally uses OpenMP.

Typical usage is:

A single-threaded application links with the multi-threaded MKL.

A multi-threaded application links with the serial (single-threaded) MKL.

Jim Dempsey

jimdempseyatthecove · ‎11-29-2022

BTW (consider this)

When the threaded MKL library is used .AND. MKL_NUM_THREADS is .NOT. specified, MKL will default to using the number of available hardware threads (12 in your case). Thus your "4-thread" application could be trying to run with 4*12=48 threads.

Jim Dempsey

tat0x · ‎11-29-2022

I set MKL_NUM_THREADS=4 and OMP_NUM_THREADS=4; it does not work and still stalls.

I used ifort test_omp.F hbcode1.f superlu.lib /Qmkl:sequential /Qopenmp /heap-arrays:0 /traceback /F:2000000

Here hbcode1.f is the matrix reader (from a text file), and test_omp.F is the SuperLU test program that does the calculation. It calls c_fortran_dgssv.c to use the SuperLU lib.

forrtl: severe (170): Program Exception - stack overflow
Image PC Routine Line Source
test_omp.exe 00007FF75A7C7A07 Unknown Unknown Unknown
test_omp.exe 00007FF75A70100D MAIN__ 4 test_omp.F
test_omp.exe 00007FF75A7B584E Unknown Unknown Unknown
test_omp.exe 00007FF75A7C7C10 Unknown Unknown Unknown
KERNEL32.DLL 00007FFAD768244D Unknown Unknown Unknown
ntdll.dll 00007FFAD868DFB8 Unknown Unknown Unknown

It fails right at the start of the program.

It looks like a stack overflow on entry to the main program?

Release version 5.3.0 · xiaoyeli/superlu · GitHub

And here is the test_omp.F

! A simple OpenMP example to use SuperLU to solve multiple independent linear systems.
! Contributor: Ed D'Azevedo, Oak Ridge National Laboratory
!
program tslu_omp
implicit none
integer, parameter :: maxn = 10*1000
integer, parameter :: maxnz = 100*maxn
integer, parameter :: nsys = 6 !! 64

real*8 :: values(maxnz), b(maxn)
integer :: rowind(maxnz), colptr(maxn)
! integer :: Ai(maxnz, nsys), Aj(maxn, nsys) ! Sherry added
integer n, nnz, nrhs, ldb, info, iopt
integer*8 :: factors, lufactors(nsys)
real*8 :: A(maxnz, nsys)
integer :: luinfo(nsys)
real*8 :: brhs(maxn,nsys)
integer :: i,j
real*8 :: err, maxerr
integer :: nthread
!$ integer, external :: omp_get_num_threads

! --------------
! read in matrix
! --------------
print*, 'before hbcode1'
call hbcode1(n,n,nnz,values,rowind,colptr)
print*, 'after hbcode1'

nthread = 1
!$omp parallel
!$omp master
!$ nthread = omp_get_num_threads()
!$omp end master
!$omp end parallel
write(*,*) 'nthreads = ',nthread
write(*,*) 'nsys = ',nsys
write(*,*) 'n, nnz ', n, nnz

!$omp parallel do private(j)
do j=1,nsys
A(1:nnz,j) = values(1:nnz)
enddo

nrhs = 1
ldb = n

!$omp parallel do private(j)
do j=1,nsys
brhs(:,j) = j
enddo

! ---------------------
! perform factorization
! ---------------------
iopt = 1

!$omp parallel do private(j,values,b,info,factors)
do j=1,nsys
!$omp parallel workshare
values(1:nnz) = A(1:nnz,j)
b(1:n) = brhs(1:n,j)
!$omp end parallel workshare
info = 0

call c_fortran_dgssv( iopt,n,nnz, nrhs, values, rowind, colptr, &
& b, ldb, factors, info )

!$omp parallel workshare
A(1:nnz,j) = values(1:nnz)
brhs(1:n,j) = b(1:n)
!$omp end parallel workshare
luinfo(j) = info
lufactors(j) = factors
enddo

do j=1,nsys
info = luinfo(j)
if (info.ne.0) then
write(*,9010) j, info
9010 format(' factorization of j=',i7,' returns info= ',i7)
endif
enddo

! ---------------------------------------
! solve the system using existing factors
! ---------------------------------------
iopt = 2
!$omp parallel do private(j,b,values,factors,info)
do j=1,nsys
factors = lufactors(j)
values(1:nnz) = A(1:nnz,j)
info = 0
b(1:n) = brhs(1:n,j)
call c_fortran_dgssv( iopt,n,nnz,nrhs,values,rowind,colptr, &
& b,ldb,factors,info )
lufactors(j) = factors
luinfo(j) = info
brhs(1:n,j) = b(1:n)
enddo

! ------------
! simple check
! ------------
err = 0
maxerr = 0

do j=2,nsys
do i=1,n
err = abs(brhs(i,1)*j - brhs(i,j))
maxerr = max(maxerr,err)
enddo
enddo
write(*,*) 'max error = ', maxerr

! -------------
! free storage
! -------------

iopt = 3
!$omp parallel do private(j)
do j=1,nsys
call c_fortran_dgssv(iopt,n,nnz,nrhs,A(:,j),rowind,colptr, &
& brhs(:,j), ldb, lufactors(j), luinfo(j) )
enddo

stop
end program

jimdempseyatthecove · ‎11-30-2022

Did you add the environment variables:

OMP_STACKSIZE=200M

KMP_STACKSIZE=200M

(though the KMP_STACKSIZE might not be needed)

Coding issues:

    !$omp parallel do private(j,values,b,info,factors)
    do j=1,nsys
    !$omp parallel workshare
    values(1:nnz) = A(1:nnz,j)
    b(1:n) = brhs(1:n,j)
    !$omp end parallel workshare
    info = 0

    call c_fortran_dgssv( iopt,n,nnz, nrhs, values, rowind, colptr, &
    & b, ldb, factors, info )

    !$omp parallel workshare
    A(1:nnz,j) = values(1:nnz)
    brhs(1:n,j) = b(1:n)
    !$omp end parallel workshare

1) Lines 3 and 12 contain "parallel". This instructs the compiler to generate a nested parallel region. If workshare were warranted (it is not), then you would use "!$omp workshare" without the "parallel". I suspect that prior to using the "!$omp parallel do ..." you had the "!$omp parallel workshare".

2) Because values and b are private, and j is both the array slice index into A and brhs and the index of the parallel do, there is no need for any workshare construct: the work is already divided.

3) A quirk of (Intel?) Fortran is that non-allocated local arrays in the PROGRAM procedure are equivalent to SAVE. In other procedures they are on the stack.

The untested suggested code is as follows:

! A simple OpenMP example to use SuperLU to solve multiple independent linear systems.
! Contributor: Ed D'Azevedo, Oak Ridge National Laboratory
!
program tslu_omp
    implicit none
    integer, parameter :: maxn = 10*1000
    integer, parameter :: maxnz = 100*maxn
    integer, parameter :: nsys = 6 !! 64

    ! convert the arrays using maxn and/or maxnz from static/stack to allocatables
    real*8, allocatable :: values(:), b(:)          ! allocate(values(maxnz), b(maxn))
    integer, allocatable :: rowind(:), colptr(:)    ! allocate(rowind(maxnz), colptr(maxn))
    ! integer :: Ai(:,:), Aj(:,:)  ! Sherry added  ! allocate(Ai(maxnz, nsys), Aj(maxn, nsys))
    integer n, nnz, nrhs, ldb, info, iopt
    integer*8 :: factors, lufactors(nsys)
    real*8, allocatable :: A(:,:)                   ! allocate(A(maxnz, nsys))
    integer :: luinfo(nsys)
    real*8, allocatable :: brhs(:,:)                ! allocate(brhs(maxn,nsys))
    integer :: i,j
    real*8 :: err, maxerr
    integer :: nthread
    !$ integer, external :: omp_get_num_threads

    ! allocate the large arrays
    allocate(values(maxnz), b(maxn))
    allocate(rowind(maxnz), colptr(maxn))
    !allocate(Ai(maxnz, nsys), Aj(maxn, nsys))
    allocate(A(maxnz, nsys))
    allocate(brhs(maxn,nsys))
    ! --------------
    ! read in matrix
    ! --------------
    print*, 'before hbcode1'
    call hbcode1(n,n,nnz,values,rowind,colptr)
    print*, 'after hbcode1'

    nthread = 1
    !$omp parallel
    !$omp master
    !$ nthread = omp_get_num_threads()
    !$omp end master
    !$omp end parallel
    write(*,*) 'nthreads = ',nthread
    write(*,*) 'nsys = ',nsys
    write(*,*) 'n, nnz ', n, nnz


    !$omp parallel do private(j)
    do j=1,nsys
        A(1:nnz,j) = values(1:nnz)
    enddo

    nrhs = 1
    ldb = n

    !$omp parallel do private(j)
    do j=1,nsys
        brhs(:,j) = j
    enddo

 


    ! ---------------------
    ! perform factorization
    ! ---------------------
    iopt = 1

    !$omp parallel do private(j,values,b,info,factors)
    do j=1,nsys
        values(1:nnz) = A(1:nnz,j)
        b(1:n) = brhs(1:n,j)
        info = 0

        call c_fortran_dgssv( iopt,n,nnz, nrhs, values, rowind, colptr, &
        & b, ldb, factors, info )

        A(1:nnz,j) = values(1:nnz)
        brhs(1:n,j) = b(1:n)
        luinfo(j) = info
        lufactors(j) = factors
    enddo

    do j=1,nsys
        info = luinfo(j)
        if (info.ne.0) then
            write(*,9010) j, info
9010        format(' factorization of j=',i7,' returns info= ',i7)
        endif
    enddo

    ! ---------------------------------------
    ! solve the system using existing factors
    ! ---------------------------------------
    iopt = 2
    !$omp parallel do private(j,b,values,factors,info)
    do j=1,nsys
        factors = lufactors(j)
        values(1:nnz) = A(1:nnz,j)
        info = 0
        b(1:n) = brhs(1:n,j)
        call c_fortran_dgssv( iopt,n,nnz,nrhs,values,rowind,colptr, &
        & b,ldb,factors,info )
        lufactors(j) = factors
        luinfo(j) = info
        brhs(1:n,j) = b(1:n)
    enddo

    ! ------------
    ! simple check
    ! ------------
    err = 0
    maxerr = 0

    do j=2,nsys
        do i=1,n
            err = abs(brhs(i,1)*j - brhs(i,j))
            maxerr = max(maxerr,err)
        enddo
    enddo
    write(*,*) 'max error = ', maxerr

    ! -------------
    ! free storage
    ! -------------

    iopt = 3
    !$omp parallel do private(j)
    do j=1,nsys
        call c_fortran_dgssv(iopt,n,nnz,nrhs,A(:,j),rowind,colptr, &
        & brhs(:,j), ldb, lufactors(j), luinfo(j) )
    enddo

    stop
end program
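For readers more at home in C, the shape of the revised loop can be sketched as below: one parallel loop over the independent systems, a scratch buffer owned by each iteration, and no inner parallel construct. toy_solve is a made-up stand-in for the SuperLU call, and solve_all is a hypothetical name. If the file is compiled without OpenMP support the pragmas are ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Made-up stand-in for the solver call: halves every entry of b in place. */
static void toy_solve(int n, double *b)
{
    for (int i = 0; i < n; ++i)
        b[i] *= 0.5;
}

/* Process nsys independent right-hand sides stored column by column in
   brhs (leading dimension ldb). Returns 0 on success, -1 if a scratch
   buffer could not be allocated. */
int solve_all(int n, int nsys, int ldb, double *brhs)
{
    assert(n > 0 && ldb >= n);
    int status = 0;

    #pragma omp parallel for
    for (int j = 0; j < nsys; ++j) {
        /* The scratch buffer is on the heap and belongs to this iteration only. */
        double *b = malloc((size_t)n * sizeof *b);
        if (b == NULL) {
            #pragma omp atomic write
            status = -1;
            continue;
        }
        memcpy(b, brhs + (size_t)j * ldb, (size_t)n * sizeof *b);
        toy_solve(n, b);                      /* work for system j only */
        memcpy(brhs + (size_t)j * ldb, b, (size_t)n * sizeof *b);
        free(b);
    }
    return status;
}
```

Because each iteration touches only column j and its own buffer, the outer loop already divides the work, which is the same reasoning as in point 2) above.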

Note, when pasting code sample click on the "..." icon to open additional icons. Then click on the </> icon to insert marked up code/text, select Fortran Markup, paste in the code (edit if necessary)

Jim Dempsey

tat0x · ‎12-01-2022

Thanks, Jim Dempsey.

The revised code worked. The requirements were:

1. Compile with " ifort superlu_OMP.f90 hbcode1.f superlu.lib /Qmkl:sequential /Qopenmp /heap-arrays:0 /F20000000 "

The stack size must be set here with /F20000000; otherwise there is still a stack overflow, even with OMP_NUM_THREADS=1.

The allocatable variables in the modified code remove the stack overflow problem.

2. OMP_STACKSIZE=200M or KMP_STACKSIZE=200M must be set.

3. As you stated, "!$omp parallel workshare" was the main problem: it is why the program stalled or ended without a result.

Note: the OMP_NUM_THREADS environment variable can be set anywhere from 1 to the maximum thread count for testing.

Ary.

jimdempseyatthecove · ‎11-30-2022

By the way... It is required/presumed that c_fortran_dgssv be thread-safe

Jim Dempsey
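One common way for a C routine to stay thread-safe is to keep all of its state behind a handle owned by the caller instead of in static variables. The integer*8 factors argument in the test program looks like such a handle, but that is an assumption about c_fortran_dgssv, not something verified here. A minimal sketch with made-up names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Made-up per-system state; a real solver would keep its factors here. */
typedef struct {
    double scale;
} toy_factors;

/* "Factorize": allocate the state and return it as a 64-bit integer
   handle, which fits a Fortran integer*8. Returns 0 if allocation fails. */
int64_t toy_factor(double scale)
{
    toy_factors *f = malloc(sizeof *f);
    if (f == NULL)
        return 0;
    f->scale = scale;
    return (int64_t)(intptr_t)f;
}

/* "Solve": reads only the state reachable from the handle, no globals. */
double toy_apply(int64_t handle, double x)
{
    assert(handle != 0);
    const toy_factors *f = (const toy_factors *)(intptr_t)handle;
    return x / f->scale;
}

/* Release the state behind a handle. */
void toy_free(int64_t handle)
{
    free((void *)(intptr_t)handle);
}
```

With this layout, two threads holding different handles never share mutable data, which is why the test program keeps factors private and stores one handle per system in lufactors(j).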

tat0x · ‎01-09-2023

I just found out how to use slu_Cnames.h, which supports multiple platforms:

#if (F77_CALL_C == UPCASE)

...

#endif

Just go to SRC, build the static SuperLU lib there, and also copy c_fortran_dgssv.c and the other c_fortran wrappers to SRC:

icl *.c /c /QaxCORE-AVX2 -DF77_CALL_C=UPCASE /Ox /Qpar
lib *.obj /OUT:libsuperlu.lib

This avoids the conflict with libucrt that occurs when using CMake.

The uppercase call to c_fortran_dgssv from Intel Fortran now works.

Meanwhile, perhaps someone will write a c_fortran_dgssv.f90 that uses BIND(C).
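For anyone wondering why -DF77_CALL_C=UPCASE can work from the command line, here is a miniature of that kind of header. It is a made-up sketch, not the contents of slu_Cnames.h: the mode values, toy_solver and exported_symbol are all hypothetical. The first #define stands in for the command-line switch.

```c
#include <assert.h>

/* Stand-in for -DF77_CALL_C=UPCASE on the command line. UPCASE is not
   defined yet at this point, which is fine: macros are expanded only when
   the #if below is evaluated. */
#ifndef F77_CALL_C
#define F77_CALL_C UPCASE
#endif

/* Hypothetical codes for two naming conventions. */
#define ADD_   0   /* lowercase with trailing underscore */
#define UPCASE 3   /* all uppercase, no underscore */

#if (F77_CALL_C == ADD_)
#define toy_solver toy_solver_
#elif (F77_CALL_C == UPCASE)
#define toy_solver TOY_SOLVER
#endif

/* Written once with the generic name; the preprocessor renames it. */
int toy_solver(int n)
{
    assert(n >= 0);
    return 2 * n;
}

/* Two-step stringification so the macro is expanded before quoting. */
#define STR2(x) #x
#define STR(x) STR2(x)

/* Name the linker will actually see for toy_solver. */
const char *exported_symbol(void)
{
    return STR(toy_solver);
}
```

Here exported_symbol() returns "TOY_SOLVER", the name an upper-casing Fortran compiler would look for. A switch such as -DUPCASE only defines the mode constant itself and never sets the selector, which may be why it had no effect earlier in the thread.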