
OP1 · 05-19-2016

The code below behaves differently when built with /Qmkl:parallel than when built with /Qmkl:cluster. In both cases it is built for 64-bit Windows 7 using the latest Intel compiler and libraries, and it is launched as an MPI process with mpiexec.exe -n 2 (that is, using only two ranks) on a dual 6-core workstation.

When /Qmkl:parallel is used, the calls to the MKL functions on rank 0 take advantage of the 6 OpenMP threads available there.
When /Qmkl:cluster is used, only one thread on rank 0 is used, so the run is more than six times slower (6.38 s versus 0.92 s in the output below).

Any idea how to get threaded behavior with /Qmkl:cluster?

Also, why is LWORK more than twice as large in the /Qmkl:parallel case (742977 versus 288096)?

PROGRAM MAIN
USE OMP_LIB
USE MPI
IMPLICIT NONE

INTEGER(KIND=4)             :: N,ALLOC_ERROR,INFO,LWORK,SEED_SIZE,I,IERR
INTEGER(KIND=8)             :: CLOCK_START,CLOCK_STOP,CLOCK_RATE,CLOCK_MAX
INTEGER(KIND=4),ALLOCATABLE :: SEEDS(:)
LOGICAL                     :: MPI_IS_INITIALIZED
REAL(KIND=8)                :: W(1)
REAL(KIND=8),ALLOCATABLE    :: A(:,:),TAU(:),WORK(:)

CALL MPI_INITIALIZED(MPI_IS_INITIALIZED,IERR)
IF (.NOT.MPI_IS_INITIALIZED) THEN
    CALL MPI_INIT(IERR)
END IF

WRITE(*,*) 'I am image ',THIS_IMAGE(),' and I can span ',OMP_GET_MAX_THREADS(),' OpenMP threads.'

IF (THIS_IMAGE()==1) THEN

    N = 3000
    WRITE(*,*) 'N     = ',N
    ALLOCATE(A(N,N),STAT=ALLOC_ERROR)
    IF (ALLOC_ERROR/=0) THEN
        ERROR STOP
    END IF
    CALL RANDOM_SEED(SIZE=SEED_SIZE)
    ALLOCATE(SEEDS(SEED_SIZE))
    SEEDS=123456
    CALL RANDOM_SEED(PUT=SEEDS)
    CALL RANDOM_NUMBER(A)

    ALLOCATE(TAU(N),STAT=ALLOC_ERROR)
    ! Workspace query: LWORK = -1 makes DGEQRF return the optimal size in W(1).
    LWORK = -1
    CALL DGEQRF(N,N,A,N,TAU,W,LWORK,INFO)
    WRITE(*,*) 'LWORK = ',W(1)
    LWORK = INT(W(1))
    ALLOCATE(WORK(LWORK),STAT=ALLOC_ERROR)

    CALL SYSTEM_CLOCK(CLOCK_START,CLOCK_RATE,CLOCK_MAX)
    CALL DGEQRF(N,N,A,N,TAU,WORK,LWORK,INFO)
    CALL DORGQR(N,N,N,A,N,TAU,WORK,LWORK,INFO)
    CALL SYSTEM_CLOCK(CLOCK_STOP,CLOCK_RATE,CLOCK_MAX)
    WRITE(*,*) 'INFO  = ',INFO
    WRITE(*,*) 'TIME  = ',(REAL(CLOCK_STOP-CLOCK_START,KIND=8))/REAL(CLOCK_RATE,KIND=8)

END IF

END PROGRAM MAIN

Here is the output when using /Qmkl:cluster:

I am image  2  and I can span  6  OpenMP threads.
I am image  1  and I can span  6  OpenMP threads.
N      =  3000
LWORK  =  288096
INFO   =  0
TIME   =  6.38000000000000
A(N,N) =  -2.110006751937421E-002

Here is the output when using /Qmkl:parallel:

I am image  2  and I can span  6  OpenMP threads.
I am image  1  and I can span  6  OpenMP threads.
N      =  3000
LWORK  =  742977
INFO   =  0
TIME   =  0.920000000000000
A(N,N) =  -2.110006751937324E-002

Here is the build log (when using /Qmkl:cluster):

Compiling with Intel(R) Visual Fortran Compiler 17.0 [Intel(R) 64]...
ifort /nologo /O2 /I"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.0.048\windows\mpi\intel64\include" /Qopenmp /standard-semantics /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc120.pdb" /libs:dll /threads /Qmkl:cluster /c /Qcoarray:single /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\\bin\amd64" /Qm64 "D:\TEMP\QR_PERFORMANCE\MAIN.F90"
Linking...
Link /OUT:"x64\Release\QR_PERFORMANCE.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.0.048\windows\mpi\intel64\lib\release_mt" /MANIFEST /MANIFESTFILE:"x64\Release\QR_PERFORMANCE.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\TEMP\QR_PERFORMANCE\x64\Release\QR_PERFORMANCE.lib" impi.lib -qm64 /qoffload-ldopts="-mkl=cluster" "x64\Release\MAIN.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"D:\TEMP\QR_PERFORMANCE\x64\Release\QR_PERFORMANCE.exe;#1" /manifest "x64\Release\QR_PERFORMANCE.exe.intermediate.manifest"

QR_PERFORMANCE - 0 error(s), 0 warning(s)

OP1 · 05-19-2016

Edit: I found the proper threading libraries to solve the problem:

mkl_scalapack_lp64_dll.lib mkl_intel_lp64_dll.lib mkl_core_dll.lib mkl_intel_thread_dll.lib mkl_blacs_lp64_dll.lib impi.lib
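
For anyone hitting the same thing, a quick way to check which MKL threading layer actually got linked is to ask MKL how many threads it is willing to use. As far as I understand, /Qmkl:cluster pulls in the sequential MKL layer, which would explain the single thread. The small program below is only a sketch and was not part of my original test. It assumes the MKL service function MKL_GET_MAX_THREADS is available in the linked libraries, and the program name is made up. Build it once with /Qmkl:cluster and once with the explicit library list above.

```fortran
PROGRAM CHECK_MKL_THREADS
USE OMP_LIB
IMPLICIT NONE

! MKL service function, declared explicitly here (assumption: 4-byte INTEGER
! result, as in the LP64 interface) instead of including an MKL header.
INTEGER(KIND=4), EXTERNAL :: MKL_GET_MAX_THREADS

! Thread count the OpenMP runtime would allow for this process.
WRITE(*,*) 'OpenMP max threads = ', OMP_GET_MAX_THREADS()

! Thread count MKL itself will use; a value of 1 here while OpenMP reports
! more suggests that a non-threaded MKL layer was linked.
WRITE(*,*) 'MKL max threads    = ', MKL_GET_MAX_THREADS()

END PROGRAM CHECK_MKL_THREADS
```

If the two numbers disagree, the OpenMP runtime is fine and the difference comes from the MKL libraries on the link line, which matches what I saw with DGEQRF and DORGQR.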

Poor (non-threaded) performance of /Qmkl:cluster compared to /Qmkl:parallel