
Automatic Offload not working for dgetrf, dgetri

Vishal1
Beginner

I'm having some trouble getting Automatic Offload to work with the MKL dgetrf & dgetri routines on our server with two Phi cards. The dgemm routines in this code work just fine. Here's the build command -

icpc -c -fpic -shared  -std=c++11 -O3 -xHost -ip -ipo3 -parallel -funroll-loops -fno-alias -fno-fnalias -fargument-noalias -mkl -I include/ -I ~/Documents/Boost/boost_1_53_0/ src/PRH.cpp -o src/obj/PRH.o

Here's the OFFLOAD_REPORT generated when the code runs -

Reading in data...
Data read in.
Beginning PRH computation...
[MKL] [MIC --] [AO Function]    DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]    0.12 0.44 0.44
[MKL] [MIC 00] [AO DGEMM CPU Time]    23.427545 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]    19.126605 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data]    7158788000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data]    22186080000 bytes
[MKL] [MIC 01] [AO DGEMM CPU Time]    23.427545 seconds
[MKL] [MIC 01] [AO DGEMM MIC Time]    19.060497 seconds
[MKL] [MIC 01] [AO DGEMM CPU->MIC Data]    7158788000 bytes
[MKL] [MIC 01] [AO DGEMM MIC->CPU Data]    22186080000 bytes
LnProb = -469708
Current Runtime (s): 2305.02
PRH computation finished.
Average Runtime (s): 2305.02

real    2m58.020s
user    38m16.392s
sys    0m12.115s

Why aren't the dgetrf and dgetri calls being offloaded?
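
For context, here's roughly the shape of the call sequence in question (a minimal sketch, not the actual PRH.cpp code; the size and dummy matrix are placeholders, and AO is requested with mkl_mic_enable(), equivalent to setting MKL_MIC_ENABLE=1 in the environment):

#include <mkl.h>
#include <vector>
#include <iostream>

int main() {
    const MKL_INT n = 24000;                              // placeholder problem size
    std::vector<double> A(static_cast<size_t>(n) * n, 0.0);
    for (MKL_INT i = 0; i < n; ++i) A[i * n + i] = 1.0;   // dummy identity matrix
    std::vector<MKL_INT> ipiv(n);

    mkl_mic_enable();                                     // request Automatic Offload

    // LU factorization, then inversion from the LU factors.
    MKL_INT info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A.data(), n, ipiv.data());
    if (info == 0)
        info = LAPACKE_dgetri(LAPACK_COL_MAJOR, n, A.data(), n, ipiv.data());

    std::cout << "info = " << info << std::endl;
    return 0;
}

It's built with the same flags as above and run with OFFLOAD_REPORT=2 set.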

7 Replies
TimP
Honored Contributor III

?getr? routines are documented in http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf as not being subject to automatic offload. dtrsm looks like the most likely of the functions called by dgetrf to gain performance by running on MIC. Did you test with explicit offload and find a gain? MIC is supported primarily on 16- and 24-core Xeon servers, which have pretty good MKL performance over a wider range of problems.
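
If you want to try that, a compiler-assisted explicit-offload experiment could look something like this (a sketch only; the function name and sizes are placeholders and I haven't run it on your configuration):

#include <mkl.h>

// Sketch: factor an n-by-n column-major matrix explicitly on coprocessor 0.
// A and ipiv live on the host; the pragma copies them to the card and back.
void factor_on_mic0(double *A, MKL_INT *ipiv, MKL_INT n) {
    MKL_INT info = 0;
    #pragma offload target(mic:0) inout(A : length(n * n)) \
                                  out(ipiv : length(n)) out(info)
    {
        dgetrf(&n, &n, A, &n, ipiv, &info);   // runs on the coprocessor-side MKL
    }
    // info != 0 would indicate a failed factorization.
    (void)info;
}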

Vishal1
Beginner

Hi Tim,

The top of page 2 in the link you sent me says

"In the current MKL release (11.0), the following Level-3 BLAS functions and LAPACK functions are AO-enabled: ?GEMM, ?SYMM, ?TRMM, and ?TRSM & LU, QR, Cholesky factorizations"

I'm using dgetrf - that's just LU decomposition, isn't it (sorry, new to LAPACK)? Doesn't the document suggest that it's one of the functions that should be automatically offloaded?

Vishal1
Beginner

I seem to have run into another problem as well.

The same code written using dsymm seg-faults, while the version with dgemm runs fine. Is this a known issue? The matrix size is 24,000 x 24,000 and I'm offloading to 2 Phis. Disabling the MICs (using mkl_mic_disable()) just before the offending dsymm call lets the code run fine.
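
In code, the workaround looks roughly like this (a sketch of the description above, with placeholder matrices and function name):

#include <mkl.h>

// Sketch: disable AO only around the problematic DSYMM so that the
// surrounding DGEMM calls can still offload. A, B, C are n-by-n
// column-major matrices (placeholders for the real ones).
void symm_host_only(const double *A, const double *B, double *C, MKL_INT n) {
    mkl_mic_disable();                  // the next MKL calls run host-only
    cblas_dsymm(CblasColMajor, CblasLeft, CblasUpper,
                n, n, 1.0, A, n, B, n, 0.0, C, n);
    mkl_mic_enable();                   // restore Automatic Offload afterwards
}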

Vishal1
Beginner

Any idea why dsymm gives the segmentation fault on large-ish matrices when dgemm works fine for the same matrices? When I looked at the local terminal for the machine, it had many lines of report that looked like...

micscif_rma_tc_can_cache 1540 total = 77319, current = 79624 reached max

Is there an offload bug with dsymm?
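
Here's a self-contained sketch of the kind of standalone repro I could put together (my own construction, not the actual PRH.cpp code; run with MKL_MIC_ENABLE=1 and OFFLOAD_REPORT=2):

#include <mkl.h>
#include <vector>

int main() {
    const MKL_INT n = 24000;                              // the size that fails for me
    const size_t elems = static_cast<size_t>(n) * n;
    std::vector<double> A(elems, 1.0), B(elems, 1.0), C(elems, 0.0);  // A is trivially symmetric

    mkl_mic_enable();                                     // Automatic Offload on
    cblas_dsymm(CblasColMajor, CblasLeft, CblasUpper,
                n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    return 0;
}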

TimP
Honored Contributor III

You may have to watch memory, total stack, and thread stack consumption.  Your problem size seems large, particularly for the 8GB RAM coprocessors.  I guess automatic offload should split up the matrix according to your specification, so you may be able to find a split which doesn't overrun the coprocessors.  I've never run on a dual-coprocessor system, so I'd say this is beyond my experience.
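
For scale, a single 24000 x 24000 double-precision matrix is about 4.6 GB, so even two full operands push the limit of an 8 GB card. One thing you could try (a sketch of the AO work-division API; the fractions are placeholders to experiment with, and I haven't verified this on a dual-card system) is to cap each card's share explicitly:

#include <mkl.h>

// Sketch: keep most of the work on the host and give each coprocessor a
// smaller slice so its share of the matrices fits in card memory.
void limit_mic_share() {
    mkl_mic_enable();
    mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 0.70);  // host keeps ~70%
    mkl_mic_set_workdivision(MKL_TARGET_MIC,  0, 0.15);  // card 0 gets ~15%
    mkl_mic_set_workdivision(MKL_TARGET_MIC,  1, 0.15);  // card 1 gets ~15%
}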

Vishal1
Beginner

At the moment, I'm letting automatic offload (AO) handle the matrix splitting. It seems to do that just fine when doing DGEMM. It's the DSYMM call that is causing problems with the exact same matrix. Shouldn't DSYMM be offloading less data (roughly half as much) as DGEMM for this matrix? Does this sound like a bug? Is there a bug reporting service that I should submit this to?

TimP
Honored Contributor III

MKL issues may be submitted under the premier.intel.com account, which is created automatically when you register the compiler.  If you didn't register the compiler, you can do so at https://registrationcenter.intel.com.
