MKL automatic offload

D__Pan · ‎04-04-2013

I started to play with Automatic Offload (AO), using the example code dgemm_with_timing.F (attached). With MKL_MIC_ENABLE=1, OFFLOAD_REPORT=2, and a matrix size of M/N=4000, which should be large enough for AO, I expected the code to offload automatically and print the offload info, but I didn't see the report. Isn't OFFLOAD_REPORT=2 supposed to set the offload profiling report level for any offload, including Intel MKL AO? Or is it possible that the code is not offloaded at all? The timing does not vary much with the different MIC_OMP_NUM_THREADS values I tried, so that may be the case. What did I miss?

I compiled with

ifort -mkl -offload-build -offload-copts="-vec-report3" dgemm_with_timing.F

and ran with

setenv MKL_MIC_ENABLE 1
setenv MIC_ENV_PREFIX MIC
setenv OFFLOAD_REPORT 2
setenv HOST_WORKDIVISION 0
setenv MKL_MIC0_WORKDIVISION 100
setenv MIC_OMP_NUM_THREADS 236
a.out

..........

Number of Target devices installed: 1

Intializing matrix data

Making the first run of matrix product using
Intel(R) MKL DGEMM subroutine to get stable
run time measurements

Measuring performance of matrix product using
Intel(R) MKL DGEMM subroutine

== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 27.83520 milliseconds ==

It is highly recommended to set parameter LOOP_COUNT
for this example on your computer as 36 to have
total execution time about 1 second for reliability
of measurements

Example completed.

No offload report appears after the run completes. Any ideas?

Sumedh_N_Intel · ‎04-04-2013

It seems you are using an older version of the compiler and the libraries. The -offload-build and -offload-copts switches have been deprecated, and the environment variables have changed as well. Could you please upgrade your compiler and libraries and check whether you still see the problem?

Please refer to the following documentation for more information:

Fortran Reference Manual: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-lin/index.htm

MKL reference Manual: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm

MKL Environment Variables for Intel MIC architecture: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm
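
As a rough illustration of the run-time settings, here is a minimal sketch in POSIX sh syntax (csh users would use `setenv NAME value` instead of `export NAME=value`). The variable names MKL_HOST_WORKDIVISION and MKL_MIC_WORKDIVISION, and the 0.0 to 1.0 value range, are assumptions based on the MKL 11.x user guide linked above. They are not confirmed anywhere in this thread, so please check them against the documentation for your installed version.

```shell
#!/bin/sh
# Sketch only: the names and value ranges below are assumptions taken from the
# MKL 11.x user guide; verify them against your installed version.
export MKL_MIC_ENABLE=1            # turn on Automatic Offload
export OFFLOAD_REPORT=2            # request the offload profiling report
export MKL_HOST_WORKDIVISION=0.0   # assumed: fraction of work kept on the host (0.0 to 1.0)
export MKL_MIC_WORKDIVISION=1.0    # assumed: fraction of work sent to the coprocessor(s)
export MIC_ENV_PREFIX=MIC          # prefix for variables forwarded to the card
export MIC_OMP_NUM_THREADS=236     # OpenMP threads on the coprocessor

# Show the settings the program will see before launching it.
env | grep -E '^(MKL_|MIC_|OFFLOAD_)' | sort
```

Printing the variables before running a.out makes it easy to spot a misspelled name, since an unrecognized variable is simply ignored rather than reported as an error.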

Zhang_Z_Intel · ‎04-10-2013

OFFLOAD_REPORT is a common setting for Intel MKL and the Intel compilers. It specifies the profiling report level for any offload, including Intel MKL Automatic Offload. However, OFFLOAD_REPORT support for AO is only available in very recent MKL versions (starting with 11.0.2). Which version of MKL are you using?

D__Pan · ‎04-12-2013

Thank you both, Sumedh and Zhang. I wanted to post a message a couple of days ago, but my message was repeatedly rejected by the spam filter. I will try one more time.

Yes, I was using Intel Fortran 13.0.1.117, and I am now using 13.1.163 per Sumedh's suggestion. But it turned out that the compiler version is NOT the problem -- the matrix size is.

According to the Intel article (http://software.intel.com/en-us/articles/types-of-applications-that-benefit-from-mkl-for-xeon-phi), ?GEMM routines enable AO only when M, N > 2048. So given C(M,N)=alpha*A(M,K)*B(K,N), I set the matrix sizes to A(4000,200) and B(200,4000), thinking they were large enough for AO. But I was wrong. I found that K has to be at least 300 for the offload to be triggered. In addition, even with M/N=1600 and K=300, the offload still occurs, which means the article is quite misleading.

Zhang_Z_Intel · ‎04-15-2013

For automatic offload of ?GEMM routines, K has to be at least 256, and both M and N have to be at least 2048. You are right, it was our mistake not to include the requirement for K in the online article. I'll fix it in the next version.
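
To make the size rule concrete, here is a minimal sketch of a check against the thresholds stated above (M and N at least 2048, K at least 256). The function name `ao_eligible` is hypothetical and not part of MKL. The actual cutoffs inside the library may differ by version, as the M/N=1600 observation earlier in the thread suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of MKL): reports whether a ?GEMM call with
# dimensions M, N, K meets the Automatic Offload thresholds quoted in this
# thread (M >= 2048, N >= 2048, K >= 256). Real cutoffs may vary by version.
ao_eligible() {
    m=$1; n=$2; k=$3
    if [ "$m" -ge 2048 ] && [ "$n" -ge 2048 ] && [ "$k" -ge 256 ]; then
        echo "eligible"
    else
        echo "not eligible"
    fi
}

ao_eligible 4000 4000 200   # original failing case, K too small: prints "not eligible"
ao_eligible 4000 4000 300   # case that triggered offload: prints "eligible"
```

A check like this only mirrors the documented rule. Whether a given call was really offloaded is still best confirmed with OFFLOAD_REPORT.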

Thanks!

Sumedh_N_Intel · ‎04-15-2013

That explains it. When I replied to this post, I made a point of checking whether the matrices met the size requirements, so I figured it must be an older compiler. Thank you, Zhang, for answering that.

TimP · ‎04-15-2013

I tried the original non-automatic offload syntax recently, and it was still working. It allows the "C" matrix to be declared explicitly as out-only in the case of pure matrix multiplication (beta==0). Of course, when the automatic offload criteria aren't met, it's no surprise when performance is unsatisfactory. The hand-optimized code on MIC can still be effective for K as small as 32 (compared with open source compiled matrix multiplication), but the data shuffling and the relatively high performance of MKL on a multi-core host make offloading an individual ?gemm call unproductive.