I started experimenting with Automatic Offload (AO), using the example code dgemm_with_timing.F (attached). With MKL_MIC_ENABLE=1, OFFLOAD_REPORT=2, and a matrix size of M/N=4000, which should be large enough for AO, I expected the code to offload automatically and print the offload report, but I didn't see any report. Isn't OFFLOAD_REPORT=2 supposed to set the offload profiling report level for any offload, including Intel MKL AO? Or is it possible that the code is not offloaded at all? The timing does not vary much with the different MIC_OMP_NUM_THREADS values I specified, so that may well be the case. What did I miss?
I compiled with
ifort -mkl -offload-build -offload-copts="-vec-report3" dgemm_with_timing.F
and ran with
setenv MKL_MIC_ENABLE 1
setenv MIC_ENV_PREFIX MIC
setenv OFFLOAD_REPORT 2
setenv HOST_WORKDIVISION 0
setenv MKL_MIC0_WORKDIVISION 100
setenv MIC_OMP_NUM_THREADS 236
a.out
..........
Number of Target devices installed: 1
Intializing matrix data
Making the first run of matrix product using
Intel(R) MKL DGEMM subroutine to get stable
run time measurements
Measuring performance of matrix product using
Intel(R) MKL DGEMM subroutine
== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 27.83520 milliseconds ==
It is highly recommended to set parameter LOOP_COUNT
for this example on your computer as 36 to have
total execution time about 1 second for reliability
of measurements
Example completed.
No offload report appears after the run completes. Any ideas?
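As a rough sanity check on the numbers in this output, the sketch below converts the reported time into an approximate DGEMM rate and reproduces the suggested LOOP_COUNT. It assumes M = N = 4000 and K = 200 (K is only given later in this thread) and uses the conventional 2*M*N*K operation count. The function names are illustrative and not part of the example program.

```python
def dgemm_gflops(m, n, k, milliseconds):
    """Approximate DGEMM rate in GFLOP/s using the conventional 2*M*N*K flop count."""
    flops = 2.0 * m * n * k
    return flops / (milliseconds * 1e-3) / 1e9


def loop_count_for(target_seconds, milliseconds_per_call):
    """Number of calls needed for the total run time to reach roughly target_seconds."""
    return round(target_seconds * 1000.0 / milliseconds_per_call)


# Assumed sizes: M = N = 4000, K = 200; the time is the one printed by the run above.
rate = dgemm_gflops(4000, 4000, 200, 27.83520)
loops = loop_count_for(1.0, 27.83520)
print(f"{rate:.1f} GFLOP/s, LOOP_COUNT ~ {loops}")
```

The loop count obtained this way agrees with the value of 36 that the program itself recommends.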
It seems you are using an older version of the compiler and the libraries. The -offload-build and -offload-copts switches have been deprecated, and the environment variables have also changed. Could you please upgrade your compiler and libraries and check whether you still see the problem?
Please refer to the following documentation for more information:
Fortran Reference Manual: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-lin/index.htm
MKL reference Manual: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm
MKL Environment Variables for Intel MIC architecture: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm
OFFLOAD_REPORT is a common setting for Intel MKL and the Intel compilers. It specifies the profiling report level for any offload, including Intel MKL Automatic Offload. However, OFFLOAD_REPORT support for AO is only available in very recent MKL versions (starting with 11.0.2). Which version of MKL are you using?
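A minimal sketch of that version check, assuming the MKL version is available as a dotted string such as "11.0.2". The helper names are hypothetical, and how you obtain the version string from your installation is left out.

```python
# Minimum MKL version with OFFLOAD_REPORT support for AO, as stated in this thread.
MIN_AO_REPORT_VERSION = (11, 0, 2)


def parse_version(text):
    """Split a dotted version string such as '11.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def ao_report_supported(mkl_version):
    """True if the given MKL version is at least 11.0.2."""
    return parse_version(mkl_version) >= MIN_AO_REPORT_VERSION


print(ao_report_supported("11.0.1"))
print(ao_report_supported("11.0.2"))
```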
Thank you both, Sumedh and Zhang. I wanted to post a message a couple of days ago, but it was repeatedly rejected by the spam filter. I will try one more time.
Yes, I was using Intel Fortran 13.0.1.117, and I am now using 13.1.163 per Sumedh's suggestion. But it turned out that the compiler version is NOT the problem; the matrix size is.
According to the Intel article (http://software.intel.com/en-us/articles/types-of-applications-that-benefit-from-mkl-for-xeon-phi), the ?GEMM routines enable AO only when M, N > 2048. So, given C(M,N)=alpha*A(M,K)*B(K,N), I set the matrix sizes to A(4000,200) and B(200,4000), thinking they were large enough for AO. But I was wrong. I found that K has to be at least 300 for the offload to be triggered. In addition, the offload still occurs even with M/N=1600 and K=300, which means the article is quite misleading.
For automatic offload of the ?GEMM routines, K has to be at least 256, and both M and N have to be at least 2048. You are right, it was our mistake not to include the requirement for K in the online article. I'll fix it in the next version.
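The thresholds stated here can be written down as a small pre-check. This is only a sketch of the rule as quoted in this reply, not MKL's internal logic. The function and constant names are made up for illustration, and the behaviour observed earlier in the thread (offload at M/N=1600, K=300) suggests the library's actual decision may differ.

```python
# Thresholds for ?GEMM Automatic Offload as quoted in this thread.
MIN_M = 2048
MIN_N = 2048
MIN_K = 256


def gemm_meets_ao_thresholds(m, n, k):
    """True if C(M,N) = alpha*A(M,K)*B(K,N) satisfies the stated AO size rule."""
    return m >= MIN_M and n >= MIN_N and k >= MIN_K


# The sizes from the original question: K = 200 is below the stated minimum.
print(gemm_meets_ao_thresholds(4000, 4000, 200))
# The same M and N with K raised to 300.
print(gemm_meets_ao_thresholds(4000, 4000, 300))
```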
Thanks!
That explains it. When I replied to this post, I made a point of checking whether the matrices met the size requirements, so I figured it must be an older compiler. Thank you, Zhang, for answering that.
I tried the original non-automatic offload syntax recently, and it was still working. It allows the "C" matrix to be declared explicitly as out-only in the case of pure matrix multiplication (beta==0). Of course, when the automatic offload criteria aren't met, it's no surprise that performance is unsatisfactory. The hand-optimized code on MIC can still be effective for K as small as 32 (compared with open-source compiled matrix multiplication), but the data shuffling and the relatively high performance of MKL on a multi-core host make offloading an individual ?gemm call unproductive.