I have to write a program to process a large amount of data. Since most of the processing involves matrix and vector operations I wanted to use MKL to take advantage of the optimized library. I created a toy example in C++ and OpenMP that runs relatively fast on my desktop computer with an Intel core i7, 8 threads (it takes about 10 minutes to do all the computations using MKL, specifically, function cblas_dgemv()).
I was given access to a Linux node with a 12-core Intel Xeon processor and 2 Intel Xeon Phi coprocessors (MICs) with 61 cores each to run the program once it is ready. When I moved this program to the Xeon node, it took about an hour to complete, whereas I expected the increased processing power to make quick work of the problem.
I think I read somewhere that MKL would transparently use the MICs when available. Is this true? Am I missing a compiler directive to make the compiler generate code or some other set up to run MKL functions in the MICs? What could be making my code run significantly slower, even if it is not using the MICs?
Finally, if running MKL on the MICs is not that transparent, what do I have to do to make MKL computations use them?
All help is appreciated. Thanks.
Let's go through your questions one at a time.
1. Intel Core i7, 8 threads, 10 minutes vs. 12-core Intel Xeon processor, 1 hour
You mentioned: "toy example in C++ and OpenMP that runs relatively fast on my desktop computer with an Intel core i7, 8 threads (it takes about 10 minutes to do all the computations using MKL, specifically, function cblas_dgemv())."
Could you please give more details about your example, such as the MKL version, compiler, problem size, OS, etc.?
Alternatively, there is an MKL tutorial and code sample at https://software.intel.com/sites/default/files/mkl_c_samples_09072016.tgz. Could you run one of the samples and let us know the result? (Please note: use a large workload to make sure the power of the Xeon processor is actually utilized.)
2. MKL on Core or on MIC.
MKL supports 3 models for working on MIC:
1. Native (the whole executable runs on the MIC).
2. Automatic Offload (AO) (the executable runs on the Xeon; part of the work is done on the Xeon and part on the MIC). In this model, MKL automatically detects the presence of Xeon Phi coprocessors based on the Intel MIC architecture and automatically offloads computations that may benefit from them. The only change needed to enable AO is either setting an environment variable or making a single function call.
3. Compiler Assisted Offload (CAO). In this model you use the Intel compiler and its offload pragmas to offload computations to the coprocessors. Within the offload section, you have to specify the input and output data for the Intel MKL functions to be offloaded. The compiler-provided run-time libraries then transfer the functions and their data to the coprocessor, which does the computations.
Please click on the link below to find many related articles and videos on how to use MKL on Intel Xeon Phi.
From your description ("Intel Xeon processor and 2 Intel Xeon Phi coprocessors (MICs) with 61 cores") and your question about MKL using the MICs transparently, I suppose you expect to see the AO model.
But as the article mentions, the coprocessors are intended for highly compute-intensive workloads, and a BLAS level 2 function like dgemv is not in the list of functions that AO offloads.
So the example you built probably runs only on the Xeon and not on the MIC. Also, dgemv is a BLAS level 2 function that does only O(n^2) work; I would recommend trying dgemm, at least for a performance test.
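To see why dgemv is a poor candidate for offload compared with dgemm, it helps to compare how much arithmetic each routine does per matrix or vector element it has to touch. The sketch below is only a back-of-the-envelope model, not an MKL call: the helper names gemv_intensity and gemm_intensity are made up for illustration, the matrices are assumed to be square (n x n), and each operand is assumed to be read or written exactly once.

```cpp
#include <cassert>

// Hypothetical helpers (not part of MKL): floating-point operations per
// double touched, assuming square n x n matrices and that every operand
// is moved exactly once.

// dgemv, y = A*x: about 2*n^2 flops over n^2 + 2*n doubles (A, x, y).
double gemv_intensity(double n) {
    assert(n > 0.0);
    return (2.0 * n * n) / (n * n + 2.0 * n);
}

// dgemm, C = A*B: about 2*n^3 flops over 3*n^2 doubles (A, B, C).
double gemm_intensity(double n) {
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

Under these assumptions, gemv_intensity(1000.0) is just under 2 flops per double no matter how large n gets, while gemm_intensity(1000.0) is about 667 and keeps growing linearly with n. A routine that does so little work per element tends to be limited by data movement rather than by the number of cores, which is consistent with the recommendation to use dgemm for a performance test.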