Software Archive
Read-only legacy content

performance difference between AO and CAO

Hong-Hyun_P_
Beginner

Hi,
I see a performance difference between the AO and CAO models when calling the MKL zgemm (or dgemm) routines. In my tests, AO works as expected, but CAO shows poor performance compared to AO. For example, an AO call to zgemm with a matrix size of 10k takes 11.5 seconds, while a CAO call to the same zgemm inside "#pragma offload target(mic)" takes 26.3 seconds. I can also see a difference in MIC usage between the two models in the attached capture, where 2 MPI processes are running, each with its own MIC card. What could be causing this difference? Could you please give me any advice?

Thanks,
Hong

### CAO code for zgemm ###

/* Amic (and, analogously, the other *mic pointers) is a file-scope pointer
   declared with __attribute__((target(mic))) and allocated with _mm_malloc;
   see my follow-up post below. */
void cao_mkl_zgemm(int micid, char *transa, char *transb, int M, int N, int K, MKL_Complex16 *alpha, MKL_Complex16 *A, int lda, MKL_Complex16 *B, int ldb, MKL_Complex16 *beta, MKL_Complex16 *C, int ldc)
{
  #pragma offload target(mic: micid) \
  in(transa, transb, M, N, K, lda, ldb, ldc) \
  in(alpha[0:1] : into (amic[0:1]) align(64)) \
  in(beta[0:1] : into (bmic[0:1]) align(64)) \
  in(A[0:(M*K)] : into (Amic[0:(M*K)]) free_if(0) align(64)) \
  in(B[0:(K*N)] : into (Bmic[0:(K*N)]) free_if(0) align(64)) \
  in(C[0:(M*N)] : into (Cmic[0:(M*N)]) free_if(0) align(64)) \
  out(Cmic[0:(M*N)] : into (C[0:(M*N)]) align(64))
  {
    zgemm_(transa, transb, &M, &N, &K, amic, Amic, &lda, Bmic, &ldb, bmic, Cmic, &ldc);
  }
}

7 Replies
Roman_D_Intel1
Employee

Hi Hong, the main differences are that MKL AO uses the host CPU in parallel with the offload, and that it uses double buffering to overlap computation and communication, which keeps the MIC side busy all the time. I think both are possible to implement with #pragma offload, using OpenMP on the host and asynchronous offload via the 'signal' clause.
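
For illustration, a minimal, untested sketch of the 'signal'/'offload_wait' pattern; the function name, matrix sizes, and the dgemm call are only placeholders, and only the pragma pattern is the point:

#include <mkl.h>

/* Sketch of asynchronous offload with the 'signal' clause. The matrices are
   assumed to be n x n; dgemm is just a stand-in for the coprocessor work. */
void async_offload_sketch(double *A, double *B, double *C, int n)
{
  char sig;                 /* any variable address can serve as the signal tag */
  double one = 1.0, zero = 0.0;

  /* Start the transfer + compute on the coprocessor without blocking the host. */
  #pragma offload target(mic:0) signal(&sig) \
    in(A[0:n*n], B[0:n*n]) out(C[0:n*n])
  {
    dgemm("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
  }

  /* ... useful host-side work overlaps with the offload here ... */

  /* Block until the offload tagged with &sig (and its 'out' transfer) completes. */
  #pragma offload_wait target(mic:0) wait(&sig)
}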

Hong-Hyun_P_
Beginner

Thanks Roman for the helpful comments. I'll try the OpenMP approach first, then.
Actually, my program involves a large number of zgemm operations and is parallelized with MPI. The problem is that the number of MPI processes is usually larger than the number of MIC cards, and the matrices are usually smaller than the AO threshold (< 2048). So, could I ask you two more questions?
1. As far as I know, AO is not used for small matrices (< 2048) because of data transfer overhead. Is there any way to control the switching criteria?
2. As far as I know about CAO, we should not let more than one CPU process access an MIC card at the same time because of performance penalties. Is there any workaround for this limitation? I wonder if CAO has a control function like mkl_mic_set_resource_limit() in AO.

Frances_R_Intel
Employee

I noticed that your CAO version is using three times as much memory as your AO version. (I am also curious why you chose to declare alpha and beta as arrays of size 1, but I don't think that is significant.)

Without the rest of your test code, I am unclear how the CAO code is running on both cards at the same time, since the host will wait until the first offload is complete before it continues. Also, I am not sure whether your call to the AO version divides the work between the two coprocessors or whether you are calling zgemm twice, as you do in your CAO version. If the AO version is dividing the work between the two coprocessors, that might be part of the explanation for the difference in memory usage and a big part of the performance difference.

To limit the number of things that could be affecting your results, you might want to use just one coprocessor for now. I suspect the AO version will still be faster. MKL does not simply offload the whole call in the AO version: it can divide the work between coprocessors, do some parts of the work on the host, and/or change the data alignment when it uploads. Unfortunately, I don't know which optimizations it is making in this case. There are, sadly, so many things I just don't know (yet).

Roman_D_Intel1
Employee
(Accepted solution)

Let me answer your questions in reverse order...

Hong-Hyun P. wrote:
2. As far as I know about CAO, we should not let more than one CPU process access an MIC card at the same time because of performance penalties. Is there any workaround for this limitation? I wonder if CAO has a control function like mkl_mic_set_resource_limit() in AO.

I am not aware of any such function. The other (but less flexible) thing you can do is partition the MICs statically by setting the MIC-side KMP_AFFINITY. You would have to know how many ranks you have on a node and how many cores your MIC has, and then divide the MIC cores among the ranks. See the ao_pin.sh script here for an example: https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor

Hong-Hyun P. wrote:
1. As far as I know, AO is not used for small matrices (< 2048) because of data transfer overhead. Is there any way to control the switching criteria?

They are not documented yet :) But the article referenced above mentions them. Please try MKL_MIC_THRESHOLDS_DGEMM=M,N,K to instruct MKL to offload DGEMM only when all sizes are greater than the specified values. For best effect, you also need to use mkl_mic_set_resource_limit() or do the static partitioning mentioned above.

 
Rajiv_D_Intel
Employee

The pragma can be improved in several ways:

1. It is better to align the data on the CPU to a 64-byte boundary rather than using the align(64) modifier in the pragma. Doing so will improve data transfer performance.

2. The variables alpha and beta are being treated as one-element arrays; instead, they should be passed as scalar variables. As written, there is a large overhead of creating and destroying a buffer for each of them on the MIC with every invocation of the pragma.

3. The variables A, B, and C are allocated each time the pragma is executed. This is also a bad idea; they should be allocated once and reused as often as necessary (a sketch combining these points follows below).
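
For illustration, a rough, untested sketch along the lines of points 1-3: the host arrays are assumed to be 64-byte aligned already (e.g. allocated with _mm_malloc), alpha and beta are passed by value, and the coprocessor buffers are allocated on the first call and reused afterwards. The 'firstcall' flag and the function name are made up for the example, and the same host arrays (of unchanged size) must be passed on every call for the buffer reuse to work.

#include <mkl.h>

void cao_mkl_zgemm2(int micid, char *transa, char *transb, int M, int N, int K,
                    MKL_Complex16 alpha, MKL_Complex16 *A, int lda,
                    MKL_Complex16 *B, int ldb,
                    MKL_Complex16 beta, MKL_Complex16 *C, int ldc,
                    int firstcall)
{
  /* alloc_if(firstcall): create the MIC-side buffers only on the first call;
     free_if(0): keep them alive on the card for later calls.
     alpha and beta travel as plain scalars, so no buffers are created for them. */
  #pragma offload target(mic: micid) \
    in(M, N, K, lda, ldb, ldc, alpha, beta) \
    in(transa[0:1], transb[0:1]) \
    in(A[0:(M*K)] : alloc_if(firstcall) free_if(0)) \
    in(B[0:(K*N)] : alloc_if(firstcall) free_if(0)) \
    inout(C[0:(M*N)] : alloc_if(firstcall) free_if(0))
  {
    /* zgemm (Fortran-style interface from mkl.h), equivalent to the zgemm_ call above */
    zgemm(transa, transb, &M, &N, &K, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
  }
}

The first call would pass firstcall=1 and later calls firstcall=0, so the per-call buffer creation from point 2 and the repeated allocation from point 3 both go away.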

Regarding doing multiple offloads to the same MIC concurrently: the 16.0 compiler supports this through the concept of "streams". Besides ordering operations within a stream, this feature lets you conveniently subdivide the MIC and run concurrent offloads in each partition. At present only a single CPU process should offload to a MIC, but in the future we are considering allowing multiple CPU processes to each use an independent "slice" of a MIC, that is, a set of disjoint threads.

Hong-Hyun_P_
Beginner

Thank you all. Regarding AO for zgemm, 'MKL_MIC_THRESHOLDS_ZGEMM=M,N,K' works. I have 16 CPUs and 4 MICs per node, and I got the following results (solve time for a single zgemm operation) for the M=N=K=2000 case:
- 4 MPI processes (4 threads each): with AO 0.43 s, without AO 0.43 s
- 8 MPI processes (2 threads each): with AO 0.54 s, without AO 0.86 s
- 16 MPI processes (1 thread each): with AO 0.81 s, without AO 1.69 s
I'm happy with these results but will need further optimization using CAO. As Frances and Rajiv mentioned, I need to consider the memory and data transfer issues carefully.

Regarding the use of array copies in my sample code: I define static variables to deal with the non-bitwise-copyable data type,

__attribute__((target(mic))) MKL_Complex16 *Amic;  // file-scope pointer visible on the MIC
Amic = (MKL_Complex16*)_mm_malloc(size * sizeof(MKL_Complex16), 64);  // allocate enough memory on the MIC

and then I call the cao_mkl_zgemm() code shown above. I have read an article (https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features) which uses a 'struct', but it is not clear to me why that method is better than my approach. Is there a better way to copy MKL_Complex16 (or complex<double>) data?

Frances_R_Intel
Employee

MKL_Complex16 variables are bitwise copyable, so you could dereference the alpha and beta pointers and pass those values into the offload region. In the example at https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features, one of the members of the structure is a pointer, while in MKL_Complex16 both members are scalars. That is the difference between the two cases.
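
A tiny, untested illustration of this point (the function name is made up; only the dereference-and-copy pattern matters): MKL_Complex16 is a plain struct of two doubles, so a dereferenced value can be copied into the offload region by value.

#include <mkl.h>
#include <stdio.h>

void show_scalar_copy(MKL_Complex16 *alpha)
{
  MKL_Complex16 a = *alpha;            /* dereference once on the host */

  #pragma offload target(mic:0) in(a)  /* 'a' is copied by value; no buffer needed */
  {
    printf("alpha on the coprocessor: %f + %f*i\n", a.real, a.imag);
  }
}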
