In my application, I manually offload part of the computation to the MIC using offload pragmas. The offloaded work includes calls to MKL's double-precision general matrix-matrix multiplication (DGEMM). Work is divided between the host CPU and the MIC based on a performance model. The model relies on DGEMM performance (in GFLOP/s), which I record offline by running a microbenchmark over a range of operand sizes (m, n and k).
Before the actual computation starts, I run a warm-up DGEMM call on the largest operand sizes I will encounter (in my case n = m ~ 10000 and k ~ 200). Even after this warm-up call, I observe that performance is still unexpectedly low for some DGEMM calls.
k0 =2, m 2405 n 903 ,k 192, flop rate 67.2766
k0 =2, m 2405 n 903 ,k 192, flop rate 440.115
k0 =17, m 2422 n 1066 ,k 192, flop rate 67.5244
k0 =17, m 2422 n 1066 ,k 192, flop rate 599.45
k0 =346, m 2812 n 1280 ,k 2, flop rate 1.49697
k0 =346, m 2812 n 1280 ,k 2, flop rate 15.2189
Above are some of the anomalous measurements. m, n and k are the dimensions of the DGEMM call, and the flop rate is in GFLOP/s (k0 is the iteration number, which is irrelevant for the present discussion). Note that I ran each call twice, and the second time the measured flop rate agrees well with the estimated value. However, in the real computation, I may not have the option of running each DGEMM twice.
I am trying to understand what might cause this behaviour. Can this performance anomaly be mitigated by warming up DGEMM with different sizes? If so, what sizes should I run for the warm-up, and what is the minimum number of calls required? (I am presently using trial and error, assuming that the anomaly can be mitigated by performing a series of warm-up calls of suitable sizes.)
(The computation is iterative in nature, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, this may cause a load imbalance between the host CPU and the MIC, which may have a cascading effect on subsequent iterations due to the nature of the computation.)
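To make the cost of a slow call concrete, the logged flop rates can be converted back into elapsed time. The following is only a minimal sketch: it assumes the usual 2*m*n*k operation count for DGEMM, and the function names are hypothetical helpers, not part of MKL.

```c
#include <assert.h>
#include <stdio.h>

/* Floating-point operations in one DGEMM, using the usual 2*m*n*k count. */
double dgemm_flops(int m, int n, int k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Elapsed time in seconds implied by a measured rate in GFLOP/s. */
double dgemm_seconds(int m, int n, int k, double gflops)
{
    assert(gflops > 0.0);
    return dgemm_flops(m, n, k) / (gflops * 1e9);
}

/* Print the time implied by the slow and the fast measurement of one shape. */
void report_pair(int m, int n, int k, double slow_gflops, double fast_gflops)
{
    double t_slow = dgemm_seconds(m, n, k, slow_gflops);
    double t_fast = dgemm_seconds(m, n, k, fast_gflops);
    printf("m %d n %d k %d: %.3f ms vs %.3f ms (%.1fx)\n",
           m, n, k, 1e3 * t_slow, 1e3 * t_fast, t_slow / t_fast);
}
```

For the first pair above, report_pair(2405, 903, 192, 67.2766, 440.115) works out to roughly 12.4 ms for the slow call versus roughly 1.9 ms for the repeated one.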
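The trial-and-error warm-up described above could be organised as a sweep over a coarse grid of shapes. This is only a sketch under stated assumptions: the grid (starting sizes 64 and 2, growth factor 4) is arbitrary, gemm_fn and record_flops are hypothetical names, and whether any such sweep actually removes the anomaly is exactly the open question. In real use the callback would wrap the offloaded DGEMM call; the dry-run callback here only accumulates flop counts, so the sketch does not depend on MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: runs (or merely records) one DGEMM shape. */
typedef void (*gemm_fn)(int m, int n, int k, void *ctx);

/* Dry-run callback: adds the 2*m*n*k flop count to the double behind ctx. */
void record_flops(int m, int n, int k, void *ctx)
{
    *(double *)ctx += 2.0 * (double)m * (double)n * (double)k;
}

/*
 * Call run() on a geometric grid of square shapes (m = n) and k values,
 * then once more on the largest expected shape.
 * Returns the total number of calls made.
 */
int warmup_sweep(gemm_fn run, void *ctx, int max_mn, int max_k)
{
    int calls = 0;
    assert(run != NULL && max_mn > 0 && max_k > 0);
    for (int mn = 64; mn <= max_mn; mn *= 4) {
        for (int k = 2; k <= max_k; k *= 4) {
            run(mn, mn, k, ctx);
            ++calls;
        }
    }
    run(max_mn, max_mn, max_k, ctx);  /* largest expected operands */
    return calls + 1;
}
```

With max_mn = 10000 and max_k = 200 this grid makes 17 calls. Passing a callback keeps the choice of sizes separate from how the DGEMM is launched, which makes it easy to change the grid while experimenting.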
Small values of k will definitely limit the performance of MIC DGEMM. In a relatively naive implementation, the value of k would limit the number of threads. Even though the current MIC DGEMM apparently has a means of using more threads than the value of k, it doesn't seem to be as effective as when k is several times the number of threads.
The recommended Automatic Offload scheme is supposed to keep the DGEMM on the host when m, n, or k aren't large enough to overcome the overhead of offloading.
We have observed a warmup effect in MIC native operation as well. It seemed to be associated with serialization of memory allocation.
I would understand getting 16 GF/s for k=2 (as it is a memory-bandwidth-bound computation and you might utilize only 1/4 of the SIMD width), but not 1.5 GF/s. Coming back to my original question: given that I will encounter many DGEMMs with sizes 0 < m, n < 10000 and 0 < k < 200, what can I do to prevent such anomalous performance?
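The bandwidth-bound argument for small k can be made explicit with a simple flops-per-byte estimate. This is a rough sketch that assumes A and B are each read once and C is read and written once, with 8-byte doubles and no cache effects. The function names are hypothetical, and bw_gbs stands for whatever sustained memory bandwidth is actually measured on the card.

```c
#include <assert.h>

/* Flops per byte of memory traffic for one DGEMM under the stated model. */
double dgemm_intensity(int m, int n, int k)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;
    /* A: m*k doubles, B: k*n doubles, C: m*n doubles read and written. */
    double bytes = 8.0 * ((double)m * k + (double)k * n + 2.0 * (double)m * n);
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Bandwidth-limited ceiling in GFLOP/s for a bandwidth of bw_gbs GB/s. */
double dgemm_bw_bound_gflops(int m, int n, int k, double bw_gbs)
{
    return dgemm_intensity(m, n, k) * bw_gbs;
}
```

For m = 2812, n = 1280, k = 2 this gives about 0.25 flop per byte, so under this model 16 GF/s corresponds to roughly 64 GB/s of sustained traffic, while 1.5 GF/s is about a tenth of that.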