Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Warming up strategy for MIC dgemm call

piyush_s_
Beginner

In my computation, I manually offload some work to the MIC using offload pragmas. The offloaded computation includes a call to MKL's double-precision general matrix-matrix multiplication (DGEMM). Work is divided between the host CPU and the MIC based on a performance model, which relies on DGEMM performance (in Gflop/s) recorded offline by running a microbenchmark over various operand sizes (m, n, and k).
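For reference, the flop rate of one DGEMM is conventionally 2·m·n·k operations divided by the wall time. A minimal sketch of such a measurement, assuming MKL's dsecnd() timer; run_dgemm() is a hypothetical helper standing in for the offloaded call under test:

    #include <mkl.h>   /* dsecnd() */

    /* Time one DGEMM call and return its rate in Gflop/s. */
    double measure_gflops(int m, int n, int k)
    {
        double t0 = dsecnd();
        run_dgemm(m, n, k);   /* hypothetical: the offloaded DGEMM being measured */
        double t1 = dsecnd();
        return 2.0 * m * n * (double)k / ((t1 - t0) * 1e9);
    }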

Before the actual computation starts, I run a warm-up DGEMM call on the largest operand sizes I will encounter in the computation (in my case, m = n ≈ 10000 and k ≈ 200). Even after this warm-up call, I observe that some DGEMM calls still perform unexpectedly poorly.
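For concreteness, a minimal sketch of such a warm-up call, using the offload pragmas mentioned above; MIC_DEV and warmup_dgemm are illustrative names of my own, not fixed API:

    #include <mkl.h>

    #define MIC_DEV 0   /* assumed coprocessor number */

    void warmup_dgemm(int m, int n, int k)
    {
        double *A = (double *)mkl_malloc((size_t)m * k * sizeof(double), 64);
        double *B = (double *)mkl_malloc((size_t)k * n * sizeof(double), 64);
        double *C = (double *)mkl_malloc((size_t)m * n * sizeof(double), 64);
        for (long i = 0; i < (long)m * k; ++i) A[i] = 1.0;
        for (long i = 0; i < (long)k * n; ++i) B[i] = 1.0;

        /* One offloaded DGEMM at the largest expected sizes, so the
         * MIC-side MKL threads and buffers exist before timed work begins. */
        #pragma offload target(mic:MIC_DEV) \
                in(A : length((long)m * k)) in(B : length((long)k * n)) \
                out(C : length((long)m * n))
        {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, 1.0, A, k, B, n, 0.0, C, n);
        }

        mkl_free(A); mkl_free(B); mkl_free(C);
    }

    /* e.g. warmup_dgemm(10000, 10000, 200); before the first iteration */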

k0 = 2,   m = 2405, n = 903,  k = 192, flop rate = 67.2766
k0 = 2,   m = 2405, n = 903,  k = 192, flop rate = 440.115
k0 = 17,  m = 2422, n = 1066, k = 192, flop rate = 67.5244
k0 = 17,  m = 2422, n = 1066, k = 192, flop rate = 599.45
k0 = 346, m = 2812, n = 1280, k = 2,   flop rate = 1.49697
k0 = 346, m = 2812, n = 1280, k = 2,   flop rate = 15.2189

Above are some of the anomalous performance measurements I observed. m, n, and k are the DGEMM dimensions; k0 is the iteration number (irrelevant for the present discussion). Note that I run each call twice, and the second time the measured flop rate agrees nicely with the estimated value. However, in the real computation I may not have the option to run each DGEMM twice.

I am trying to understand what might cause such behaviour. Can such a performance anomaly be mitigated by warming up DGEMM at different sizes? If so, what sizes should I run for the warm-up, and what is the minimum number of calls required? (I am presently trying trial and error, assuming the anomaly can be mitigated by a series of warm-up calls at suitable sizes; see the sketch at the end of this post.)

(The computation is iterative in nature, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, it can cause a load imbalance between the host CPU and the MIC, which may cascade through subsequent iterations due to the nature of the computation.)
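If one big call is not enough, the warm-up series I am experimenting with looks roughly like the following, reusing warmup_dgemm() from the sketch above; the grid of sizes is my own guess, not a recommendation:

    void warmup_series(void)
    {
        /* cover small and large m = n against small and large k,
         * within the expected range 0 < m,n < 10000 and 0 < k < 200 */
        const int mn[] = { 1000, 5000, 10000 };
        const int ks[] = { 2, 64, 200 };
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                warmup_dgemm(mn[i], mn[i], ks[j]);
    }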

TimP
Honored Contributor III

Small values of k will definitely limit the performance of MIC DGEMM. In a relatively naive implementation, the value of k would limit the number of threads. Even though the current MIC DGEMM apparently has a means to use more threads than k, it doesn't seem to be as effective as when k is several times the number of threads.

The recommended Automatic Offload scheme is supposed to keep the DGEMM on the host when m, n, or k is not large enough to overcome the overhead of offloading (a sketch of enabling it follows below).
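For reference, Automatic Offload can be enabled per process with the MKL_MIC_ENABLE=1 environment variable, or from code; a minimal sketch, assuming an MKL version that ships the Automatic Offload interface in mkl.h:

    #include <mkl.h>

    int main(void)
    {
        /* Let MKL decide, per call, whether m, n, k are large enough
         * for the DGEMM to be worth sending to the coprocessor. */
        if (mkl_mic_enable() != 0) {
            /* no coprocessor available; everything stays on the host */
        }
        /* ... cblas_dgemm calls here may now be offloaded automatically ... */
        return 0;
    }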

We have observed a warm-up effect in MIC native operation as well; it seemed to be associated with serialization of memory allocation.
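If per-offload allocation is indeed the cost, one possible mitigation (a sketch under that assumption, with illustrative names) is to allocate card-side buffers once and retain them across offloads using the alloc_if/free_if clauses:

    void first_offload(double *A, long n)
    {
        /* allocate A on the card and keep the allocation after return */
        #pragma offload target(mic:0) in(A : length(n) alloc_if(1) free_if(0))
        { /* ... compute ... */ }
    }

    void later_offload(double *A, long n)
    {
        /* reuse the card-side buffer: data is transferred, but nothing
         * is reallocated or freed */
        #pragma offload target(mic:0) in(A : length(n) alloc_if(0) free_if(0))
        { /* ... compute ... */ }
    }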

piyush_s_
Beginner

I could understand 16 Gflop/s for k = 2 (it is a memory-bandwidth-bound computation, and one might utilize only a quarter of the SIMD width), but not 1.5 Gflop/s. Coming back to my original question: given that I will encounter many DGEMMs with sizes 0 < m, n < 10000 and 0 < k < 200, what can I do to prevent such anomalous performance?
