
piyush_s_

Beginner


09-25-2014
03:40 AM


Warming up strategy for MIC dgemm call

In my computation, I manually offload part of the work to the MIC using offload pragmas. The offloaded work includes a call to MKL's double-precision general matrix-matrix multiplication (dgemm). Work is divided between the host CPU and the MIC based on a performance model. The model relies on DGEMM performance (in Gflop/s), which I record offline by running a microbenchmark over a range of operand sizes (m, n and k).
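To make the performance-model arithmetic concrete, here is a minimal sketch in Python (the arithmetic is language-independent; the real calls would be made from C or Fortran). It assumes the usual convention of counting 2*m*n*k floating-point operations per dgemm call, which the post does not state explicitly. The function names and the split rule in `mic_share` are hypothetical and ignore transfer cost.

```python
def dgemm_gflops(m, n, k, seconds):
    """Rate in Gflop/s, counting 2*m*n*k floating-point operations per call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


def predicted_seconds(m, n, k, gflops):
    """Invert the rate: time one call should take at a tabulated Gflop/s."""
    return 2.0 * m * n * k / (gflops * 1e9)


def mic_share(host_gflops, mic_gflops):
    """Hypothetical split rule: fraction of the work sent to the MIC so that
    host and MIC finish at the same time, ignoring data-transfer cost."""
    return mic_gflops / (host_gflops + mic_gflops)
```

For example, `mic_share(100.0, 300.0)` gives 0.75. The split is only as good as the tabulated rates, so a first call that runs far below its tabulated rate skews the balance.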

Before the actual computation starts, I run a warm-up dgemm call on the largest operand sizes I will encounter in my computation (in my case n = m ~ 10000 and k ~ 200). Even after the warm-up call, I observe that performance is still unexpectedly low for some dgemm calls.

k0 =2, m 2405 n 903 ,k 192, flop rate 67.2766

k0 =2, m 2405 n 903 ,k 192, flop rate 440.115

k0 =17, m 2422 n 1066 ,k 192, flop rate 67.5244

k0 =17, m 2422 n 1066 ,k 192, flop rate 599.45

k0 =346, m 2812 n 1280 ,k 2, flop rate 1.49697

k0 =346, m 2812 n 1280 ,k 2, flop rate 15.2189

Above are some of the anomalous results I observed. m, n and k are the dimensions of the dgemm call (k0 is the iteration number and is irrelevant to this discussion). Note that I ran each call twice, and the second time the measured flop rate agrees well with the estimated value. However, in the real computation I may not have the option of running each dgemm twice.
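The posted rates can be turned back into times to see how much slower each first call was in absolute terms. The sketch below assumes the flop rates are in Gflop/s and were computed as 2*m*n*k divided by the elapsed time; the helper names are illustrative only.

```python
# (m, n, k, first-call Gflop/s, second-call Gflop/s), copied from the post.
RUNS = [
    (2405, 903, 192, 67.2766, 440.115),
    (2422, 1066, 192, 67.5244, 599.45),
    (2812, 1280, 2, 1.49697, 15.2189),
]


def seconds_at(m, n, k, gflops):
    """Elapsed time implied by a Gflop/s rate, assuming 2*m*n*k flops."""
    return 2.0 * m * n * k / (gflops * 1e9)


def first_call_penalty_ms(m, n, k, first_gflops, second_gflops):
    """Extra time of the first call over the second, in milliseconds."""
    slow = seconds_at(m, n, k, first_gflops)
    fast = seconds_at(m, n, k, second_gflops)
    return 1e3 * (slow - fast)


for m, n, k, first, second in RUNS:
    penalty = first_call_penalty_ms(m, n, k, first, second)
    print(f"m={m} n={n} k={k}: second run {second / first:.1f}x faster, "
          f"first call about {penalty:.1f} ms slower")
```

Under that assumption the three first calls are each roughly 9 to 13 ms slower than the repeat, even though the flop counts differ by well over an order of magnitude. That would be consistent with a roughly fixed cost on the first call rather than a slower kernel, though three samples cannot establish it.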

I am trying to understand what might cause this behaviour. Can this performance anomaly be mitigated by warming up dgemm for different sizes? If so, what sizes should I run for the warm-up? What is the minimum number of calls required? (I'm presently using trial and error, assuming that the anomaly can be mitigated by performing a series of warm-up calls of suitable sizes.)
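One way to make the trial and error systematic is to time consecutive calls for each shape and count how many calls run before the time settles. The sketch below is only an illustration of that measurement loop: `naive_gemm` is a pure-Python stand-in for the offloaded dgemm, and the function names and the 20% tolerance are assumptions, not anything taken from MKL.

```python
import time


def naive_gemm(m, n, k):
    """Stand-in kernel: C = A * B with A (m x k) and B (k x n), all ones."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def time_repeated_calls(kernel, shapes, repeats=3):
    """Time `repeats` back-to-back calls per shape; returns {shape: [seconds]}."""
    timings = {}
    for shape in shapes:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*shape)
            samples.append(time.perf_counter() - start)
        timings[shape] = samples
    return timings


def calls_until_settled(samples, tolerance=0.2):
    """Number of calls that ran before the first one whose time is within
    `tolerance` of the fastest sample, i.e. how many warm-up calls were needed."""
    best = min(samples)
    return next(i for i, t in enumerate(samples) if t <= best * (1.0 + tolerance))
```

With times like those implied by the first example in the post, `calls_until_settled([12.4, 1.9, 1.9])` returns 1. Running the shapes in the order the real computation uses them would show whether a warm-up at one size carries over to the next.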

(The computation is iterative in nature, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, this may cause a load imbalance between the host CPU and the MIC, which may have a cascading effect on subsequent iterations due to the nature of the computation.)


2 Replies

TimP

Black Belt


09-25-2014
04:15 AM


Small values of k will definitely limit the performance of MIC DGEMM. In a relatively naive implementation, the value of k would limit the number of threads. Even though the current MIC DGEMM apparently has a way to use more threads than the value of k, it doesn't seem to be as effective as it is when k is several times the number of threads.

The recommended Automatic Offload scheme is supposed to keep the DGEMM on the host when m, n, or k aren't large enough to overcome the overhead of offloading.
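As a rough illustration of what "overcoming the overhead" means, the sketch below compares a host-only time with an offloaded time that includes moving the matrices. This is not MKL's actual Automatic Offload rule: the function names, the rates and the link bandwidth are all hypothetical inputs, and the model assumes A and B are sent to the device and C travels both ways.

```python
def transfer_bytes(m, n, k):
    """Bytes moved for C = alpha*A*B + beta*C in double precision:
    A (m x k) and B (k x n) go to the device, C (m x n) goes both ways."""
    return 8 * (m * k + k * n + 2 * m * n)


def offload_pays_off(m, n, k, host_gflops, mic_gflops,
                     link_gbytes_per_s, latency_s=0.0):
    """True if compute on the MIC plus transfer beats compute on the host."""
    flops = 2.0 * m * n * k
    host_time = flops / (host_gflops * 1e9)
    mic_time = (flops / (mic_gflops * 1e9)
                + transfer_bytes(m, n, k) / (link_gbytes_per_s * 1e9)
                + latency_s)
    return mic_time < host_time
```

With k = 2 the flop count is tiny compared with the size of C, so under almost any plausible inputs the transfer term dominates and the call is better left on the host. For large m, n and k the flop count grows faster than the data volume and offloading wins.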

We have observed a warm-up effect in MIC native operation as well. It seemed to be associated with serialization of memory allocation.

piyush_s_

Beginner


09-25-2014
12:33 PM


