Software Archive
Read-only legacy content

Small dgemm call poisons OpenMP performance

Peter_B_9
Beginner

I've been hunting down a performance problem in my native Phi MKL application and have discovered a surprising culprit. If I use dgemm to calculate the product of two small matrices, the call does something which causes my next OpenMP loop to be very slow. I know that small dgemm calls are inexplicably slow on Phi, but I am very surprised to discover that they're also making subsequent code slow.

I've attached a test case which demonstrates the problem (a rough sketch of the same sequence follows the list below). It does the following steps:

  1. Invoke an empty OpenMP loop, timing how long it takes for each thread to start (warmup)
  2. Invoke a second empty OpenMP loop now that the threads are started and warmed up
  3. Use dgemm to calculate the product of a 72x8 matrix and an 8x8 matrix
  4. Invoke a third empty OpenMP loop <-- problem shows up here
  5. Invoke a fourth OpenMP loop
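
For reference, a rough sketch of the same sequence in C (not the attached file itself; the matrix contents and the omp_get_wtime-based timing are just placeholders):

    #include <stdio.h>
    #include <mkl.h>
    #include <omp.h>

    /* Time one empty parallel region, in microseconds. */
    static double time_empty_region(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        { /* empty body: we only measure the fork/join cost */ }
        return (omp_get_wtime() - t0) * 1e6;
    }

    int main(void)
    {
        double A[72 * 8], B[8 * 8], C[72 * 8];
        for (int i = 0; i < 72 * 8; ++i) A[i] = 1.0;
        for (int i = 0; i < 8 * 8;  ++i) B[i] = 1.0;

        printf("loop 1 (warmup)     : %8.1f us\n", time_empty_region());
        printf("loop 2              : %8.1f us\n", time_empty_region());

        /* 72x8 times 8x8 product, row-major */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    72, 8, 8, 1.0, A, 8, B, 8, 0.0, C, 8);

        printf("loop 3 (after dgemm): %8.1f us\n", time_empty_region());
        printf("loop 4              : %8.1f us\n", time_empty_region());
        return 0;
    }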

The 1st parallel loop is slow (due to warmup), and the 2nd and 4th loops are reasonably fast, completing in less than 70us. However, the third loop (the one right after the dgemm call) takes over 1400us. In my real application I do this in a loop, and the extra milliseconds quickly add up.

Suspicious that the problem might be caused by idle threads going to sleep during the dgemm calculation, I ran with KMP_BLOCKTIME=1000000000 OMP_WAIT_POLICY=active, but saw no difference. Curiously, multiplying all of the dimensions by 100 makes the problem go away. Using gdb, I confirmed that it's not terminating and restarting the OpenMP threads, and I'm having trouble imagining what else it could be doing.

Since MKL dgemm performance for small matrices is unacceptable, I plan to replace these calls with my own implementation. However, I would like to know the answers to two questions:

  1. What is dgemm doing to poison the next OpenMP invocation?
  2. Which other MKL functions have the same problem, so that I know which ones to test or avoid?

(I'm using MKL 2013_sp1.0.080)

5 Replies
TaylorIoTKidd
New Contributor I

Somehow, your post was lost.

One of our OpenMP experts tells me that what could be happening is that the team size is being changed, which is an expensive operation.

Changing the team size has a cost that is amortized over the total time spent in the parallel section. A short section increases the proportion of time spent in that overhead.

Regards
--
Taylor
 

jimdempseyatthecove
Honored Contributor III

Note that a multi-threaded (OpenMP) program should call the single-threaded MKL library, with each application OpenMP thread having its own separate thread context in the call to MKL.

When a multi-threaded (OpenMP) program erroneously calls the multi-threaded MKL library, each application OpenMP thread's call into MKL causes MKL to spawn an additional thread team. Thus if your platform has N hardware threads, you could potentially end up with N*N threads.
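
Something along these lines, as a rough sketch (the C interface is assumed; link the sequential MKL layer, e.g. -mkl=sequential with the Intel compiler, or keep the threaded layer but restrict it with mkl_set_num_threads_local / mkl_set_num_threads):

    #include <mkl.h>
    #include <omp.h>

    /* Sketch: each OpenMP thread runs its own sequential dgemm on its own
     * block of data, so MKL never spawns a second layer of threads.      */
    void blocked_products(const double *A, const double *B, double *C, int nblocks)
    {
        #pragma omp parallel
        {
            mkl_set_num_threads_local(1);  /* keep the threaded MKL layer to one thread per caller */

            #pragma omp for
            for (int b = 0; b < nblocks; ++b)
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            72, 8, 8,
                            1.0, A + b * 72 * 8, 8,
                                 B + b * 8 * 8,  8,
                            0.0, C + b * 72 * 8, 8);
        }
    }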

Jim Dempsey

 

TaylorIoTKidd
New Contributor I

OK, here is a little more detail.

The dgemm cost results from an optimization of Intel's OpenMP implementation. Yes, I know that sounds a little strange.

What is happening is that the OpenMP implementation is doing what amounts to a "lazy" release of the team. To say it another way, the release of the team needed to execute the dgemm call does not happen at the end of the call, but is deferred to the beginning of the next OpenMP parallel section. This is why you see the impact after you leave the MKL call.

The logic behind this is that a parallel section is often entered many times; an example is when it is embedded in a for/do loop. Since this is a very common use case, it makes sense to defer release of the team to the next section. At that point, if the team size is identical, you save the cost of releasing the old team and allocating a new one.
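
For illustration, the pattern the optimization is aimed at looks something like this (the function and variable names are made up):

    #include <omp.h>

    /* The common case: the same-sized team is reused on every trip through
     * the outer serial loop, so releasing it at the end of each parallel
     * region and rebuilding it immediately afterwards would be wasted work. */
    void integrate(double *x, const double *v, int n, int nsteps, double dt)
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                x[i] += dt * v[i];
        }
    }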

Regards
--
Taylor
 

TimP
Honored Contributor III
Would Taylor's explanation imply that KMP_BLOCKTIME should be reduced while working on the small matrices? Evidently only a small team is needed.
James_C_Intel2
Employee

I don't believe that KMP_BLOCKTIME is relevant here. The issue is not how long threads wait before going to sleep, but the cost of building and destroying the data structures in the OpenMP runtime that depend on the team size.
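
If that is right, keeping the team size constant across the dgemm call and the surrounding parallel regions should sidestep the rebuild; an untested sketch of that idea:

    #include <mkl.h>
    #include <omp.h>

    /* Untested sketch: ask MKL to use the same number of threads as the
     * surrounding OpenMP regions, so the runtime does not tear down and
     * rebuild its team-size-dependent structures around the dgemm call.  */
    void match_team_sizes(void)
    {
        mkl_set_num_threads(omp_get_max_threads());
    }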
