
dgetrf performance drop above 10000

Greg_C_
Beginner

Hi,

I'm doing performance testing using MKL composer_xe_2013_sp1.2.144 with the icc 14.0.2 20140120 compiler and randomly generated matrices.

When the matrix size gets above 10000, I notice a performance drop of around 100 GFlops (size 9217 performs at 211 GFlops, but 10241 only reaches 109).

I tried using the align option to align the data on 64-byte boundaries; this gave a general speed-up and reduced the drop to around 50 GFlops, but the drop is still there.

Above 10241 the performance increases again, but at around 15000 it is still below the pre-drop level.
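
For context, the test driver is roughly as sketched below (a simplified, hypothetical version, not my exact code): one random n x n double matrix, 64-byte aligned via mkl_malloc, a single timed LAPACKE_dgetrf call, and a GFlops estimate from the usual 2/3*n^3 LU flop count.

/* Simplified benchmark sketch (illustrative, not the exact test code):
 * factor one random n x n double matrix with LAPACKE_dgetrf, with the
 * data 64-byte aligned via mkl_malloc, and report an estimated GFlops
 * figure based on the 2/3*n^3 LU flop count. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 10241;   /* one of the sizes where the drop shows up */
    double  *a    = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    MKL_INT *ipiv = (MKL_INT *)mkl_malloc((size_t)n * sizeof(MKL_INT), 64);
    if (!a || !ipiv) return 1;

    for (size_t i = 0; i < (size_t)n * n; ++i)
        a[i] = (double)rand() / RAND_MAX;       /* random test matrix */

    double t0 = dsecnd();
    MKL_INT info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
    double t1 = dsecnd();

    double gflops = (2.0 / 3.0) * (double)n * n * n / (t1 - t0) / 1e9;
    printf("n=%d info=%d time=%.3f s est. %.1f GFlops\n",
           (int)n, (int)info, t1 - t0, gflops);

    mkl_free(ipiv);
    mkl_free(a);
    return 0;
}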

Does anyone have any insight as to why this might be happening?

 

 

jimdempseyatthecove
Honored Contributor III

Can you please provide enough information for us to help you with your problem?

Define what your matrix is (sizeof REAL, number of dimensions, is it real or complex)
Define the number of matrices involved
Define the general operations
Describe how you partition the work amongst threads

When you see a precipitous drop, the usual suspects are

Exceeding cache capacity (you see this as you exceed each level)
False sharing evictions
Memory bandwidth exceeded

>>Above 10241 the performance increases again, but at around 15000 it is still below the pre-drop level

This is suggestive of false sharing issues.
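
To illustrate what false sharing looks like, here is a small generic example, unrelated to MKL internals and assuming 64-byte cache lines: two threads accumulate into neighbouring array elements that share a cache line, and padding each accumulator to its own line removes the contention.

/* Generic false-sharing illustration (assumes 64-byte cache lines).
 * In sum_shared_line() the two per-thread accumulators sit in the same
 * cache line, so every write by one core invalidates the other core's
 * copy. In sum_padded() each accumulator is padded to its own line. */
#include <omp.h>

struct padded { double v; char pad[64 - sizeof(double)]; };

double sum_shared_line(const double *x, long n)
{
    double partial[2] = {0.0, 0.0};          /* adjacent: same cache line */
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; ++i)
            partial[t] += x[i];              /* line ping-pongs between cores */
    }
    return partial[0] + partial[1];
}

double sum_padded(const double *x, long n)
{
    struct padded partial[2] = {{0.0}, {0.0}};
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; ++i)
            partial[t].v += x[i];            /* each accumulator on its own line */
    }
    return partial[0].v + partial[1].v;
}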

Many of these issues can be resolved with problem tiling (partition large problem into multiple smaller problems).
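
As a generic illustration of the tiling idea (a naive blocked matrix multiply, not how MKL implements dgetrf), the loop nest below works on BLK x BLK tiles so the data being worked on stays cache resident:

/* Generic tiling illustration (not MKL's dgetrf): compute C += A * B for
 * n x n row-major matrices in BLK x BLK tiles, so the three tiles being
 * worked on fit in cache instead of streaming whole rows and columns. */
#include <stddef.h>

#define BLK 128   /* tile edge; tune so ~3*BLK*BLK doubles fit in cache */

void tiled_gemm(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                /* multiply one pair of tiles and accumulate into C */
                for (int i = ii; i < ii + BLK && i < n; ++i)
                    for (int k = kk; k < kk + BLK && k < n; ++k) {
                        double aik = A[(size_t)i * n + k];
                        for (int j = jj; j < jj + BLK && j < n; ++j)
                            C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                    }
}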

Jim Dempsey
 

Greg_C_
Beginner

Hi Jim,

In answer to your questions:

Define what your matrix is (sizeof REAL, number of dimensions, is it real or complex)

The matrix is of type double, stored as a 1-dimensional array, and real.

Define the number of matrices involved

Just the one (excluding the permutation vector, which is a 1-dimensional array of int).

Define the general operations

The only operation performed is the dgetrf call.

Describe how you partition the work amongst threads

I use the following environment settings:

OMP_NUM_THREADS=32
KMP_AFFINITY=compact,granularity=fine

The drop occurs when testing dgetrf:

  • 100% on the host
  • automatically offloaded
  • 100% offloaded onto the Phi

However, when I use the -mmic option to compile natively for the Phi and run using 240 threads, there is no drop in performance (see the attached graph showing estimated GFlops vs matrix size).
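
For reference, the host-only and auto-offloaded cases are switched roughly as below; the call names are how I read the MKL Automatic Offload documentation, so treat them as an assumption and check your MKL version (the 100%-on-Phi case comes from the work-division controls or the native -mmic build, not shown):

/* Rough sketch of switching MKL Automatic Offload on and off
 * (mkl_mic_enable/mkl_mic_disable as described in the MKL MIC docs;
 * the MKL_MIC_ENABLE environment variable is the equivalent switch).
 * Treat the exact call names as an assumption for your MKL version. */
#include <mkl.h>

void configure_offload(int use_auto_offload)
{
    if (use_auto_offload)
        mkl_mic_enable();    /* let MKL split dgetrf between host and Phi */
    else
        mkl_mic_disable();   /* keep the whole factorization on the host */
}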
