Hi,
I'm doing performance testing using MKL composer_xe_2013_sp1.2.144 with the icc 14.0.2 20140120 compiler and randomly generated matrices.
When the matrix size gets above 10000, I notice a performance drop of around 100 GFlops (size 9217 performs at 211 GFlops, but 10241 only reaches 109).
I tried using the align option to align the data to 64-byte boundaries; this gave a general speed-up and reduced the drop to around 50 GFlops, but the drop is still there.
Above 10241 the performance increases again, but at around 15000 it is still below the pre-drop level.
Does anyone have any insight into why this might be happening?
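(For anyone wanting to reproduce: the alignment I mean is of the buffers themselves, along these lines. This is a minimal sketch, not my exact code; mkl_malloc is MKL's aligned allocator, and _mm_malloc or posix_memalign would work as well.)

```c
#include <mkl.h>
#include <stddef.h>

/* Allocate an n-by-n matrix of doubles on a 64-byte boundary.
   mkl_malloc is MKL's aligned allocator; pair it with mkl_free. */
double *alloc_matrix(size_t n)
{
    return (double *)mkl_malloc(n * n * sizeof(double), 64);
}
```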
Can you please provide enough information for us to help you with your problem?
- Define what your matrix is (sizeof REAL, number of dimensions, is it real or complex)
- Define the number of matrices involved
- Define the general operations
- Describe how you partition the work amongst threads
When you see a precipitous drop, the usual suspects are:
- Exceeding cache capacity (you see this as you exceed each level)
- False sharing evictions
- Memory bandwidth exceeded
>> above 10241 the performance increases but at around 15000 is still below the pre-drop level
This is suggestive of false sharing issues.
Many of these issues can be resolved with problem tiling (partitioning the large problem into multiple smaller problems), as sketched below.
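A generic illustration of the tiling pattern (a sketch, not specific to dgetrf; the tile size is illustrative and should be sized so a block fits in your cache):

```c
#include <stddef.h>

#define TILE 128  /* illustrative: a TILE x TILE block of doubles is 128 KB */

/* Walk an n-by-n matrix in TILE x TILE blocks so each block stays
   cache-resident while it is being worked on. */
void scale_tiled(double *a, size_t n, double s)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t imax = ii + TILE < n ? ii + TILE : n;
            size_t jmax = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < imax; ++i)
                for (size_t j = jj; j < jmax; ++j)
                    a[i * n + j] *= s;  /* the "work" on this tile */
        }
}
```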
Jim Dempsey
Hi Jim,
In answer to your questions:
Define what your matrix is (sizeof REAL, number of dimensions, is it real or complex)
The matrix is sizeof(double), 1-dimensional (stored as a flat array) and real.
Define the number of matrices involved
Just the one (excluding the permutation vector, which is a 1-dimensional array of sizeof(int)).
Define the general operations
The only operation performed is the dgetrf one (a sketch of the call pattern is at the end of this post).
Describe how you partition the work amongst threads
I use the following environment settings:
OMP_NUM_THREADS=32
KMP_AFFINITY=compact,granularity=fine
The drop occurs when testing dgetrf:
- 100% on the host
- auto offloaded
- 100% offloaded onto the Phi
But when I use the -mmic option to compile directly for the Phi and run using 240 threads, there is no performance drop (see the attached graph of estimated GFlops vs. matrix size).
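For reference, the call pattern is roughly the following (a sketch assuming the LAPACKE C interface and MKL's dsecnd() timer, not my exact harness; 10241 is just one of the tested sizes, and warm-up runs are omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>
#include <mkl_lapacke.h>

/* Sketch of one timing run; launched with e.g.
   OMP_NUM_THREADS=32 KMP_AFFINITY=compact,granularity=fine ./a.out */
int main(void)
{
    lapack_int n = 10241;  /* a size where the drop shows up */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    lapack_int *ipiv = (lapack_int *)mkl_malloc(n * sizeof(lapack_int), 64);
    if (!A || !ipiv) return 1;

    /* fill the matrix with random values */
    for (size_t i = 0; i < (size_t)n * n; ++i)
        A[i] = (double)rand() / RAND_MAX;

    double t0 = dsecnd();
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
    double t1 = dsecnd();

    /* LU factorization costs roughly (2/3) n^3 flops */
    double gflops = (2.0 / 3.0) * (double)n * n * n / (t1 - t0) / 1e9;
    printf("info=%d  time=%.3f s  ~%.1f GFlops\n", (int)info, t1 - t0, gflops);

    mkl_free(ipiv);
    mkl_free(A);
    return 0;
}
```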
