Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Linpack: Why do we run with LDA greater than the problem size N?

Adrian_C_
Beginner

I am new to running Linpack.  From the extended help, I see:

The leading dimension must be no less than the number of equations. Experience has shown that the best performance for a given problem size is obtained when the leading dimension is set to the nearest odd multiple of 8 (16 for Intel(R) Itanium(R) 2 processors) equal to or larger than the number of equations (divisible by 8 but not by 16, or divisible by 16 but not 32 for Intel(R) Itanium(R) 2 processors).
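The rule quoted above (ignoring the Itanium-specific variant) can be sketched as a small helper; the function name `pad_lda` is ours, not from the Linpack documentation:

```python
def pad_lda(n):
    """Smallest 'odd multiple of 8' (divisible by 8 but not 16) >= n,
    per the Linpack extended help quoted above."""
    lda = -(-n // 8) * 8      # round n up to a multiple of 8
    if lda % 16 == 0:         # even multiple of 8: bump to the next odd one
        lda += 8
    return lda

print(pad_lda(1000))  # 1000 (already an odd multiple of 8: 125 * 8)
print(pad_lda(1024))  # 1032 (1024 = 128 * 8 is an even multiple, so pad by 8)
```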

Can someone explain to me why this is the case?  Thanks!

4 Replies
mecej4
Honored Contributor III

On modern CPUs, performance is quite sensitive to correct cache usage. There are, in fact, separate instructions for aligned and unaligned memory accesses. If data structures and algorithmic patterns cause many unaligned accesses, performance will drop.

The penalty for unaligned access was particularly high on the Itanium. If your documentation devotes much space to Itanium, it is probably old, and its prescriptions may not apply to current processors.

McCalpinJohn
Honored Contributor III

The execution time of optimized versions of the LINPACK benchmark is dominated by execution of the DGEMM routine.  In its standard form, the DGEMM routine has to transpose one of its input arrays.  With the usual dense array storage format, this transposition requires that you gather array elements that are separated by a stride of LDA (times sizeof(double)).   If LDA is very close to an integral multiple of a power of 2, the elements being gathered will all be mapped to the same "congruence class" in the L1 cache (and often also in the L2 cache).  This limits the number of entries that can be held to the associativity of the cache (typically 8), which in turn makes it very difficult to keep the desired data in the cache.  If LDA is made slightly larger, then the elements that are gathered will map to different "congruence classes" in the L1 and L2 caches, so the cache is much more effective as a buffer for holding the data for the array that is accessed in the transposed order.

There are other performance problems with accessing elements that are separated by multiples of powers of 2, but cache associativity conflicts are probably the most important.
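The congruence-class effect can be illustrated with a toy model. The cache geometry and LDA values below are illustrative assumptions, not from the post: a typical 32 KiB, 8-way L1 data cache with 64-byte lines has 64 sets, and the set index of each gathered element is just its byte address divided by the line size, modulo the number of sets.

```python
LINE_BYTES = 64
NUM_SETS = 64            # 32 KiB / (8 ways * 64 B per line)

def sets_touched(lda, nelem=64):
    """Distinct L1 sets hit when gathering nelem doubles a stride of lda apart."""
    return len({(i * lda * 8 // LINE_BYTES) % NUM_SETS for i in range(nelem)})

print(sets_touched(1024))  # 1  -- power-of-2 stride: every element lands in one set
print(sets_touched(1032))  # 64 -- slightly larger LDA spreads elements over all sets
```

With only 8 ways per set, the LDA = 1024 case can hold at most 8 of the 64 gathered elements in the L1 at once, which is exactly the associativity conflict described above.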

Adrian_C_
Beginner

Thank you!


McCalpin, John wrote:

The execution time of optimized versions of the LINPACK benchmark is dominated by execution of the DGEMM routine.  In its standard form, the DGEMM routine has to transpose one of its input arrays.  With the usual dense array storage format, this transposition requires that you gather array elements that are separated by a stride of LDA (times sizeof(double)).   If LDA is very close to an integral multiple of a power of 2, the elements being gathered will all be mapped to the same "congruence class" in the L1 cache (and often also in the L2 cache).  This limits the number of entries that can be held to the associativity of the cache (typically 8), which in turn makes it very difficult to keep the desired data in the cache.  If LDA is made slightly larger, then the elements that are gathered will map to different "congruence classes" in the L1 and L2 caches, so the cache is much more effective as a buffer for holding the data for the array that is accessed in the transposed order.

There are other performance problems with accessing elements that are separated by multiples of powers of 2, but cache associativity conflicts are probably the most important.

Adrian_C_
Beginner

Thank you!


mecej4 wrote:

On modern CPUs, performance is quite sensitive to correct cache usage. There are, in fact, separate instructions for aligned and unaligned memory accesses. If data structures and algorithmic patterns cause many unaligned accesses, performance will drop.

The penalty for unaligned access was particularly high on the Itanium. If your documentation devotes much space to Itanium, it is probably old, and its prescriptions may not apply to current processors.
