Re: Efficient Arrays

Dishaw__Jim · 05-31-2006

According to Intel's documentation:

With multidimensional arrays where access to array elements will be noncontiguous, avoid leftmost array dimensions that are a power of two (such as 256, 512).

Since cache sizes are a power of 2, array dimensions that are also a power of 2 may make inefficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably make less efficient use of the cache. This does not apply to contiguous sequential access or whole array access.

One work-around is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:

I think I am having a dense moment here, because I cannot see why an array with a power-of-2 dimension would make less efficient use of the cache than one whose dimension is not a power of 2.

My next question is about the implementation of array sections in IVF. Does an array section result in a temporary copy sometimes, always, or never, or does it depend? Consider the following example:

FORALL(i=1:N) D(:,i) = MATMUL(A(:,:,E(i)), B(:,i)) + C(:,i)

Would A(:,:,E(i)) result in a temporary array being created? How about B(:,i) and C(:,i)? Would it be better to avoid using MATMUL in this case and implement it as loops?

Thanks

Jugoslav_Dujic · 05-31-2006

Re power-of-two: I agree this is counter-intuitive (and I admit that I don't quite get the explanation), but here's one. I'll let the compiler people comment on the second question.
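To make the power-of-two effect concrete, here is a minimal sketch of the set-mapping arithmetic. It assumes a hypothetical set-associative cache with 64-byte lines and 64 sets (illustrative values, not those of any particular Intel CPU), 4-byte REAL elements, and a walk along the second dimension of an array declared A(ld, *), so that consecutive accesses are ld elements apart. The names cache_sets and distinct_sets are made up for this example.

```fortran
program cache_sets
  implicit none
  ! Hypothetical cache geometry, chosen only for illustration.
  integer, parameter :: line_bytes = 64   ! bytes per cache line
  integer, parameter :: num_sets   = 64   ! number of sets in the cache
  integer, parameter :: elem_bytes = 4    ! default REAL assumed to be 4 bytes
  integer, parameter :: n_access   = 64   ! accesses A(1,1), A(1,2), ..., A(1,64)

  print '(a,i0)', 'distinct sets, leading dimension 512: ', distinct_sets(512)
  print '(a,i0)', 'distinct sets, leading dimension 520: ', distinct_sets(520)

contains

  ! Count how many different cache sets are touched when walking
  ! A(1,j), j = 1..n_access, for an array declared A(ld, *).
  integer function distinct_sets(ld) result(count_used)
    integer, intent(in) :: ld
    logical :: used(0:num_sets-1)
    integer :: j, byte_offset, set_index

    used = .false.
    do j = 1, n_access
      byte_offset = (j - 1) * ld * elem_bytes        ! offset of A(1,j) from A(1,1)
      set_index   = mod(byte_offset / line_bytes, num_sets)
      used(set_index) = .true.
    end do
    count_used = count(used)
  end function distinct_sets

end program cache_sets
```

With these assumed values, a leading dimension of 512 gives a stride of 2048 bytes, so all 64 accesses land in just 2 sets and keep evicting each other once the associativity is exceeded. A leading dimension of 520 gives a stride of 2080 bytes, which spreads the same accesses over all 64 sets.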

Steven_L_Intel1 · 05-31-2006

The pointer Jugoslav gave is a good one.

The answer to the second question is "it depends". The compiler tries to eliminate unnecessary array copies but there are cases it can miss. The best you can do is try it and perhaps examine the assembly code to see if a copy is made. The compiler does have a lot of smarts about MATMUL - I don't recommend coding that yourself.
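One way to "try it" is a small test program along the following lines. This is only a sketch under stated assumptions: the array sizes, the contents of E, and the program name are hypothetical, and a plain DO loop stands in for the FORALL from the question to keep the generated code easier to read. Sections such as A(:,:,E(i)) and B(:,i) fix only trailing subscripts, so they are contiguous in Fortran's column-major layout. Whether a given compiler version still creates a temporary is something to confirm from the assembly or a profile, not something this code proves.

```fortran
program matmul_sections
  implicit none
  integer, parameter :: m = 4, n = 3, k = 2   ! hypothetical sizes
  real    :: A(m, m, k), B(m, n), C(m, n), D(m, n), D_ref(m, n)
  integer :: E(n), i, r, col

  call random_number(A)
  call random_number(B)
  call random_number(C)
  E = [1, 2, 1]          ! which slice of A each column uses

  ! Intrinsic version, one column of D per iteration.
  do i = 1, n
    D(:, i) = matmul(A(:, :, E(i)), B(:, i)) + C(:, i)
  end do

  ! Hand-written reference loops, used here only to check the result.
  do i = 1, n
    do r = 1, m
      D_ref(r, i) = C(r, i)
      do col = 1, m
        D_ref(r, i) = D_ref(r, i) + A(r, col, E(i)) * B(col, i)
      end do
    end do
  end do

  ! The two versions may differ by rounding because the summation
  ! order is not guaranteed to match.
  print *, 'max abs difference:', maxval(abs(D - D_ref))
end program matmul_sections
```

Timing both versions at realistic sizes, and inspecting the assembly for the MATMUL loop, would show whether the hand-written form gains anything on a given compiler.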

Dishaw__Jim · 06-19-2006

This is a "tricky" issue and the Intel documentation is contradictory. The MKL documentation states:

To obtain the best performance with Intel MKL, make sure the following conditions are met:

arrays are aligned on a 16-byte boundary
leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16
for two-dimensional arrays, leading dimension values divisible by 2048 are avoided

This appears to contradict the previously posted documentation, which says to avoid powers of 2. If I recall correctly, all Intel CPUs after the 486 are quad-byte aligned; wouldn't a 32-byte boundary be better?

Steven_L_Intel1 · 06-19-2006

It's not contradictory. 16-byte alignment is the most that is needed, and helps when the compiler could generate SSE2/3 instructions. You want to avoid a leading dimension that is a larger power of 2 in bytes (something like 16 is not a problem).
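The two rules can be checked together for a candidate leading dimension. The sketch below assumes 8-byte DOUBLE PRECISION elements and tests the leading dimension in bytes against the two MKL conditions quoted earlier in the thread: divisible by 16, and not divisible by 2048. The program and subroutine names are hypothetical.

```fortran
program check_leading_dim
  implicit none
  integer, parameter :: elem_bytes = 8   ! assumes 8-byte DOUBLE PRECISION elements

  call report(512)   ! 4096 bytes: divisible by 16, but also by 2048
  call report(520)   ! 4160 bytes: divisible by 16, not by 2048
  call report(515)   ! 4120 bytes: not divisible by 16

contains

  ! Print the leading dimension in bytes and whether it is a
  ! multiple of 16 and of 2048.
  subroutine report(ld)
    integer, intent(in) :: ld
    integer :: ld_bytes
    ld_bytes = ld * elem_bytes
    print '(a,i0,a,i0,a,l1,a,l1)', 'ld=', ld, ' bytes=', ld_bytes, &
          ' multiple_of_16=', mod(ld_bytes, 16) == 0, &
          ' multiple_of_2048=', mod(ld_bytes, 2048) == 0
  end subroutine report

end program check_leading_dim
```

Under these assumptions 520 is the only one of the three values that satisfies both conditions, which matches the 512-to-520 padding suggested in the compiler documentation.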