Intel® Fortran Compiler

Efficient Arrays

Dishaw__Jim
Beginner
According to Intel's documentation:

With multidimensional arrays where access to array elements will be noncontiguous, avoid leftmost array dimensions that are a power of two (such as 256, 512).

Since cache sizes are a power of 2, array dimensions that are also a power of 2 may make inefficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably make less efficient use of the cache. This does not apply to contiguous sequential access or whole array access.
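
The effect described above can be illustrated with a little address arithmetic. The sketch below is not from Intel's documentation; it assumes a hypothetical set-associative cache with 64-byte lines and 64 sets, 4-byte REAL elements, and an array that starts on a cache-line boundary. It counts how many distinct cache sets are touched when walking along a row, so that consecutive accesses are one full column apart.

```fortran
program cache_sets
  implicit none
  ! Hypothetical cache geometry, chosen only for illustration:
  ! 64-byte lines and 64 sets (for example 32 KiB, 8-way set associative).
  integer, parameter :: line_bytes = 64, num_sets = 64
  integer, parameter :: elem_bytes = 4      ! assumes default REAL is 4 bytes
  integer, parameter :: n_access = 64       ! walk A(1,1), A(1,2), ..., A(1,64)

  print '(a,i0)', 'distinct sets, leading dimension 512: ', distinct_sets(512)
  print '(a,i0)', 'distinct sets, leading dimension 520: ', distinct_sets(520)

contains

  ! Count the different cache sets touched when stepping along a row,
  ! i.e. with a stride of ld elements between consecutive accesses.
  integer function distinct_sets(ld) result(count_used)
    integer, intent(in) :: ld
    logical :: used(0:num_sets - 1)
    integer :: j, byte_offset, set_index

    used = .false.
    do j = 0, n_access - 1
      byte_offset = j * ld * elem_bytes                    ! offset of A(1,j+1) from A(1,1)
      set_index = mod(byte_offset / line_bytes, num_sets)  ! cache set this line maps to
      used(set_index) = .true.
    end do
    count_used = count(used)
  end function distinct_sets

end program cache_sets
```

Under these assumptions a leading dimension of 512 gives a stride of 2048 bytes, which is exactly 32 cache lines, so the 64 accesses land in only 2 of the 64 sets and compete for the few ways those sets have. With 520 the stride is 2080 bytes, which is not a whole number of half-cache multiples, and the same accesses spread over all 64 sets. Real processors differ in line size, set count, and associativity, so the numbers are illustrative only.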



One work-around is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:
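
The code that followed this sentence in the documentation is not reproduced in the post. As a stand-in, here is a minimal sketch of the padding idea; the array name, the column count of 100, and the row-wise loop are assumptions made for illustration, not the documentation's own example.

```fortran
program padded_array
  implicit none
  integer, parameter :: n = 512        ! elements actually used in the first dimension
  integer, parameter :: ld = 520       ! padded leftmost dimension; rows 513..520 stay unused
  integer, parameter :: ncols = 100    ! hypothetical second dimension
  real :: a(ld, ncols)
  integer :: i, j
  real :: row_sum

  a = 0.0
  a(1:n, :) = 1.0                      ! only the first n rows hold real data

  ! Noncontiguous access: for a fixed row i, consecutive j are ld elements apart.
  do i = 1, n
    row_sum = 0.0
    do j = 1, ncols
      row_sum = row_sum + a(i, j)
    end do
    if (row_sum /= real(ncols)) error stop 'unexpected row sum'
  end do
  print *, 'all row sums equal', ncols
end program padded_array
```

Only the declaration changes; the loops still run over 1..n, so the padding costs a little memory but no extra work.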



I think I am having a dense moment here, because I cannot see why a power of 2 sized array would make less efficient use of the cache than a non-power of 2 sized array.

My next question concerns how IVF implements array sections. Does an array section result in a temporary copy sometimes, always, or never, or does it depend? Consider the following example:

FORALL(i=1:N) D(:,i) = MATMUL(A(:,:,E(i)), B(:,i)) + C(:,i)

Would A(:,:,E(i)) result in a temporary array being created? How about B(:,i) and C(:,i)? Would it be better to avoid using MATMUL in this case and implement it as loops?

Thanks
4 Replies
Jugoslav_Dujic
Valued Contributor II
Re power-of-two: I agree this is counter-intuitive (and I admit that I don't quite get the explanation), but here's one. I'll let the compiler people comment on the second question.
Steven_L_Intel1
Employee
The pointer Jugoslav gave is a good one.

The answer to the second question is "it depends". The compiler tries to eliminate unnecessary array copies but there are cases it can miss. The best you can do is try it and perhaps examine the assembly code to see if a copy is made. The compiler does have a lot of smarts about MATMUL - I don't recommend coding that yourself.
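
One way to "try it" is to wrap the statement from the question in a small test program and compare it against hand-written loops. The sketch below is only a test harness under assumed sizes (m, n, k and the contents of E are made up); it checks that both forms agree and gives something small to compile and inspect. Note that in Fortran's column-major layout A(:,:,E(i)), B(:,i), and C(:,i) are all contiguous sections, so in principle no copy is needed, but whether the compiler actually avoids one has to be checked in the generated code (for example via an assembly listing such as the one `-S` produces on Linux).

```fortran
program section_demo
  implicit none
  ! Sizes are arbitrary, chosen only to keep the test small.
  integer, parameter :: m = 4, n = 3, k = 2
  real :: a(m, m, k), b(m, n), c(m, n), d(m, n), d_ref(m, n)
  integer :: e(n), i, r, s

  call random_number(a)
  call random_number(b)
  call random_number(c)
  e = [1, 2, 1]                        ! hypothetical index vector selecting slabs of a

  ! The statement from the question, using array sections and MATMUL.
  forall (i = 1:n) d(:, i) = matmul(a(:, :, e(i)), b(:, i)) + c(:, i)

  ! The same computation as explicit loops, with the innermost loop at stride 1.
  do i = 1, n
    d_ref(:, i) = c(:, i)
    do s = 1, m
      do r = 1, m
        d_ref(r, i) = d_ref(r, i) + a(r, s, e(i)) * b(s, i)
      end do
    end do
  end do

  if (maxval(abs(d - d_ref)) > 1.0e-5) error stop 'FORALL and loop results differ'
  print *, 'max abs difference:', maxval(abs(d - d_ref))
end program section_demo
```

Timing both forms at realistic sizes, rather than at these toy sizes, is what would show whether hand-coding gains anything over MATMUL.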
Dishaw__Jim
Beginner
This is a "tricky" issue and the Intel documentation is contradictory. The MKL documentation states:

To obtain the best performance with Intel MKL, make sure the following conditions are met:

  • arrays are aligned on a 16-byte boundary
  • leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16
  • for two-dimensional arrays, leading dimension values divisible by 2048 are avoided

This appears to contradict the previously posted documentation, which says to avoid powers of 2. If I recall correctly, all Intel CPUs after the 486 are quad-byte aligned, so wouldn't a 32-byte boundary be better?
Steven_L_Intel1
Employee
It's not contradictory. 16-byte alignment is the most that is needed, and helps when the compiler could generate SSE2/3 instructions. You want to avoid a leading dimension that is a larger power of 2 in bytes (something like 16 is not a problem.)
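
The two rules can be combined mechanically. The following is a rough sketch, not an MKL or compiler facility: `padded_ld` is a hypothetical helper that assumes 8-byte elements, rounds a requested leading dimension up so that its size in bytes is divisible by 16, and then adds one more 16-byte step if the result is divisible by 2048 bytes.

```fortran
program pad_leading_dim
  implicit none
  integer, parameter :: elem_bytes = 8   ! assumes double precision; must divide 16

  print '(a,i0)', 'requested 512 -> ', padded_ld(512)   ! 4096 bytes is a multiple of 2048, so pad to 514
  print '(a,i0)', 'requested 500 -> ', padded_ld(500)   ! 4000 bytes already satisfies both rules: 500
  print '(a,i0)', 'requested 513 -> ', padded_ld(513)   ! rounded up to a 16-byte multiple: 514

contains

  ! Hypothetical helper: smallest leading dimension >= n whose byte size is
  ! divisible by 16 but not by 2048.
  pure integer function padded_ld(n) result(ld)
    integer, intent(in) :: n
    integer :: per16

    per16 = 16 / elem_bytes                        ! elements per 16 bytes
    ld = ((n + per16 - 1) / per16) * per16         ! round up to a 16-byte multiple
    if (mod(ld * elem_bytes, 2048) == 0) ld = ld + per16
  end function padded_ld

end program pad_leading_dim
```

The array would then be declared with `padded_ld(n)` rows while the computation uses only the first n, as in the 512 to 520 work-around quoted at the top of the thread.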