According to Intel's documentation:
With multidimensional arrays where access to array elements will be noncontiguous, avoid leftmost array dimensions that are a power of two (such as 256, 512).
Since the cache sizes are a power of 2, array dimensions that are also a power of 2 may make inefficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably use the cache less efficiently. This does not apply to contiguous sequential access or whole array access.
One work-around is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:
I think I am having a dense moment here, because I cannot see why a power of 2 sized array would make less efficient use of the cache than a non-power of 2 sized array.
My next question is about how IVF implements array sections. Does an array section result in a temporary copy sometimes, always, or never, or does it depend? Consider the following example:
FORALL(i=1:N) D(:,i) = MATMUL(A(:,:,E(i)), B(:,i)) + C(:,i)
Would A(:,:,E(i)) result in a temporary array being created? How about B(:,i) and C(:,i)? Would it be better to avoid using MATMUL in this case and implement it as loops?
Thanks
4 Replies
Re power-of-two: I agree this is counter-intuitive (and I admit that I don't quite get the explanation), but here's one. I'll let the compiler people comment on the second question.
The pointer Jugoslav gave is a good one.
The answer to the second question is "it depends". The compiler tries to eliminate unnecessary array copies but there are cases it can miss. The best you can do is try it and perhaps examine the assembly code to see if a copy is made. The compiler does have a lot of smarts about MATMUL - I don't recommend coding that yourself.
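For the "try it" part, here is a minimal sketch that runs the FORALL statement from the question next to hand-written loops, so the two can be timed or their assembly compared. The shapes (`m`, `n`, `k`), the contents of `E`, and the random data are assumptions for illustration; the thread does not give them. It says nothing about whether a given compiler version creates temporaries; it only gives you something small to inspect.

```fortran
program matmul_vs_loops
  implicit none
  ! Assumed sizes; the original post does not state them.
  integer, parameter :: m = 4, n = 6, k = 3
  real    :: A(m,m,k), B(m,n), C(m,n), D1(m,n), D2(m,n)
  integer :: E(n), i, r, s

  call random_number(A)
  call random_number(B)
  call random_number(C)
  ! Assumed index vector: each E(i) selects one m-by-m slice of A.
  E = (/ (mod(i, k) + 1, i = 1, n) /)

  ! Array-section form, as written in the question.
  forall (i = 1:n) D1(:,i) = matmul(A(:,:,E(i)), B(:,i)) + C(:,i)

  ! Hand-written equivalent: D2(:,i) = A(:,:,E(i)) * B(:,i) + C(:,i)
  do i = 1, n
    D2(:,i) = C(:,i)
    do s = 1, m
      do r = 1, m            ! innermost loop runs down a column of A
        D2(r,i) = D2(r,i) + A(r,s,E(i)) * B(s,i)
      end do
    end do
  end do

  ! The two results may differ by rounding, since the summation order can differ.
  print *, 'max abs difference:', maxval(abs(D1 - D2))
end program matmul_vs_loops
```

Note that in Fortran's column-major layout all three sections here, A(:,:,E(i)), B(:,i) and C(:,i), refer to contiguous storage, which is the easy case for a compiler.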
This is a "tricky" issue and the Intel documentation is contradictory. The MKL documentation states:
To obtain the best performance with Intel MKL, make sure the following conditions are met:
- arrays are aligned on a 16-byte boundary
- leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16
- for two-dimensional arrays, leading dimension values divisible by 2048 are avoided
This appears to contradict the previously posted documentation, which says to avoid powers of 2. If I recall correctly, all Intel CPUs after the 486 are quad-byte aligned, so wouldn't a 32-byte boundary be better?
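As a rough sketch of how the two leading-dimension conditions could be applied together, the function below pads a requested dimension until its size in bytes is a multiple of 16 but not a multiple of 2048. Reading "leading dimension values" as n*element_size in bytes is my interpretation of the quoted text, and the names `ld_padding` and `padded_ld` are hypothetical, not part of MKL. The 16-byte alignment of the array itself is not handled here.

```fortran
module ld_padding
  implicit none
contains
  ! Smallest leading dimension >= n whose size in bytes is divisible by 16
  ! and not divisible by 2048 (assumed reading of the quoted conditions).
  pure integer function padded_ld(n, elem_bytes) result(ld)
    integer, intent(in) :: n          ! rows actually needed
    integer, intent(in) :: elem_bytes ! bytes per element, e.g. 8 for double precision
    ld = n
    do while (mod(ld*elem_bytes, 16) /= 0 .or. mod(ld*elem_bytes, 2048) == 0)
      ld = ld + 1
    end do
  end function padded_ld
end module ld_padding

program padding_demo
  use ld_padding
  implicit none
  integer, parameter :: n = 512
  integer :: ld
  double precision, allocatable :: a(:,:)

  ld = padded_ld(n, 8)       ! 512*8 = 4096 bytes is a multiple of 2048, so it gets padded
  allocate(a(ld, n))         ! only rows 1..n of each column are meant to be used
  a = 0.0d0
  print '(a,i0)', 'padded leading dimension: ', ld
end program padding_demo
```

For n = 512 and 8-byte elements this gives 514, since 513*8 bytes is not divisible by 16 and 514*8 = 4112 bytes is.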
It's not contradictory. 16-byte alignment is the most that is needed, and helps when the compiler could generate SSE2/3 instructions. You want to avoid a leading dimension that is a larger power of 2 in bytes (something like 16 is not a problem.)
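To make the power-of-two effect concrete, here is a minimal sketch, assuming a hypothetical cache with 64-byte lines and 64 sets and 4-byte REAL elements (the thread does not name a specific cache). It counts how many distinct cache sets the first elements of 16 consecutive columns fall into, which is the access pattern when you walk along a row of a two-dimensional array. The names `cache_sketch` and `distinct_sets` are made up for illustration.

```fortran
module cache_sketch
  implicit none
  ! Assumed cache geometry and element size; adjust for a real processor.
  integer, parameter :: line_bytes = 64, num_sets = 64, elem_bytes = 4
  integer, parameter :: ncols = 16
contains
  ! Number of distinct cache sets touched by A(1,1), A(1,2), ..., A(1,ncols)
  ! when A has leading dimension ld and starts at byte offset 0.
  integer function distinct_sets(ld)
    integer, intent(in) :: ld
    logical :: used(0:num_sets-1)
    integer :: j, offset
    used = .false.
    do j = 1, ncols
      offset = (j - 1) * ld * elem_bytes               ! byte offset of A(1,j)
      used(mod(offset / line_bytes, num_sets)) = .true. ! set index of that cache line
    end do
    distinct_sets = count(used)
  end function distinct_sets
end module cache_sketch

program cache_demo
  use cache_sketch
  implicit none
  print '(a,i0)', 'distinct sets, leading dimension 512: ', distinct_sets(512)
  print '(a,i0)', 'distinct sets, leading dimension 520: ', distinct_sets(520)
end program cache_demo
```

With these assumed parameters, a leading dimension of 512 puts every column start into one of only 2 sets, while 520 spreads the same 16 accesses over 16 sets, so the lines are much less likely to evict each other. Contiguous access down a column is unaffected either way.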
