I am using vtune on a numerically intensive Fortran code with input parameters JD and KD which control the problem size. When I run with input parameters JD=41 and KD=41, vtune highlighted "4K Aliasing". This was new to me so I educated myself a bit about write-after-read hazards. So far, so good. Inside vtune, there are two subroutines which show 4K aliasing numbers of 1.000. One of the subroutines is essentially this:
SUBROUTINE DECJ ( JPER,B,D,H,XSC,JD,KD ) LOGICAL, INTENT (IN) :: JPER INTEGER, INTENT (IN) :: JD,KD REAL*8, DIMENSION(JD,KD), INTENT (INOUT) :: B,D REAL*8, DIMENSION(JD,KD), INTENT (IN) :: H,XSC INTEGER :: J,JP,JM,K DO K = 1,KD DO J = 2,JD-1 JP = J+1 JM = J-1 B(JP,K) = B(JP,K) - H(JP,K)*(0.5*XSC(J,K)) D(JM,K) = D(JM,K) + H(JM,K)*(0.5*XSC(J,K)) ENDDO ENDDO This is called twice: CALL DECJ ( JPER,B,D,H,XSCP,JD,KD )
CALL DECJ ( JPER,BT,DT,H,XSCM,JD,KD )
The arguments here are automatic arrays in the calling routine The calling routine has several automatic arrays, declared like this:
REAL*8, DIMENSION(JD,KD) :: A,B,C,D,E REAL*8, DIMENSION(JD,KD) :: AT,BT,CT,DT,ET REAL*8, DIMENSION(JD,KD,5) :: G REAL*8, DIMENSION(JD,KD) :: H,UU,XSCP,XSCM
My basic question is, what specifically triggers 4K aliasing in the case JD=41, KD=41 and not in the case JD=41, KD=40 (experimentally, with JD=41 and KD=40, vtune shows minimal 4K aliasing in subroutine decj, aliasing number is 0.109).
Compilation was with ifort 2015.3.187 using the options
-O3 -axCORE-AVX2,AVX -xSSE4.2 -g -ip -pad -align -auto -fpe0 -ftz -traceback
The loop in decj is unrolled 4 times by the compiler, so presumably after unrolling it looks something like this:
B(J+1,K) = B(J+1,K) - H(J+1,K)*(0.5*XSC(J, K)) D(J-1,K) = D(J-1,K) + H(J-1,K)*(0.5*XSC(J, K)) B(J+2,K) = B(J+2,K) - H(J+2,K)*(0.5*XSC(J+1,K)) D(J, K) = D(J, K) + H(J, K)*(0.5*XSC(J+1,K)) B(J+3,K) = B(J+3,K) - H(J+3,K)*(0.5*XSC(J+2,K)) D(J+1,K) = D(J+1,K) + H(J+1,K)*(0.5*XSC(J+2,K)) B(J+4,K) = B(J+4,K) - H(J+4,K)*(0.5*XSC(J+3,K)) D(J+2,K) = D(J+2,K) + H(J+2,K)*(0.5*XSC(J+3,K)) I did some testing and couldn't find any addresses that differed by a multiple of 4096. The worst I could find was some addresses that differed by a multiple of 256.
4K aliasing occurs when you store one memory location, then load from another memory location which is 4KB offset from original, they are linear addresses, for example 0x400000, 0x401000, 0x402000...etc
Because penalty is not high (vs. cache miss), ~5 cycles. If it didn't occur in depth-loop, and called not frequent in subroutine, just ignore this.
Solving 4K aliasing, change offset for these memory. We hope to use memory such 0x400000, 0x401010, 0x402020,... - 5-11 bits are different.
In my view, you operated among B,D,H, you may add offset such as:
(JD,KD-1) :: B
(JD,KD) :: D
(JD,KD+1) :: H
That is, use offset to avoid 4K aliasing.
Also, see the product documentation for the definition:
The key here is that the low-order 12 bits of the address are ignored, so it's not that the address are offset by 4K, but that the address masked with 0xFFFFF000 is offset by 4K. :\
Here are some more details from the tuning guide for the 4th generation Intel® Core™ processors (btw, tuning papers for publicly available processors are available):
"This occurs when a load is issued after a store and their memory addresses are offset by (4K). When this is processed in the pipeline, the issue of the load will match the previous store (the full address is not used at this point), so pipeline will try to forward the results of the store and avoid doing the load (this is store forwarding). Later on when the address of the load is fully resolved, it will not match the store, and so the load will have to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but could be worse in certain situations, like with un-aligned loads that span 2 cache lines."
Hope that helps.
I find it helpful to print all the base pointers in hexadecimal at the beginning of the code (or as soon as the pointers are allocated). 4KiB aliasing will occur a lot if any two pointers end with the same 3 hex digits.
The penalty for 4KiB aliasing depends on the processor as well as on the details of the code (such as how many reads and writes are in the inner loop and where in the memory hierarchy the data ends up being found). It is often relatively easy to avoid, and even when it is inconvenient to avoid it entirely, it is usually possible to do a quick check (modifying only a few arrays) to get an estimate of the magnitude of the performance penalty.