reference has unaligned access

The vectorization support output for the following loops:
      DO K = 1, NZ
         DO J = 1, NY
            DO I = 1, NX

               SLICE_BACK =  GRID(I-1,J-1,K-1) + GRID(I-1,J,K-1) + GRID(I-1,J+1,K-1) + &
                             GRID(I  ,J-1,K-1) + GRID(I  ,J,K-1) + GRID(I  ,J+1,K-1) + &
                             GRID(I+1,J-1,K-1) + GRID(I+1,J,K-1) + GRID(I+1,J+1,K-1)

               SLICE_MINE =  GRID(I-1,J-1,K)   + GRID(I-1,J,K)   + GRID(I-1,J+1,K) + &
                             GRID(I  ,J-1,K)   + GRID(I  ,J,K)   + GRID(I  ,J+1,K) + &
                             GRID(I+1,J-1,K)   + GRID(I+1,J,K)   + GRID(I+1,J+1,K)

               SLICE_FRONT = GRID(I-1,J-1,K+1) + GRID(I-1,J,K+1) + GRID(I-1,J+1,K+1) + &
                             GRID(I  ,J-1,K+1) + GRID(I  ,J,K+1) + GRID(I  ,J+1,K+1) + &
                             GRID(I+1,J-1,K+1) + GRID(I+1,J,K+1) + GRID(I+1,J+1,K+1)

               WORK(I,J,K) = ( SLICE_BACK + SLICE_MINE + SLICE_FRONT ) / 27.0

            END DO
         END DO
      END DO

      DO K = 1, NZ
         DO J = 1, NY
            DO I = 1, NX
               GRID(I,J,K) = WORK(I,J,K)
            END DO
         END DO
      END DO


reports the following unaligned accesses:

remark #15389: vectorization support: reference grid(i,j-1,k+1) has unaligned access   
remark #15389: vectorization support: reference grid(i,j,k+1) has unaligned access   
remark #15389: vectorization support: reference grid(i,j+1,k+1) has unaligned access   
remark #15389: vectorization support: reference grid(i+1,j-1,k+1) has unaligned access   
remark #15389: vectorization support: reference grid(i+1,j,k+1) has unaligned access   
remark #15389: vectorization support: reference grid(i+1,j+1,k+1) has unaligned access   

remark #15389: vectorization support: reference grid(i,j,k) has unaligned access  
remark #15389: vectorization support: reference work_(i,j,k) has unaligned access   

I think I understand the remarks for the i+/-1, j+/-1 and k+/-1 index accesses to grid. However, I can't understand why grid(i,j,k) or work(i,j,k) is reported as an unaligned access even though grid has been declared aligned:

!DEC$ ATTRIBUTES ALIGN: 32 :: GRID1,  GRID2,  GRID3,  GRID4,  GRID5,  GRID6,  GRID7,  GRID8,    & !                       
   GRID9,  GRID10, GRID11, GRID12, GRID13, GRID14, GRID15, GRID16, GRID17, GRID18, GRID19,      & !                       
   GRID20, GRID21, GRID22, GRID23, GRID24, GRID25, GRID26, GRID27, GRID28, GRID29, GRID30,      & !                       


1) Does it mean that GRID has not been aligned?

2) How can I make the GRID and WORK accesses aligned so that I can improve vectorization performance?

1 Reply
Black Belt

Even though grid(1,1,1) may be aligned, grid(1,2,1) is aligned only for specific values of SIZE(grid,1). If grid is ALLOCATABLE, the compiler cannot know those sizes at compile time, and I doubt that it will check for them anyway, even if the declarations are visible (e.g. by IPO compilation). To have any chance of suppressing remainder loops, you would need to assert alignment immediately prior to the inner loop.

In a compilation for AVX or wider, the code should still run even if your assertion isn't valid. The report will tell you whether the compiler honored your assertion and, if so, whether it expects a performance improvement. Likewise, hardware supporting AVX or newer has been designed so that any penalty for misalignment is incurred only for stores (and the compiler can use code which doesn't fault if your alignment assertion isn't valid).

In your example, the compiler could adjust alignment only for WORK, which is the only array where alignment matters. You need only be concerned about alignment of WORK unless NX is small enough that the time required to check alignment once per J iteration is significant. In that case, your effort to assure alignment would still pay off, as the remainder loop would be skipped. You may be able to determine whether you are losing performance there by running Advisor, VTune, or an equivalent utility.
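The dependence on SIZE(grid,1) can be checked with a little arithmetic. The following is only a sketch and not code from this thread: the module name align_sketch, the helpers column_is_aligned and padded_extent, and the routine copy_back are hypothetical. It assumes 4-byte REAL elements, the 32-byte alignment requested in the question, and arrays declared without halo cells whose first element is already aligned. !DIR$ VECTOR ALIGNED is the Intel Fortran directive for asserting alignment ahead of a loop; other compilers treat it as a comment.

```fortran
module align_sketch
   implicit none
   integer, parameter :: align_bytes = 32   ! matches ALIGN: 32 in the question
   integer, parameter :: elem_bytes  = 4    ! assumption: default 4-byte REAL
contains

   ! True when element (1,2,1) of an array with leading extent n1 starts on
   ! an align_bytes boundary, given that element (1,1,1) does.
   pure logical function column_is_aligned(n1)
      integer, intent(in) :: n1
      column_is_aligned = mod(n1 * elem_bytes, align_bytes) == 0
   end function column_is_aligned

   ! Smallest leading extent >= n1 for which every column starts aligned.
   pure integer function padded_extent(n1)
      integer, intent(in) :: n1
      integer :: per_vec
      per_vec = align_bytes / elem_bytes        ! elements per 32-byte vector
      padded_extent = ((n1 + per_vec - 1) / per_vec) * per_vec
   end function padded_extent

   ! Copy loop from the question. nxp is the padded leading extent; only the
   ! first nx elements of each column carry data.
   subroutine copy_back(grid, work, nx, nxp, ny, nz)
      integer, intent(in)    :: nx, nxp, ny, nz
      real,    intent(inout) :: grid(nxp, ny, nz)
      real,    intent(in)    :: work(nxp, ny, nz)
      integer :: i, j, k

      do k = 1, nz
         do j = 1, ny
            ! Assertion placed immediately before the inner loop. It is only
            ! truthful if grid and work start on a 32-byte boundary and nxp
            ! came from padded_extent.
            !DIR$ VECTOR ALIGNED
            do i = 1, nx
               grid(i, j, k) = work(i, j, k)
            end do
         end do
      end do
   end subroutine copy_back

end module align_sketch
```

With these assumptions, column_is_aligned(nx) answers whether grid(1,2,1) lands on a 32-byte boundary, and allocating both arrays with a leading extent of padded_extent(nx) makes every column start aligned. The stencil loop in the question reads GRID(I-1,...), so GRID presumably carries halo cells; in that case GRID(1,J,K) is offset from the start of each column and this padding alone would not align it.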

In your case, the inner loop has enough work that unrolling doesn't appear to be needed, so you could turn off unrolling (by compile option or directive); the required SIZE(grid,1) would then only need to be a multiple of the cache line size. The compiler report at opt-report level 4 would tell you about unrolling and the optimization of remainder loops.
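As a rough illustration of the directive route, the sketch below disables unrolling on the inner loop of a reduced, two-dimensional version of the stencil. It is not code from the question: the module name nounroll_sketch, the routine smooth2d and the halo layout grid(0:nx+1, 0:ny+1) are assumptions made for the example. !DIR$ NOUNROLL is an Intel Fortran directive that applies to the DO loop that follows it; other compilers see a comment.

```fortran
module nounroll_sketch
   implicit none
contains

   ! Nine-point average over a 2D grid with one layer of halo cells.
   subroutine smooth2d(grid, work, nx, ny)
      integer, intent(in)  :: nx, ny
      real,    intent(in)  :: grid(0:nx+1, 0:ny+1)   ! assumed halo layout
      real,    intent(out) :: work(nx, ny)
      integer :: i, j

      do j = 1, ny
         ! Ask the compiler not to unroll the vectorized inner loop; the
         ! loop body already contains plenty of independent work.
         !DIR$ NOUNROLL
         do i = 1, nx
            work(i, j) = ( grid(i-1, j-1) + grid(i, j-1) + grid(i+1, j-1) + &
                           grid(i-1, j  ) + grid(i, j  ) + grid(i+1, j  ) + &
                           grid(i-1, j+1) + grid(i, j+1) + grid(i+1, j+1) ) / 9.0
         end do
      end do
   end subroutine smooth2d

end module nounroll_sketch
```

Whether the directive was honored, and what the compiler did with the remainder loop, should show up in the optimization report for this loop.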