Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cache Blocking and Prefetching

Chiarelli__Marco
Beginner

I'm trying to test the effectiveness of a manual cache-blocking (loop-tiling) optimization applied to a routine in some Fortran scientific code. For Tile Size Selection I used an algorithm based on classical Distinct Lines Estimation. I am using the Intel Fortran Compiler, ifort 18.0.1 (2018).

The code is compiled with the -O3 -xHost flags. To observe any speedup of the tiled version over the base version, I have to lower the prefetching level to 2 (using -qopt-prefetch=2). With that setting I obtain a 27% speedup (24 seconds versus 33 seconds), and the tile size is more or less correct (verified across various other runs). With plain -O3 -xHost the execution time remains unimproved (20 seconds), so I see no difference between the base and tiled versions.
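For concreteness, the two builds being compared look roughly like this (the source file name is a placeholder, not from my actual project):

```shell
# Base and tiled source, with the default software prefetching implied by -O3:
ifort -O3 -xHost -o tra_adv_base tra_adv.f90

# Same flags, but with software prefetching dialed down to level 2:
ifort -O3 -xHost -qopt-prefetch=2 -o tra_adv_pf2 tra_adv.f90
```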

I have multiple loop nests, that more or less have the same structure. A simple loop nest is the following, the base version:

DO jk = 2, jpkm1        ! Interior value ( multiplied by wmask)
    DO jj = 1, jpj
        DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
        END DO
    END DO
END DO

and the optimized one:

DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY    
    DO jk = 2, jpkm1
        DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
            DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            END DO
        END DO  
    END DO  
END DO

Why can't I observe any speedup in the plain -O3 -xHost run? I suspect the problem is the aggressive software prefetching introduced at -O3 (which should correspond to the -qopt-prefetch=3 level implied by -O3), but I would like to know whether I can optimize further with cache blocking. I have tried some hand-made software prefetching like this:

DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
    DO jk = 2, jpkm1
        DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
            DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
                IF(jk == jpkm1 .AND. jj == MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)-2) THEN
                    CALL mm_prefetch(pwn(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                    CALL mm_prefetch(zwz(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                    CALL mm_prefetch(ptb(1,jltj+OBS_UPSTRFLX_TILEY,1,jn), 1)
                    CALL mm_prefetch(wmask(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                ENDIF
            END DO
        END DO
    END DO
END DO

but this doesn't seem to help. The strange thing is that with -O3 -xHost, varying the tile size does not affect the execution time at all; it remains unimproved at 20 s. Any suggestion would be greatly appreciated.
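For what it's worth, instead of guarding explicit mm_prefetch calls with an IF inside the innermost loop, ifort also accepts a PREFETCH directive placed just before a loop; the hint and distance values below are illustrative, not tuned:

```fortran
DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
    DO jk = 2, jpkm1
        ! Ask the compiler to issue prefetches for these arrays with
        ! hint 1 (prefetch toward L2) and a distance of 8 iterations.
        !DIR$ PREFETCH pwn:1:8, ptb:1:8, zwz:1:8, wmask:1:8
        DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
            DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            END DO
        END DO
    END DO
END DO
```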

Best regards.

5 Replies
McCalpinJohn
Honored Contributor III

At higher optimization levels the compiler may significantly re-write a blocked loop nest.  This may include loop reordering or loop fusion followed by the compiler's decisions on blocking.  

I often use a "pragma nounroll" directive when I want the compiler to pay a bit more attention to the structure that I have chosen for the loops, but that is not always enough.
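(In Fortran with ifort, such directives are spelled as a line immediately before the loop; a sketch of pinning down the manually tiled nest, using directive names from Intel's directive set:)

```fortran
!DIR$ NOUNROLL       ! don't unroll the tiling loop
!DIR$ NOBLOCK_LOOP   ! suppress the compiler's own cache blocking of this nest
DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
    ! ... manually tiled loop body as written ...
END DO
```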

Reviewing the assembly output can be helpful for understanding the inner loop, but it can be a lot of work to understand the blocking factors and outer loop structure from looking at the assembly code.

Vectorization is almost always limited to the contiguous (hopefully inner) loop, so rectangular blocks that are larger in the contiguous index are preferred.  In addition, the hardware prefetchers won't stop at the end of each row of your block, so your cache footprint will be larger than you might expect. 

Chiarelli__Marco
Beginner

Dear Dr. Bandwidth,

Thanks for the quick and precise reply. Indeed, this seems to be the situation with the -O3 flag. However, is there any way to see which blocking factors the compiler's internal heuristics chose for Tile Size Selection? Or, more simply, how can I confirm that the compiler's automatic cache blocking is being applied rather than mine, without looking at the assembly code? I've already analyzed the base version and my manually blocked version with Advisor XE, and the only interesting thing I've found is that, unlike in the base version, in my blocked version the loops in question are reported as affected by Loop Distribution under Advanced Optimizations. I guess this is applied to the intra-tile loop. However, I cannot see any blocking in Advisor, even though the situation seems to be exactly the one you described.
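One non-assembly place I thought of looking is the compiler's optimization report; a sketch of what I mean (the source file name is a placeholder):

```shell
# Emit a detailed optimization report for the loop-nest phase, then
# search it for blocking messages and their factors.
ifort -O3 -xHost -qopt-report=5 -qopt-report-phase=loop -c tra_adv.f90
grep -i "block" tra_adv.optrpt
```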

And if automatic blocking is being applied: in your professional opinion, is it worthwhile to investigate the performance gain of manual cache blocking, or do you think Intel's internal heuristics for choosing the block size might outperform my Distinct Lines based TSS, given their deeper knowledge of the other optimizations the compiler applies at -O3?

McCalpinJohn
Honored Contributor III

I don't do manual cache blocking very often, but that is because it is so hard to control the compiler's optimizations, and not because I think that the compiler is going to make better choices.   When I do manual blocking, I never use O3 optimization, for the same reasons that you have experienced.

Unfortunately, the only way to really guarantee that the compiler does not change the blocking is to write in assembler.  This is generally considered too painful to justify the effort, but there are some special cases where the speedup makes it worthwhile.   It looks like Intel's DGEMM implementations are built in assembly language, for example. (Probably by an assembly-writing tool, rather than completely by hand, but the structure of the code and the pattern of register name usage does not look like code that the compiler generates.)

 

TimP
Honored Contributor III

Advisor extracts messages from the compiler's optimization report; it simply presents them in a more convenient way. If your questions aren't answered in Advisor, it takes more research with cache event counters in VTune to form an opinion on whether there may be opportunities for further gain beyond the Advisor hints, which may be most relevant to code that hasn't yet been analyzed. One of the useful points of Advisor is showing when you spend too much time setting up blocking or in remainder loops.

As John just mentioned, when months are spent tuning up MKL functions, they are likely to result in SIMD-intrinsic code, which prevents the compiler from doing its own cache blocking and (possibly intentionally) hinders our efforts to see what was done.

dkokron
Beginner

Marco,

The stencil in your code may not cover enough volume for blocking to be useful.  What are the data types of the arrays and upper bounds of the loop nest?

Dan
