I'm trying to test the effectiveness of a manual cache-blocking (loop tiling) optimization applied to a routine in some Fortran scientific code. For tile size selection, I used an algorithm based on classical Distinct Lines Estimation. I am using the Intel Fortran Compiler, ifort 18.0.1 (2018).
The code is compiled with the -O3 -xHost flags. To observe any speedup between the base and tiled versions, I have to set the prefetching level to 2 (with -qopt-prefetch=2). With that setting I obtain a 27% speedup (24 seconds versus 33 seconds). The tile size is more or less correct (verified by various other runs). With plain -O3 -xHost, the execution time is 20 seconds for both versions, so I see no difference between base and tiled.
I have multiple loop nests that have more or less the same structure. A simple loop nest is the following (base version):
    DO jk = 2, jpkm1
       ! Interior value ( multiplied by wmask)
       DO jj = 1, jpj
          DO ji = 1, jpi
             zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
             zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
             zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
          END DO
       END DO
    END DO
and the optimized one:
    DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
       DO jk = 2, jpkm1
          DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
             DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
             END DO
          END DO
       END DO
    END DO
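For anyone who wants to reproduce the comparison in isolation, here is a minimal, self-contained sketch that times both loop nests and checks that they produce the same result. The array sizes, the tile size (`tiley`, standing in for OBS_UPSTRFLX_TILEY), the random inputs and the program name are placeholder assumptions, not values from the real routine, so the timings it prints are not expected to match the ones quoted above.

```fortran
! Sketch only: sizes, tile size and input data are assumed for illustration.
PROGRAM tile_check
   IMPLICIT NONE
   INTEGER, PARAMETER :: wp = KIND(1.0d0)
   INTEGER, PARAMETER :: jpi = 256, jpj = 256, jpk = 64, jn = 1   ! assumed sizes
   INTEGER, PARAMETER :: jpkm1 = jpk - 1
   INTEGER, PARAMETER :: tiley = 16                                ! assumed tile size
   REAL(wp), ALLOCATABLE :: pwn(:,:,:), wmask(:,:,:), ptb(:,:,:,:)
   REAL(wp), ALLOCATABLE :: zwz_base(:,:,:), zwz_tile(:,:,:)
   REAL(wp) :: zfp_wk, zfm_wk
   INTEGER :: ji, jj, jk, jltj
   INTEGER(KIND=8) :: t0, t1, t2, rate

   ALLOCATE(pwn(jpi,jpj,jpk), wmask(jpi,jpj,jpk), ptb(jpi,jpj,jpk,jn))
   ALLOCATE(zwz_base(jpi,jpj,jpk), zwz_tile(jpi,jpj,jpk))

   ! Synthetic inputs: pwn takes both signs so both upstream branches are used.
   CALL RANDOM_NUMBER(pwn)
   pwn = pwn - 0.5_wp
   CALL RANDOM_NUMBER(ptb)
   wmask = 1.0_wp
   zwz_base = 0.0_wp
   zwz_tile = 0.0_wp

   CALL SYSTEM_CLOCK(t0, rate)

   ! Base version
   DO jk = 2, jpkm1
      DO jj = 1, jpj
         DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz_base(ji,jj,jk) = 0.5_wp * ( zfp_wk * ptb(ji,jj,jk,jn)      &
               &                          + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
         END DO
      END DO
   END DO

   CALL SYSTEM_CLOCK(t1)

   ! Tiled version: the jj range is split into blocks of tiley rows
   DO jltj = 1, jpj, tiley
      DO jk = 2, jpkm1
         DO jj = jltj, MIN(jpj, jltj+tiley-1)
            DO ji = 1, jpi
               zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
               zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
               zwz_tile(ji,jj,jk) = 0.5_wp * ( zfp_wk * ptb(ji,jj,jk,jn)   &
                  &                          + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            END DO
         END DO
      END DO
   END DO

   CALL SYSTEM_CLOCK(t2)

   PRINT *, 'base  time (s):', REAL(t1 - t0, wp) / REAL(rate, wp)
   PRINT *, 'tiled time (s):', REAL(t2 - t1, wp) / REAL(rate, wp)
   PRINT *, 'max abs diff  :', MAXVAL(ABS(zwz_base - zwz_tile))
END PROGRAM tile_check
```

A single pass over arrays this small is probably too short to time reliably, so in practice each nest would need to be repeated many times, and the program built once with plain -O3 -xHost and once with -qopt-prefetch=2 added.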
Why can't I observe any speedup with the plain -O3 -xHost run? I suspect the cause is the aggressive software prefetching introduced by -O3 (which should be the effect of -qopt-prefetch=3), but I would like to know whether I can optimize further with cache blocking. I have tried some handmade software prefetching like this:
    DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
       DO jk = 2, jpkm1
          DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
             DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
                IF(jk== jpkm1 .AND. jj == MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)-2) THEN
                   CALL mm_prefetch(pwn(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                   CALL mm_prefetch(zwz(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                   CALL mm_prefetch(ptb(1,jltj+OBS_UPSTRFLX_TILEY,1,jn), 1)
                   CALL mm_prefetch(wmask(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                ENDIF
             END DO
          END DO
       END DO
    END DO
but this doesn't seem to help. The strange thing is that with -O3 -xHost, varying the tile size has no effect at all on the execution time, which stays at 20 s. Any suggestions would be greatly appreciated.
Best regards.