Intel® Fortran Compiler

Manual cache blocking issues with ifort

akshan
Hi,

When I perform manual cache blocking and turn on the -O3 option, performance degrades.

First, in a standard matrix multiplication example with 512-by-512 matrices, the runtimes before optimization are:
naive version : 5.9s
blocked version: 2.3s
After optimization:
naive version: 36.21s
blocked version: 50.16s


That was just an extreme example, one that is intensive in cache access. The code I'm actually interested in (called viscousFlux_kloop) doesn't show such a pronounced difference, but the problem is still there.
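
For readers without the attachment, the kind of manual blocking meant in the matrix example can be sketched roughly as follows. This is a minimal illustration, not the code in blocking_test.tar.gz: the program name, the block size bs = 64, the choice to tile only the j and k loops, and the j-k-i loop order are all assumptions made for the sketch.

```fortran
program blocked_matmul_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n  = 512   ! matrix size used in the post
  integer, parameter :: bs = 64    ! block size: assumed value, tune for the cache
  real(dp), allocatable :: a(:,:), b(:,:), c_naive(:,:), c_blocked(:,:)
  integer :: i, j, k, jj, kk

  allocate(a(n,n), b(n,n), c_naive(n,n), c_blocked(n,n))
  call random_number(a)
  call random_number(b)

  ! Naive version: the innermost loop runs over i, so accesses to
  ! a(i,k) and c(i,j) are stride-1 in Fortran's column-major layout.
  c_naive = 0.0_dp
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c_naive(i,j) = c_naive(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do

  ! Blocked version: j and k are tiled so that a panel of bs columns
  ! of a is reused for bs columns of c before moving on.
  c_blocked = 0.0_dp
  do jj = 1, n, bs
     do kk = 1, n, bs
        do j = jj, min(jj + bs - 1, n)
           do k = kk, min(kk + bs - 1, n)
              do i = 1, n
                 c_blocked(i,j) = c_blocked(i,j) + a(i,k) * b(k,j)
              end do
           end do
        end do
     end do
  end do

  ! Sanity check: both versions should agree up to rounding.
  print *, 'max abs difference:', maxval(abs(c_naive - c_blocked))
end program blocked_matmul_sketch
```

Building this once with -O0 and once with -O3 and timing each loop nest separately is one way to reproduce the comparison above on a small, self-contained case.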

The runtimes and L2 cache miss rates before optimization:
naive version : 0.05s 0.2%
blocked version: 0.04s 0.19%

After optimization:
naive version : 0.013s 2.7%
blocked version: 0.015s 2.75%

Also, the ITLB performance is better for the unoptimized versions.
I believe the decreased runtimes come from optimizations unrelated to the cache. The cache performance itself is clearly better with -O0.

I've read that manual cache blocking should be a last resort. The optimization reports for the program viscousFlux_kloop showed no high-level optimization (HLO) activity for blocking, so I assumed one could get better performance by blocking manually.

My questions are:
What is causing the bad cache behaviour with the -O3 option?
Is there a switch to control that specific behaviour?
Is there any other way/hack to get both the better cache performance from manual blocking and the decreased runtimes from -O3?

My machine is a dual-core Xeon with the EM64T ISA, running Red Hat Enterprise Linux 4.0 and ifort version 9.0.

The output of uname -a on my machine:
Linux dn800c9608.stanford.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux

I've included two tar files inside the 'both.tar':

1) blocking_test.tar.gz
Contains the matrix example.
Untar and type 'make demo'.
Edit 'Makefile' (change the optimization switch) and type 'make demo' again.

2) optimization.tar.gz
Contains the viscousFlux_kloop.f90 code.
Untar and type 'make kloop_naive'.
Type 'make kloop_blocked'.
Edit 'config.mk' for optimization levels/reports and repeat.
The optimization reports are included in the 'rep1' files in the subdirectories.

Any help on this would be greatly appreciated by us.

Thank you,
Aravind.
Steven_L_Intel1
Employee
I am by no means an expert in this area, but I would assume that at -O3 the compiler is doing loop transformations that interfere with your manual blocking. My general advice in cases such as this is to run the program with VTune and look at events relating to cache misses. Ask in the VTune forum for help with this. You can get an evaluation copy of VTune.