I am analyzing a program that spends most of its time doing linear algebra. I suspect the algorithm accesses its matrices inefficiently. I have sampled the program, measuring L1 and L2 cache load misses. Are these the appropriate events to measure? I also see that L2 and L3 read misses can be measured. How are these different?
The tuning assistant shows that the program spends 10% and 25% of its time on L1 and L2 misses respectively. This seems very large, but never having done a measurement like this, I am not sure what to expect.
Currently, only enterprise MP platforms (usually 4 CPU sockets) have an L3 cache.
If you have many L1 and L2 misses, TLB misses may be more (or less) important to performance; we can't guess without knowing your CPU model. If you have the option to select cache misses retired, those are the events that count. Some events may be available separately for instructions and data, or by cache lines.
High data cache miss rates are possible if you didn't organize your program for sequential memory access. You would want to do that anyway, to enable vectorization. Intel Fortran will make some of those changes automatically at -O3.
Tuning assistant performance assessments are rough; they could be off by a factor of 2.