VTune has reported a very high PageWalkDTLBAllMisses performance impact when running an application. What does this tell me about the memory access pattern of the application? I read somewhere that a high value here does not necessarily mean a high L1 cache miss rate. What is the next step to investigate?
When DTLB misses have a high performance impact, it means your program doesn't exhibit much cache locality. You may be using 20% or less of the data from each cache line (and page) before the DTLB entry is evicted. The next time data is needed from that page, even though you accessed the page recently, the DTLB entry has to be recovered by a page walk before the hardware can deal with an L1 or L2 miss. The L1 miss rate may not be much higher than the DTLB miss rate. In that case, the performance impact of L1 misses is negligible compared with that of DTLB misses. L1 misses are usually not significant unless their rate is much higher than the L2 miss rate.
On Itanium, if I understand correctly, DTLB eviction causes L3 eviction, so a DTLB miss automatically implies an L3 miss. On EM64T models, an in-cache DTLB miss (data still available in L2) may become a problem. It has much more performance impact on EM64T than on Opteron.
Examine the source code at the location of the DTLB misses. If you have nested loops, the ideal case, as you know, is an inner loop that is stride 1 and vectorizable, so that hardware prefetch is fully effective and you should not see significant DTLB misses. With an inner loop of stride 5 or more (on P4), DTLB misses are likely to begin to have an impact. If the loops can't be interchanged, unrolling an outer, smaller-stride loop ("unroll and jam") may help.
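To make the loop advice concrete, here is a minimal sketch in C, assuming a row-major matrix of doubles. The function names are made up for illustration and are not taken from any particular code base. The first version has the large stride in the inner loop, the second interchanges the loops to get stride 1, and the third leaves the loop order alone but unrolls the outer (stride 1) loop by 4 and jams the copies together, so each row visited contributes 4 adjacent elements.

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop walks down a column: consecutive accesses are `cols` doubles
   apart, so for wide rows each access can land on a different page. */
double sum_strided(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];
    return s;
}

/* Loops interchanged: the inner loop is now stride 1. */
double sum_unit_stride(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
    return s;
}

/* Unroll and jam: same loop order as sum_strided, but the outer j loop is
   unrolled by 4 and the four inner loops are fused into one, so every row
   (and page) visited supplies 4 neighbouring elements instead of 1. */
double sum_unroll_jam(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t j = 0;
    for (; j + 4 <= cols; j += 4) {
        for (size_t i = 0; i < rows; ++i) {
            const double *row = a + i * cols + j;
            s0 += row[0];
            s1 += row[1];
            s2 += row[2];
            s3 += row[3];
        }
    }
    /* Remainder columns when cols is not a multiple of 4. */
    for (; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s0 += a[i * cols + j];
    return (s0 + s1) + (s2 + s3);
}
```

All three compute the same sum (up to floating-point rounding, since the order of additions differs). Only the order in which memory is touched changes. The unroll factor of 4 is an arbitrary choice for the sketch.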
If your source code provides for cache blocking, you must take the size of the DTLB, as well as the cache size, into account in setting up the blocking. On several of the more popular processor models, the DTLB is only large enough to cover 512KB, so you want to organize the work so as to complete all operations on each 512KB memory block before tackling the next. If you block for L1 locality, you should have no problem with the DTLB. For Itanium, with minimum page size, the coverage increases to 2MB, and you are unlikely to see striding problems with the DTLB until strides are large.
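As a rough illustration of blocking with the DTLB in mind, the sketch below first works out the DTLB coverage as entries times page size. The 512KB figure is consistent with, for example, 128 entries of 4KB pages, but those two numbers are my assumption here, so check them for your processor. It then runs two passes over a large array one block at a time, so that the second pass revisits pages whose translations should still be in the DTLB. The function names and the two-pass workload are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Memory covered by the DTLB: number of entries times page size.
   Example (assumed values): 128 entries * 4096 bytes = 512KB. */
size_t dtlb_reach_bytes(size_t entries, size_t page_bytes)
{
    return entries * page_bytes;
}

/* Two passes over x (scale, then sum), applied block by block, so both
   passes finish on one block before the next block is touched. */
double scale_then_sum_blocked(double *x, size_t n, double factor,
                              size_t block_bytes)
{
    assert(x != NULL && block_bytes >= sizeof(double));
    size_t block = block_bytes / sizeof(double);   /* elements per block */
    double total = 0.0;
    for (size_t start = 0; start < n; start += block) {
        size_t end = (n - start < block) ? n : start + block;
        for (size_t i = start; i < end; ++i)       /* pass 1: scale */
            x[i] *= factor;
        for (size_t i = start; i < end; ++i)       /* pass 2: accumulate */
            total += x[i];
    }
    return total;
}
```

A call such as scale_then_sum_blocked(x, n, 2.0, dtlb_reach_bytes(128, 4096) / 2) would use half of the assumed coverage per block, leaving room for other data the loop touches. The factor of one half is a guess, not a measured optimum.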
The more difficult case is where you have indirection, and the access pattern is not regular enough for hardware prefetch.
If your OS provides huge pages, that may be a solution for DTLB performance problems.
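How you ask for huge pages depends on the OS. As one hedged example, assuming Linux with glibc, the sketch below first tries an anonymous mmap with MAP_HUGETLB, which only succeeds if the administrator has reserved huge pages. If that fails, it falls back to an ordinary mapping and hints with madvise(MADV_HUGEPAGE), which only has an effect when transparent huge pages are enabled. The helper name alloc_prefer_huge is made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS, MAP_HUGETLB, madvise on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `bytes` of zeroed memory, preferring huge pages.
   Pass a multiple of the huge page size (commonly 2MB on x86-64) so the
   same length can later be given to munmap. Returns NULL on failure and
   sets *used_hugetlb to 1 only if the MAP_HUGETLB mapping succeeded. */
void *alloc_prefer_huge(size_t bytes, int *used_hugetlb)
{
    assert(bytes > 0 && used_hugetlb != NULL);
    *used_hugetlb = 0;
    void *p;
#ifdef MAP_HUGETLB
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_hugetlb = 1;
        return p;
    }
#endif
    /* Fallback: normal pages, with a hint for transparent huge pages. */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    (void)madvise(p, bytes, MADV_HUGEPAGE);   /* advisory; failure is ignored */
#endif
    return p;
}
```

Release the memory with munmap(p, bytes). Whether this actually reduces the DTLB impact is something to confirm by rerunning the VTune measurement.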