We will need more information to understand the issue. What is the program doing? How big is the allocated memory, and how big is the active region? What does the vmstat output say, and what do the other VTune counters look like? What is the processor, how much memory does the system have, and what is the OS?
Please provide some more details, and I will be happy to have a look at your issue. Regards, Hussam
The code is set up with unit strides on all accesses, so the page tables should be used efficiently. Of course the arrays are huge, so you will be rolling the TLBs, but you should only get one TLB miss for every 4kB of memory traffic, which should not be a performance issue. How big was DTLB_LOAD_MISSES.WALK_DURATION compared to the total execution time?
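To put that last question in numbers: DTLB_LOAD_MISSES.WALK_DURATION counts cycles, so dividing it by an unhalted-cycle count taken over the same run gives a rough fraction of time spent in page walks. A minimal sketch, assuming CPU_CLK_UNHALTED.THREAD as the denominator; the counter readings below are made up for illustration and are not from the original post.

```python
def walk_fraction(walk_cycles: int, total_cycles: int) -> float:
    """Fraction of core cycles spent in page walks caused by load DTLB misses."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return walk_cycles / total_cycles


# Hypothetical readings (illustration only):
#   walk_cycles  <- DTLB_LOAD_MISSES.WALK_DURATION
#   total_cycles <- CPU_CLK_UNHALTED.THREAD (assumed denominator)
fraction = walk_fraction(walk_cycles=2_000_000_000, total_cycles=40_000_000_000)
print(f"{fraction:.1%} of cycles spent in page walks")
```

If the fraction comes out at a few percent or less, the TLB misses are probably not what is limiting performance.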
The only other way to get significant TLB misses is via DTLB address conflicts. If I am reading the CPUID information correctly on my Xeon E5 systems, the level 0 DTLB is 4-way set associative for either 4kB pages or 2MB pages. The loop nest above will access at least 15 pages at a time (one for each value of the last index of each of the three arrays), so it is at least possible that more than four of those pages map to the same DTLB set and keep evicting each other.
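A toy model of what such a conflict would look like, as a sketch only. It assumes a 64-entry DTLB for 4kB pages and assumes the set is selected by the low bits of the virtual page number; the real indexing function may differ, and the addresses are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096        # 4kB pages
ENTRIES = 64            # assumed DTLB capacity for 4kB pages
WAYS = 4                # 4-way set associative, per the CPUID reading above
SETS = ENTRIES // WAYS  # 16 sets under these assumptions


def overloaded_sets(addresses):
    """Return {set_index: page_count} for sets that need more than WAYS entries."""
    pages = {addr // PAGE_SIZE for addr in addresses}
    # Assumption: set index = virtual page number modulo the number of sets.
    counts = Counter(page % SETS for page in pages)
    return {s: n for s, n in counts.items() if n > WAYS}


# 15 concurrent streams whose start addresses are exactly 1 MiB apart:
# every page number is a multiple of 16, so all 15 pages land in set 0.
conflicting = [i * (1 << 20) for i in range(15)]

# The same streams with one extra page of offset between them spread
# across 15 different sets.
staggered = [i * ((1 << 20) + PAGE_SIZE) for i in range(15)]

print(overloaded_sets(conflicting))
print(overloaded_sets(staggered))
```

Under this model the first layout needs 15 entries in a set that holds 4, while the second needs at most one entry per set.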
The easiest way to check this is to simply split the loop nest into three separate loop nests -- one for ux, one for uy, one for uz. This will reduce the number of pointers to ~5 per loop, which should eliminate any conflicts. If there is a systematic TLB conflict, this should improve the execution time as well as the TLB miss counts.
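The shape of that transformation, sketched in Python for brevity. The loop body, the `src` array, and the flat indexing are hypothetical stand-ins, since the actual code is not reproduced here. Only the structure matters: one fused loop touching all three arrays becomes three loops that each touch one. Python itself will not show the TLB effect, so the experiment has to be run on the real code.

```python
def fused(ux, uy, uz, src):
    """One loop updating all three arrays (hypothetical body)."""
    for i in range(len(src)):
        ux[i] += src[i]
        uy[i] += 2.0 * src[i]
        uz[i] += 3.0 * src[i]


def split(ux, uy, uz, src):
    """Same work as fused(), as three loops touching one output array each."""
    for i in range(len(src)):
        ux[i] += src[i]
    for i in range(len(src)):
        uy[i] += 2.0 * src[i]
    for i in range(len(src)):
        uz[i] += 3.0 * src[i]


src = [float(i) for i in range(8)]
a = ([0.0] * 8, [0.0] * 8, [0.0] * 8)
b = ([0.0] * 8, [0.0] * 8, [0.0] * 8)
fused(*a, src)
split(*b, src)
print(a == b)
```

Splitting like this is only valid when the three updates do not depend on each other within an iteration, which is the case in this sketch.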
Note that the array sizes don't appear to be "bad" for conflicts -- 640x586x536 for the first three dimensions does not end up close to an even multiple of 2MB, but the code fragment does not show the relative alignment of the three arrays, so it is not possible to rule out conflicts.
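The arithmetic behind that remark can be checked directly. The element size is not stated in the thread, so this sketch tries both 4-byte and 8-byte elements as assumptions and reports how far the stride between consecutive values of the last index is from a multiple of 2MB.

```python
NX, NY, NZ = 640, 586, 536         # first three dimensions, from the post
HUGE_PAGE = 2 * 1024 * 1024        # 2MB page size in bytes


def stride_remainder(elem_bytes: int):
    """Return (stride in bytes, stride modulo 2MB) for one step of the last index."""
    stride = NX * NY * NZ * elem_bytes
    return stride, stride % HUGE_PAGE


for elem_bytes in (4, 8):          # assumed element sizes (float / double)
    stride, rem = stride_remainder(elem_bytes)
    print(f"{elem_bytes}-byte elements: stride {stride} B, "
          f"remainder {rem} B ({rem / HUGE_PAGE:.2f} of a 2MB page)")
```

For both element sizes the remainder is a large fraction of a 2MB page, which agrees with the observation that the sizes themselves do not look bad.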