Unusual Pointer Chasing Memory Latency on SLES 11 SP2 with E5-2670
I am running on a 2S Intel motherboard, S2600GZ, with 2 x E5-2670 CPUs. I'm measuring the cache latencies I expect (4 clocks from L1, 12 from L2, 40 from L3), but when I use huge pages to measure the latency of the test described below on SLES 11 SP2, I get either X or 2X the latency from run to run. In some cases the latency is 80-90 ns and in others it's 160-180 ns. I'm sure the latency isn't the latter. I've pulled 1 CPU out of the motherboard, thinking I might be inadvertently accessing its memory, but that hasn't rectified the issue.
Do you have any idea why I'm observing this behavior? The test does the following:
1) allocates a large span of memory, say 32MB.
2) randomly accesses an 8B element in each 4096B block, with only 1 access per 4096B block.
3) that access then contains the pointer to the next access, and so on.
4) once you've made the measurement you flush every step of the walk, using CLFLUSH.
5) repeat till you get a good memory latency estimate.
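The steps above can be sketched roughly as follows. This is only a minimal illustration, not the original poster's code: the function names (build_chain, chase, flush_chain, ns_per_load), the use of rand() for the shuffle, and clock_gettime() for timing are all assumptions. The caller is assumed to supply the buffer (from huge pages in the original test), and CLFLUSH is only issued when the compiler targets SSE2.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#endif

#define BLOCK 4096  /* one pointer slot per 4096-byte block (step 2) */

/* Link the first pointer-sized slot of every block into one random cycle. */
static void **build_chain(char *buf, size_t bytes, unsigned seed)
{
    size_t n = bytes / BLOCK;
    assert(n >= 2);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    /* Block order[i] points at block order[i+1]; the last wraps to the first. */
    for (size_t i = 0; i < n; i++) {
        void **slot = (void **)(buf + order[i] * BLOCK);
        *slot = buf + order[(i + 1) % n] * BLOCK;
    }
    void **head = (void **)(buf + order[0] * BLOCK);
    free(order);
    return head;
}

/* Dependent loads: each load's address comes from the previous load (step 3). */
static void **chase(void **p, size_t steps)
{
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Flush every step of the walk from the caches (step 4). */
static void flush_chain(void **head, size_t n)
{
#ifdef __SSE2__
    void **p = head;
    for (size_t i = 0; i < n; i++) {
        void **next = (void **)*p;  /* read the link before flushing its line */
        _mm_clflush(p);
        p = next;
    }
    _mm_mfence();
#else
    (void)head; (void)n;  /* no CLFLUSH available on this target */
#endif
}

/* Flush, then time one full walk; returns average nanoseconds per load. */
static double ns_per_load(void **head, size_t n)
{
    static void *volatile sink;  /* keeps the walk from being optimized away */
    struct timespec t0, t1;
    flush_chain(head, n);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = chase(head, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)n;
}
```

For a 32MB buffer the chain has 8192 links. Calling ns_per_load(head, 8192) repeatedly and looking at the distribution of results, rather than a single average, would correspond to step 5.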
I've affinitized the process with "numactl" to no avail.
Lastly, I've also accessed one element every 128KB of a similar 32MB array and measured the latency of that pointer chase, and I don't observe this behavior there. I get a reproducible number for the latency in that test.
Any pointers or information about things I should be aware of are greatly appreciated.
Just replying to see if anyone has any ideas. I've removed 1 socket from my 2P server platform and still see the same results. Is there a program you suggest to measure the memory latency on this platform?
Seems to me this would be a basic thing someone could answer. Any help is very much appreciated.
I forgot to mention: I'm using huge pages to keep TLB misses from interfering with my measurements.
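The post does not say how the huge pages were obtained. One possible way on Linux, shown here purely as an assumption, is an anonymous mmap() with MAP_HUGETLB; the helper name alloc_buffer and the fallback to ordinary pages are illustrative only. The size should be a multiple of the huge page size (2MB here), and huge pages must have been reserved by the administrator for the first mmap() to succeed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `bytes` from huge pages; fall back to normal pages if that
 * fails. *used_huge reports which path was taken, so a benchmark can
 * refuse to run (or at least warn) when huge pages were not available. */
static void *alloc_buffer(size_t bytes, int *used_huge)
{
    void *p = MAP_FAILED;
    *used_huge = 0;
#ifdef MAP_HUGETLB
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        *used_huge = 1;
#endif
    if (p == MAP_FAILED)
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Checking used_huge before each run is one way to rule out a silent fallback to 4KB pages as a source of run-to-run variation.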
I get the following as latencies of the caches for this server processor:
What's the expected memory latency (if it's publicly available), and can you confirm the L3 latency? On my SB desktop system the latency ranges from 25-30 clocks but is typically 29. I was surprised to see the increase in latency, but now there are 8 cores on the ring bus.
Update: I just downloaded lmbench3, and it's reporting 127 ns. I know that's bogus. So any pointers to what's expected with 1S populated in this 2S motherboard, using the processor from my first post and DDR3-1600, are greatly appreciated, if anyone knows.
Hello perfwise, Sorry for not responding earlier. Are you able to turn off the prefetchers in the BIOS and then just try a standard memory latency test: a sequential linked list of dependent loads with a 64 byte stride? With 2MB pages, the TLB penalty should be negligible. This would eliminate any question about your methodology. You might be doing everything right; I'm just not so familiar with random loads and clflushing.
I would probably check (with prefetchers off), using a standard latency test, the result for regular (4KB) pages, then for 2MB pages, using a 64 byte stride and something like a 40MB array size (or even an 80MB array).
I would expect the 4KB latency to be similar to the 2MB latency, but I personally haven't used 2MB pages very much. Pat
Hello Perfwise, Sorry for the delay. I'm getting about 76 ns/LLC_miss for a page-hit. This is a 'load to use' latency.
This test has the config: prefetchers off, turbo off, run for 20 seconds, 40MB array, stride 64 bytes, linked list, dependent load. I ran the test on 1 cpu from each node simultaneously, using cpu 3 and cpu 18 (I try to avoid running on cpu 0). Memory was malloc'd on the same cpu on which the test was run, so I malloc'd a 40MB array on cpu 3 and ran the latency test on cpu 3. Same method for cpu 18.
For the page-miss case, I used a stride of 4096 bytes and a 512MB array size. I got a latency of 87.3 ns/miss.
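A fixed-stride dependent-load test of the kind described here might look roughly like the sketch below. It is not Pat's actual tool: the function names are made up, timing uses clock_gettime(), and the number of loads is passed in rather than running for 20 seconds. CPU pinning, turbo, and prefetcher settings are assumed to be handled outside the program (for example via numactl and the BIOS).

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Link buf[0] -> buf[stride] -> buf[2*stride] -> ... -> back to buf[0]. */
static void **build_stride_chain(char *buf, size_t bytes, size_t stride)
{
    assert(stride >= sizeof(void *) && stride % sizeof(void *) == 0);
    assert(bytes >= 2 * stride);
    size_t n = bytes / stride;
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % n) * stride;
    return (void **)buf;
}

/* Follow the chain; every load depends on the result of the previous one. */
static void **follow(void **p, size_t loads)
{
    while (loads--)
        p = (void **)*p;
    return p;
}

/* Average nanoseconds per dependent load over `loads` loads. */
static double time_chain_ns(void **head, size_t loads)
{
    static void *volatile sink;  /* keeps the loop from being optimized away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = follow(head, loads);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)loads;
}
```

With a 40MB buffer and a stride of 64 this corresponds to the page-hit configuration, and with a 512MB buffer and a stride of 4096 to the page-miss configuration. The array must be much larger than the L3 for the loads to miss the cache.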
For an L3/LLC hit, using a 10MB array size, I get a latency of 40.2 clockticks.
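Since the thread mixes clockticks and nanoseconds, a small conversion helper makes the numbers comparable. The sketch assumes the core runs at the E5-2670's nominal 2.6 GHz (turbo off, as in the test above); the helper names are illustrative.

```c
/* Convert between core clockticks and nanoseconds at a given frequency.
 * ghz is the core frequency in GHz, i.e. clocks per nanosecond. */
static double clocks_to_ns(double clocks, double ghz)
{
    return clocks / ghz;
}

static double ns_to_clocks(double ns, double ghz)
{
    return ns * ghz;
}
```

Under that assumption, the 40.2 clocktick L3 hit is about 15.5 ns, the 76 ns page-hit miss is about 198 clockticks, and the 87.3 ns page-miss case is about 227 clockticks.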