Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How is the Xeon E5 L3 cache organized?

Zack_-_
Beginner
679 Views

Hello,

Running a memory-performance benchmark that tests different chunk transfer sizes on an E5 with 20 MB of L3 cache and 8 cores, I am seeing five "shelves," if you will:

  • Shelf 1 is the L1 cache.
  • Shelf 2 is the L2 cache.
  • Shelf 3 extends up to 12 MB.
  • Shelf 4 extends to 20 MB.
  • Shelf 5 is main memory.

Why is there a shelf that goes to 12 MB? Has Intel partitioned the L3 somehow? Or does the L3 actually have two sections running at different speeds?

Thanks for any clues.

7 Replies
Ron_Green
Moderator
moving this to a more appropriate User Forum
Patrick_F_Intel1
Employee
Hello Zack. Can you tell us the full name of the chip from the cpuid info? And which software is reporting the different cache levels? Thanks, Pat
Zack_-_
Beginner
Hi Patrick. It's the Intel Xeon E5-2690. We were running this program: http://zsmith.co/bandwidth.html
I think the owner of the Xeon system (it's not mine) said that he got different results depending on which core he ran the code on. Here's the best-case graph, where the L3 performance drops at ~17 MB: http://zsmith.co/images/Xeon-E5-2690.png
Patrick_F_Intel1
Employee
Ok... these graphs look pretty reasonable. I can't attest to the MB/sec values, but the shape of the graphs looks about right. The reason the graphs don't all have a shoulder right at 20 MB is that the L3 starts getting way conflicts as you get close to filling it up. The L3 uses a pseudo-least-recently-used (pseudo-LRU) algorithm to decide which line gets evicted when a new line is brought in from memory, and when you stream memory through the L3, the pseudo-LRU method introduces some inefficiencies, so bandwidth drops before the working set reaches the full 20 MB. Does that make sense?

I see that you already have results using the nice command. Cpu 0 is usually the busiest cpu, so pinning the test to some cpu besides cpu 0 and raising the priority usually gets the best results.

In my memory test utility, I have a pure read test (for caches above the first-level cache) which is just a loop like:

    jnk = someval;
    for (i = 0; i < int_array_size/4; i += 16) {
        jnk += int_array[i];  /* touch 1 int per 64-byte cache line */
    }

which loads just 1 int per cache line. This gives me a pretty good estimate of the max bandwidth you can achieve with a single thread. It works okay for testing L2, L3, and memory, where reading one int moves the whole cache line into L1. It is not a very realistic test, since you aren't really doing much with the values. I have a similar test for writes (updating just 1 int per cache line). Some care has to be taken to make sure the compiler doesn't optimize away the loop. Pat
Zack_-_
Beginner
Thanks for the clarification.
Andres_M_Intel4
Employee
Just in case you find it useful, the following tool can pretty-print the cache hierarchy: http://code.google.com/p/likwid/wiki/LikwidTopology