Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How is the Xeon E5 L3 cache organized?

Zack_-_
Beginner
679 Views

Hello,

Running a memory-performance benchmark that tests different chunk transfer sizes on an E5 with 20 MB of L3 cache and 8 cores, I am seeing five "shelves," if you will:

  • Shelf 1 is the L1 cache.
  • Shelf 2 is the L2 cache.
  • Shelf 3 extends up to 12 MB.
  • Shelf 4 extends to 20 MB.
  • Shelf 5 is main memory.

Why is there a shelf that goes to 12 MB? Has Intel partitioned the L3 somehow? Or does the L3 actually have two sections running at different speeds?

Thanks for any clues.

7 Replies
Ron_Green
Moderator
moving this to a more appropriate User Forum
Patrick_F_Intel1
Employee
Hello Zack. Can you tell us the full name of the chip from the cpuid info? And which software is reporting the different cache levels? Thanks, Pat
Zack_-_
Beginner
Hi Patrick. It's the Intel Xeon E5-2690. We were running this program: http://zsmith.co/bandwidth.html
I think the owner of the Xeon system (it's not mine) said that he got different results depending on which core he ran the code on. Here's the best-case graph, where the L3 performance drops at ~17 MB: http://zsmith.co/images/Xeon-E5-2690.png
Patrick_F_Intel1
Employee
Ok... these graphs look pretty reasonable. I can't attest to the MB/sec values, but the shape of the graphs looks about right. The reason the graphs don't all have a shoulder right at 20 MB is that the L3 starts getting way conflicts as you get close to filling it up. The L3 uses a pseudo-least-recently-used (pseudo-LRU) algorithm to decide which line gets evicted when a new line is brought in from memory, and when you stream memory through the L3, the pseudo-LRU method introduces some inefficiencies, so bandwidth drops before the working set reaches the full 20 MB. Does that make sense?

I see that you already have results using the nice command. Cpu 0 is usually the busiest cpu, so pinning the test to some cpu besides cpu 0 and raising the priority usually gets the best results.

In my memory test utility, I have a pure read test (for caches above the first-level cache) which is just a loop like:

    jnk = someval;
    for (i = 0; i < int_array_size/4; i += 16) {
        jnk += int_array[i];  /* touch 1 int per 64-byte cache line */
    }

which loads just 1 int per cache line. This gives me a pretty good estimate of the max bandwidth you can achieve with a single thread. It works okay for testing L2, L3, and memory, where reading one int moves the whole cache line into L1. It is not a very realistic test, since you aren't really doing much with the values. I have a similar test for writes (updating just 1 int per cache line). Some care has to be taken to make sure the compiler doesn't optimize away the loop. Pat
Zack_-_
Beginner
Thanks for the clarification.
Andres_M_Intel4
Employee
Just in case you find it useful, the following tool can pretty-print the cache hierarchy: http://code.google.com/p/likwid/wiki/LikwidTopology