Hello,
Running a memory-performance benchmark that tests different chunk transfer sizes on an E5 with 20 MB of L3 cache and 8 cores, I am seeing five "shelves", if you will:
- Shelf 1 is the L1 cache.
- Shelf 2 is the L2 cache.
- Shelf 3 extends up to 12 MB.
- Shelf 4 extends to 20 MB.
- Shelf 5 is main memory.
Why is there a shelf that goes to 12 MB? Has Intel partitioned the L3 somehow? Or does the L3 actually have two sections running at different speeds?
Thanks for any clues.
7 Replies
Moving this to a more appropriate user forum.
Hello Zack,
Can you tell us the full name of the chip from the cpuid info?
And which software is reporting the different cache levels?
Thanks,
Pat
Hi Patrick,
It's the Intel Xeon E5-2690. We were running this program:
http://zsmith.co/bandwidth.html
I think the owner of the Xeon system (it's not mine) said that he got different results depending on which core he ran the code on.
Here's the best case graph, where the L3 performance drops at ~17MB:
http://zsmith.co/images/Xeon-E5-2690.png
Ok... these graphs look pretty reasonable.
I can't attest to the MB/sec values but the shape of the graphs looks about right.
The reason the graphs don't all have a shoulder right at 20 MB is that the L3 starts getting way conflicts as you get close to filling it up.
The L3 uses a pseudo-least recently used (LRU) algorithm to decide which line gets evicted when you bring in a new line from memory.
When you stream memory through the L3, you get some inefficiencies due to the LRU method.
Does that make sense?
I see that you already have results using the nice command. CPU 0 is usually the busiest CPU. Pinning the test to some CPU besides CPU 0 and ramping up the priority usually gets the best results.
In my memory test utility, I have a pure read test (for caches above the first-level cache) which is just a loop like:
jnk=someval; for (i=0; i < int_array_size/4; i += 16) { jnk += int_array[i]; }
which loads just 1 int per cache line... this gives me a pretty good estimate of the max bandwidth you can achieve (with a single thread).
This works okay if testing L2, L3 and memory, where reading from a cache line moves the whole cache line to L1.
It is not a very realistic test since you aren't really doing anything with the values.
I have a similar test for writes (updating just 1 int per cacheline).
Some care has to be taken to make sure the compiler doesn't optimize away the loop.
Pat
Thanks for the clarification.
Just in case you find it useful, the following tool can pretty-print the cache hierarchy.
http://code.google.com/p/likwid/wiki/LikwidTopology
