Hello,
I have found a lot of interesting reading about cache bandwidth performance modeling and benchmarking (e.g., https://software.intel.com/en-us/forums/topic/480004), and of course a lot has been written about the multi-threaded STREAM benchmark.
So here I am, trying to understand the multi-threaded, or rather multi-core, performance of the L3 cache (too many posts about performance analysis start this way ;).
Let's say I want to check the speed of SSE2 vector transfers to the registers from various cache levels:
__m128d mread_sse2(double *addr, long size)
{
    long i = 0;
    __m128d v1, v2, v3, v4;
    v1 = v2 = v3 = v4 = _mm_setzero_pd();          /* zero the accumulators */
    while (i < size) {
        v1 = _mm_add_pd(v1, _mm_load_pd(addr + i + 0));
        v2 = _mm_add_pd(v2, _mm_load_pd(addr + i + 2));
        v3 = _mm_add_pd(v3, _mm_load_pd(addr + i + 4));
        v4 = _mm_add_pd(v4, _mm_load_pd(addr + i + 6));
        i += 8;
    }
    /* sum the accumulators so the compiler cannot optimize the loads away */
    v1 = _mm_add_pd(_mm_add_pd(v1, v2), _mm_add_pd(v3, v4));
    return v1;
}
On all recent architectures this code runs at 0.5 cycles per double when the data is in the L1 cache (size=1024). Stepping up to the L3 cache, single thread, size=102400:
E5-2697 v3 (Hsw), 3600 MHz Turbo    0.93 cycle/double
i7-4800MQ  (Hsw), 3700 MHz Turbo    0.67 cycle/double
E5-2670 v2 (Ivb), 3300 MHz Turbo    0.87 cycle/double
E7-4870    (Wsm), 2800 MHz Turbo    0.98 cycle/double
Looks good to me (I do not dare ask for a model ;), although at this point I wonder why the mobile i7-4800MQ is so much better than the E5?
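Just to put these numbers into more familiar units (my own arithmetic, using the Turbo frequencies listed above):

0.93 cycle/double  ->  8 B / 0.93 ≈ 8.6 B/cycle  ≈ 31 GB/s of L3 read bandwidth per core at 3.6 GHz
0.67 cycle/double  ->  8 B / 0.67 ≈ 11.9 B/cycle ≈ 44 GB/s per core at 3.7 GHz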
But the real question is: what happens if I run the code on all cores of a single CPU? The OpenMP code is
#pragma omp parallel
{
    void *_ptr;
    double *ptr, time;
    posix_memalign(&_ptr, 64, sizeof(double) * size);
    ptr = (double *)_ptr;
    mzero_sse2(ptr, size);
#pragma omp barrier
    tic();
    for (int t = 0; t < nt; t++) {
        mread_sse2(ptr, size);
    }
    /* #pragma omp barrier */
    time = toc();
    {
        double ghz = (double)atoi(argv[3]);
        printf("clocks per double %lf\n", time * ghz / (size * nt));
    }
}
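The helpers tic()/toc() and mzero_sse2() are defined in the attached source, which is authoritative; a minimal sketch of what they are assumed to look like here (a per-thread microsecond timer and an SSE2 zero-fill, so that time * MHz gives core clock cycles) is:

#include <emmintrin.h>
#include <sys/time.h>

static __thread double tic_us;               /* per-thread start time (GCC __thread) */

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return 1e6 * tv.tv_sec + tv.tv_usec;
}

static void tic(void)   { tic_us = now_us(); }
static double toc(void) { return now_us() - tic_us; }

/* Zero the thread-private array once before timing, so every line is
   resident and in a known coherence state when the measurement starts. */
static void mzero_sse2(double *addr, long size)
{
    __m128d z = _mm_setzero_pd();
    for (long i = 0; i < size; i += 2)
        _mm_store_pd(addr + i, z);
}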
Each thread allocates its own private, aligned array and initializes it before the tests. The number of repetitions is large enough that the execution time is a few seconds, so thread creation and other OpenMP overhead should not play a role. Note the lack of a barrier after the iteration loop: all threads are allowed to finish whenever they can, so the clocks/double result is a range of values:
E5-2697 v3 (Hsw), 14 cores    0.99 - 1.12 cycle/double
i7-4800MQ  (Hsw),  4 cores    0.76 - 0.78 cycle/double
E5-2670 v2 (Ivb), 10 cores    0.99 - 1.01 cycle/double
E7-4870    (Wsm), 10 cores    1.33 - 1.38 cycle/double
The per-core bandwidth is systematically 15-20% worse than when only one core is running. On Westmere the degradation is even larger.
I have verified that the results are the same when I run, e.g., 10 instances of a single-threaded test and bind the processes to different cores, so this is definitely not OpenMP-related. It seems to be either the OS or the hardware.
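For reference, the binding of the separate single-threaded instances can also be done from inside the program instead of via GOMP_CPU_AFFINITY or taskset; a minimal Linux-only sketch (not part of the attached benchmark) would be:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single core; purely illustrative. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}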
Could anyone tell me why that is?
Thank you!
Marcin
I thought I would attach the source for your convenience. Compile with
gcc -std=c99 -O2 -msse2 -fopenmp cache_test.c -o cache_test
run as
OMP_NUM_THREADS=1 GOMP_CPU_AFFINITY=0 ./cache_test 102400 100000 3700
where the first parameter is the size of the array (in doubles), the second is the number of repetitions, and the third is the CPU clock frequency in MHz.
Hi,
There are two points:
1. Was it ensured that the amount of memory fetched in both cases, i.e. the single-threaded and the multi-threaded versions, was the same?
2. Contention for a shared cache is a well-known fact. A search for "Contention Shared memory multicore" should give a lot of pointers.
Hope this helps.
Rakhi
Any system with a shared L3 cache will show some degradation in performance due to contention for shared resources. For a bandwidth test like this, the details will depend on the topology of the on-chip interconnect, the effectiveness of the address hashing in eliminating "hot spots", and the ability of the cores to tolerate the increased latency of the L3 hits as the load increases.
On my Xeon E5-2680 (Sandy Bridge EP), I typically see throughput increases of between 7.0 and 7.5 when running 8 threads on an L3-contained array -- corresponding to a decrease in per-thread performance of 7%-14% relative to the single-thread case. (Single-thread performance is similar to the above -- slightly over 8 Bytes/cycle, or just under 1 cycle/double.) This seems like very good performance to me, and seems to be in pretty good agreement with the Haswell and Ivy Bridge results above.
The Westmere-based Xeon E7 system shows poorer scaling in the results above. I have not tested these systems personally, but based on the block diagram in Fig 1-1 of the Intel Xeon Processor E7 Family Uncore Performance Monitoring Guide (document 325294-001), it looks like the L3 is based on a single LLC Coherence Engine (rather than the ring-based approach used in the Sandy Bridge EP and newer products). It is extremely challenging to build a monolithic cache controller that can handle the coherence and data requirements for 10 cores, so it is not at all surprising that this shows some scaling losses. The results above correspond to speedups of 7.1x to 7.4x, which is pretty impressive for a monolithic controller.
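For clarity, those speedup figures follow directly from the tables above:

speedup = cores x (single-thread cycle/double) / (all-cores cycle/double)
        = 10 x 0.98 / 1.38 ≈ 7.1    and    10 x 0.98 / 1.33 ≈ 7.4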
John D. McCalpin wrote:
On my Xeon E5-2680 (Sandy Bridge EP), I typically see throughput increases of between 7.0 and 7.5 when running 8 threads on an L3-contained array -- corresponding to a decrease in per-thread performance of 7%-14% relative to the single-thread case. (Single-thread performance is similar to the above -- slightly over 8 Bytes/cycle, or just under 1 cycle/double.) This seems like very good performance to me, and seems to be in pretty good agreement with the Haswell and Ivy Bridge results above.
This is good to know, indeed. I wonder, do you have any idea why the i7-4800MQ performs so much better than the E5-2697 v3? Both are Haswell chips, but on the laptop, reading from L3 delivers almost the L1 bandwidth. Stunning.
I am also curious about a comment of yours on the E7 architecture that I found elsewhere:
I don't think that the non-temporal stores make much performance difference for the Xeon E7 parts.
The E7 I mentioned does see a proportional bandwidth increase in STREAM when compiled with non-temporal stores. However, I have also tested the recent E7-8857 v2, and that one indeed shows no performance benefit. The first configuration is a 4-socket system, the second an 8-socket one; apart from the chip generation, that is the most important difference, I guess.
The Intel Core i7-4800MQ (Haswell) uses the "client" uncore. The client uncore typically has a monolithic L3 controller, since it only needs to support 4 cores. It is not surprising that a monolithic controller can provide lower latency and higher bandwidth than a ring -- but only for a small number of cores.
The Intel Xeon E5-2697 v3 (Haswell EP) uses a double-ring uncore. Looking at Figures 1-1, 1-2, and 1-3 of the Xeon E5-2600 v3 Uncore Performance Monitoring Guide (document 331051), it appears that this 14-core part uses the double-ring with (up to) 8 cores on the primary ring and (up to) 10 cores on the secondary ring. I have not done any testing on these 14+-core parts yet, but the results you are obtaining are very similar to the results I obtained on the 12-core Xeon E5-2690 v3 -- which also uses a double-ring uncore (though with fewer processors on the secondary ring). The ring topology scales well on the single ring of the Xeon E5 v1 and v2 parts, but the double-ring is likely to experience additional contention at the ring interfaces, as well as experiencing more contention because each cache line has to be transferred over a larger number of "hops" (so the average link utilization goes up).
My comment on the impact of non-temporal stores on the Intel Xeon E7 parts was based on experience with the earlier Nehalem-based Xeon 7500 series processors, and I see similar behavior with the Xeon E5-4600 series (Sandy Bridge EP) in 4-socket systems. I have not tested either the Westmere-based or Ivy-Bridge-based Xeon E7 processors, so I can't speculate on what might be the same or different....
The Xeon Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring Reference Manual (document 329468) shows two different die configurations -- a single ring with up to 10 cores for the Xeon E5-2600 v2 products, and a dual ring with up to 15 cores for the Xeon E7-8800 v2 products.
On the other hand, http://www.anandtech.com/show/7285/intel-xeon-e5-2600-v2-12-core-ivy-bridge-ep shows three different die layouts for the Ivy Bridge EP -- a single ring with up to 6 cores, a single ring with up to 10 cores, and a dual-ring with 12 cores. The diagrams on the anandtech site (and similar diagrams at computerbase.de) started showing up at the same time as the announcement of the Xeon E5 v2 at IDF13, but I can't find these diagrams on Intel's web site.
