Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

L1-, L2- and L3-bandwidth of E5-2680 v3 (i.e., Haswell)

Paul_S_
Beginner

Hi all,

I am currently investigating the L1, L2 and L3 bandwidths of our latest Haswell
CPU (Xeon E5-2680 v3). The L1, L2 and L3 sizes of this CPU are 32 KiB, 256 KiB
and 30 MiB, respectively.

I am using a SAXPY-like kernel (i.e., X += Y) to measure the bandwidth. Please
find the benchmark results attached.

According to the Intel Optimization Manual
(http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html),
the L1 and L2 bandwidths should be (64+32) bytes per cycle (i.e., 240 GiB/s at a frequency of 2.5 GHz) and 64 bytes per
cycle (i.e., 160 GiB/s at a frequency of 2.5 GHz), respectively.

However, the results suggest that the L1, L2 and L3 bandwidths top out at about 170
GiB/s, 54 GiB/s and 32 GiB/s, respectively, which is far below the theoretical values.

I am using the latest Intel compiler (16.0.1) with the following compiler flags:
-O3 -xhost. Moreover, Intel SpeedStep and Intel Turbo Boost are disabled, so the frequency should be fixed at 2.5 GHz.

I also verified that the compiler generates efficient vectorized
code; it even unrolls the loop by a factor of 2.

What could be the reasons for the "low" attained bandwidths?
Why does the L1 bandwidth increase at the very beginning?

Please note that these are reproducible results.

For the sake of completeness, I have added the code below:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free, __rdtsc */

/* N (number of floats per array) is a compile-time constant, set per test case. */

void kernel(float * restrict a, float * restrict b) {
  for (int i = 0; i < N; i++) {
    a[i] += b[i];
  }
}

int main() {
   int bOffs = N;
   while (bOffs % 32 != 0) bOffs += 1;          /* pad so that b is also 32-byte aligned */
   float * a = _mm_malloc(sizeof(float) * 2 * bOffs, 32);
   memset(a, 0, 2 * bOffs * sizeof(float));
   float * b = a + bOffs;
   uint64_t min = (uint64_t) 1e18;
   int innerLoop = 10e6 / (3*N*sizeof(float));  /* ~10 MB of traffic per timed repetition */
   if (innerLoop < 1) innerLoop = 1;
   for (int t = 0; t < 1000; t++) {
      uint64_t s = __rdtsc();
      for (int g = 0; g < innerLoop; g++) {
         kernel(a, b);
      }
      s = __rdtsc() - s;
      if (s < min) min = s;                     /* keep the fastest of 1000 repetitions */
   }
   double bwUsed = innerLoop*3*bOffs*sizeof(float);   /* bytes moved per timed repetition */
   double timeSpent = min / 2.5e9;                    /* seconds, assuming a fixed 2.5 GHz */
   /* print working-set size in KiB and bandwidth in GiB/s */
   printf("%f %f\n", 2*N*sizeof(float)/(double)1024, bwUsed/timeSpent/1024/1024/1024);
   _mm_free(a);
   return 0;
}

Thank you!

jimdempseyatthecove
Honored Contributor III

One issue you are overlooking is that the published specifications for the L1 and L2 cache latencies are for cache-only latency.

While the read portion of "a[i] += b[i]" is (or may be) held within the cache, the write portion is not only writing to the cache; it is also limited by the number of cache lines that can be buffered for writing at the same time.

What you might want to do for performance testing is to test the read performance against the write performance.

For read performance, I suggest performing a simd loop reduction(+:b_sum) on the array b (and use the result so that the compiler does not optimize the loop out of existence).

For write performance, do a simd loop that sets the array to a value.

Also, your kernel should have the appropriate assume aligned pragma.
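
For example, something along these lines (just a rough sketch; the omp simd pragmas assume the compiler's OpenMP 4.0 SIMD support is enabled):

float read_kernel(const float * restrict b, int n) {
  __assume_aligned(b, 32);
  float b_sum = 0.0f;
  #pragma omp simd reduction(+:b_sum)
  for (int i = 0; i < n; i++)
    b_sum += b[i];
  return b_sum;   /* use the result so the loop is not optimized away */
}

void write_kernel(float * restrict a, float value, int n) {
  __assume_aligned(a, 32);
  #pragma omp simd
  for (int i = 0; i < n; i++)
    a[i] = value;
}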

Jim Dempsey

McCalpinJohn
Honored Contributor III
  1. The apparent increase in bandwidth at the left side of the chart is due to the timer overhead and possibly a mispredicted branch overhead.
    1. The RDTSC instruction takes about 25 core cycles (depending on the processor model and the operating frequency).   I am not exactly sure how you are scaling things, but if you print the raw cycle counts you should be able to see whether a ~25 cycle timer overhead is negligible.  (A small sketch of how to estimate this overhead is included after this list.)
    2. The Haswell core has a very clever branch predictor.  For a loop that is executed repeatedly with the same loop trip count, the branch predictor will "remember" the trip count and correctly predict that the conditional branch will *not* be taken on the final iteration.  However, this only works for loop trip counts up to values in the range of 32 to 38 iterations.   Loops with more iterations will take an additional ~20 cycle branch mispredict penalty on the final loop iteration.  You will need to look carefully at the assembly code to determine how many loop iterations are going to occur -- the value depends on how much the compiler unrolls the loop.
  2.  If I am reading your code correctly, you are computing and printing bandwidth in GiB/s, while the peak values (which you label as GiB/s) are actually in decimal GB/s.  This correction is not huge, but it is not negligible either.   Your best reported value of a bit over 171 GiB/s is about 184 GB/s.
  3. Using this corrected bandwidth value, your best result should correspond to 32KiB/184GB/s = 178ns (445 cycles), while the minimum theoretical value would be 32KiB/240GB/s = 137ns (341 cycles).   This loop should be long enough to pay the mispredicted branch penalty of ~20 cycles, so I would estimate a corrected cycle count of 445-25-20=400 cycles = 160 ns, which is (137/160)=86% of the expected performance.
    1. This pretty clearly shows that you are running the right code -- any alignment issues would cause much larger degradations.
    2. I don't know of any way to model store performance on Haswell.  As a rule of thumb, understanding performance losses that are less than 1/8 of peak can be extremely difficult.  86% of peak is awfully close to 8/9 of peak, so it may make sense to declare victory and move on.
  4. The L2-to-L1 bandwidth on Haswell is indeed 64 bytes/cycle, but the documentation does not make it clear that in any given cycle the L1 Data Cache can either provide data to the core or receive data from the L2 -- not both!   So it takes a minimum of 2 cycles to get 64 Bytes of data from the L2 through the L1 to the core's registers.
    1. Stores are hard to model -- I don't think that there is enough published information to know how the ports are set up on the L2 cache at this level of detail, and there is certainly not enough information to understand the timing of the cache transactions that follow stores.  The stored data is held in a store buffer for a while before being sent to the L1 cache, and when an L1 re-load causes a dirty line to be evicted to the L2, we don't know anything about the timing of the various transactions.
    2. Even loads are a little tricky on Haswell.   The L2 hardware prefetchers are activated by sequences of *accesses* to the L2 -- even if there are no misses (!!!) -- and the L2 cache tags appear to be single-ported.   So in any cycle the L2 cache can either service a load request from the L1 *or* it can query the tags for an L2 HW prefetch, but it cannot do both.   In my experiments of read-only L2 accesses at rates approaching one cache line per cycle, the L2 prefetches appear to block the load requests from the L1 about 30% to 40% of the time, but the number of L2 prefetches generated varies widely across iterations of the test.  Overall, these extra stalls reduce the load rate from the expected value of ~50% of 1 cache line per cycle to about ~40% of one cache line per cycle.  Flipping these numbers around gives a load rate of one cache line every 2.5 cycles from the L2, or ~64 GB/s at 2.5 GHz.
    3. Your value of ~54 GB/s for L2-contained SAXPY is about 84% of the maximum load rate that I have been able to sustain, which seems pretty good considering the extra complexities that are introduced by the stores & writebacks from L1 to L2.
  5. L3 bandwidth is more complicated than L2 bandwidth :-(
    1. The interface bandwidth is nominally 32 Bytes/cycle (at the "System Agent" interface between the core (+L1+L2) and the "ring").
    2. The L2 hardware prefetchers will be active, but the algorithm that the hardware uses to determine whether L2 HW prefetches will bring the data into the L2 cache or leave it in the L3 is not documented.  (When the data is coming from memory, L2 HW prefetches will bring the data into the L2 if the L2 is not "too busy", otherwise they will only bring the data into the L3.  It is not clear how this feature applies to data that is already in the L3.)
    3. If the L2 hardware prefetchers bring the data into the L2, then every cache line has to be written to the L2 (by the HW prefetch) and later read from the L2 (by an L1 HW prefetch or by a demand load or RFO) -- effectively halving the L2 bandwidth.
    4. Your result of ~33 GB/s corresponds to about 13 Bytes/cycle.  This is pretty close to the best bandwidth that I see for read-only kernels -- approaching (but not quite reaching) 16 Bytes per cycle.  I am still trying to understand the details, but without more specific information about the L2 implementation, it is not clear that it is possible to understand much more about how the L3 works.
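
As a footnote to point 1.1, here is a small sketch (just an illustration, not my actual test harness) of how to estimate the RDTSC overhead by timing back-to-back RDTSC pairs and keeping the minimum:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>   /* __rdtsc with the Intel compiler */

int main(void) {
    uint64_t min = (uint64_t)-1;
    /* Time many empty start/stop pairs; the minimum difference is roughly
       the fixed overhead that gets added to every timed measurement. */
    for (int t = 0; t < 100000; t++) {
        uint64_t s = __rdtsc();
        uint64_t e = __rdtsc();
        if (e - s < min) min = e - s;
    }
    printf("minimum back-to-back RDTSC delta: %llu cycles\n",
           (unsigned long long)min);
    return 0;
}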

I hope some of this helps!

Paul_S_
Beginner

Thank you both for your valuable comments.

I have attached a new plot showing the read, write and read-write bandwidth
separately. Using these kernels:

float kernel_read(const float * restrict a) {
  __assume_aligned(a,32);
  float sum = 0.0f;
  for (int i = 0; i < N; i++) {
    sum += a[i];
  }
  return sum;
}
void kernel_write(float * restrict a, float b) {
  __assume_aligned(a,32);
  for (int i = 0; i < N; i++) {
    a[i] = b;
  }
}
void kernel(float * restrict a, float * restrict b) {
  __assume_aligned(a,32);
  __assume_aligned(b,32);
  for (int i = 0; i < N; i++) {
    a[i] += b[i];
  }
}

Notice that I have corrected the bandwidth to Bytes/cycle.

__assume_aligned() improved the situation for very small sizes.

@John McCalpin: Let me briefly comment on your points.
(1) The rdtsc overhead should be negligible since the innermost loop
always moves the same amount of data (e.g., for the 8 KiB test case the rdtsc overhead is amortized over 813 iterations of the innermost loop).
(2) Thanks for pointing this out; I'm using bytes/cycle now.
(3) Looking at the 8 KiB test case more carefully shows the following: (a) the saxpy
kernel achieves 75.14 bytes/cycle (i.e., 327 cycles) as opposed to the optimum
of 96 bytes/cycle (i.e., 256 cycles); (b) the loop has a total
of 32 iterations, hence there should be no branch misprediction, right?
All in all this means that the 8 KiB test case only reaches 75.14/96 = 78.27%
of peak. Taking a branch misprediction into account would leave us
with 256/307 = 83.4% of peak.
(4) - (5) Thank you, that was very instructive.

The new plot shows some interesting features:

1) The write-only bandwidth for L1 reaches its optimum of 32 bytes/cycle.
2) The read bandwidth does not reach its optimum of 64 bytes/cycle and is stuck
at 32 bytes/cycle. Why?
3) The read bandwidth is higher than the read-write bandwidth for sizes
larger than 32 KiB but smaller than 128 KiB. Why?


Thank you for your help.

McCalpinJohn
Honored Contributor III

You should double-check your alignment -- the Haswell L1 Data Cache can service two loads per cycle only if neither of them crosses a cache line boundary.

But you probably already did that....

The next problem is sneakier:  If you are using a simple sum reduction kernel for the read-only case you will run into the FP add limit of the processor.   Both ports 0 and 1 can execute FMA instructions (or Multiplies), but only port 1 can issue FP Add instructions.   This will limit you to one load per cycle even if the code is unrolled far enough to tolerate the FP Add pipe latency.  (I.e., at least 3 partial sum variables).

The easiest way to work around this is to switch to an SDOT kernel to allow FMAs to issue to both pipelines.   This increases the latency to 5 cycles and doubles the throughput, so you need 10 partial sum variables.   The compiler will not always generate this many, and it is hard to understand exactly how it makes its decisions.   For simple loops like this the compiler will typically follow your lead if you use intrinsics and specify 10 or 12 partial sums.  For more complex loops the compiler often rewrites intrinsic-based code in incomprehensible ways, so you need to pay careful attention to the assembly code.
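
A rough sketch of what an intrinsics version with 12 partial sums might look like (this is only an illustration, not tuned code; it assumes 32-Byte-aligned inputs and a length that is a multiple of 96 floats):

#include <immintrin.h>

float sdot(const float *a, const float *b, int n) {
    __m256 acc[12];
    for (int k = 0; k < 12; k++)
        acc[k] = _mm256_setzero_ps();

    /* 12 independent FMAs per pass keep both FMA ports busy while
       covering the 5-cycle FMA latency */
    for (int i = 0; i < n; i += 12 * 8) {
        for (int k = 0; k < 12; k++) {
            __m256 va = _mm256_load_ps(a + i + 8 * k);
            __m256 vb = _mm256_load_ps(b + i + 8 * k);
            acc[k] = _mm256_fmadd_ps(va, vb, acc[k]);
        }
    }

    /* reduce the 12 partial sums (keep this outside the timed region) */
    __m256 s = acc[0];
    for (int k = 1; k < 12; k++)
        s = _mm256_add_ps(s, acc[k]);
    __m128 lo = _mm256_castps256_ps128(s);
    __m128 hi = _mm256_extractf128_ps(s, 1);
    __m128 s4 = _mm_add_ps(lo, hi);
    s4 = _mm_hadd_ps(s4, s4);
    s4 = _mm_hadd_ps(s4, s4);
    return _mm_cvtss_f32(s4);
}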

It is hard to be sure what is happening for the cases that are slightly larger than L1 -- I saw surprisingly large variability (~15%) when I was doing tests that pushed the L2 read rate to near its limit.  This was correlated with large variability in the number of L2 prefetches (from about zero percent of lines prefetched to nearly 100% of lines prefetched).   The crazy thing was that the number of L2 prefetches issued in an iteration of the test depended on how long it had been since the last 1 millisecond Linux kernel timer interrupt.   Immediately after returning from the timer interrupt, the L2 HW prefetcher generated prefetches for almost 100% of the cache lines.  The number of L2 prefetches decreased in subsequent iterations, typically dropping to ~20%-30% by the time the next timer interrupt occurred.  Sometimes it would drop all the way to zero.   Execution time increased by about 0.3 to 0.4 cycles for each of the L2 HW prefetches.  The difference between the L2_TRANS events (0xF0) (which include retries) and the L2_RQSTS events (0x24) (which don't include retries) followed these values, suggesting that the L2 prefetches were causing the L1 demand load requests to be retried at the L2, which led to increased execution time.   Disabling the L2 HW prefetchers gave performance that matched the best cases -- the few times that essentially zero L2 HW prefetches were issued.

Paul_S_
Beginner

Yes, it's quite subtle.

I changed the assembly code to use FMAs; this pushed the L1 read BW to ~45 bytes/cycle. Furthermore, increasing the number of partial sums to 12 and removing the final reduction of the partial sums increased L1 read BW to ~51 bytes/cycle.

At this point I'm satisfied with the results. Thanks again.
