topic Questions: in Intel® Moderncode for Parallel Architectures

Cache performance differences between nodes running same application

Page__Mike — Fri, 31 Aug 2018 17:36:29 GMT

I am running a large application on an academic cluster (Haswell) and have observed performance differences between nodes running the exact same application on the exact same input data. I came to this point by comparing runs for a short simulated time vs. a long simulated time and noticed that the time per iteration varied depending on the node (which may answer the question of why I am repeatedly running the same code with the same data).

Within the cluster now I have identified a set of 'slow' nodes and a smaller set of 'fast' nodes. I have been instrumenting a triply-nested set of lengthy floating point calculations with PAPI performance monitoring and have the following, slow node vs. fast node:

Total Instructions: 295627 vs. 295627
Level 1 cache misses: 36258 vs. 36340
Level 2 cache misses: 22726 vs. 12015
Level 3 cache misses: 21771 vs. 1454
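
To make the gap easier to see, the counts above can be turned into slow/fast ratios. This is only a small arithmetic sketch in Python; the dictionary and function names are made up for illustration, and the numbers are copied from the list above.

```python
# Counter values copied from the post: (slow node, fast node).
counts = {
    "instructions": (295627, 295627),
    "L1 misses": (36258, 36340),
    "L2 misses": (22726, 12015),
    "L3 misses": (21771, 1454),
}

def slow_to_fast_ratios(table):
    """Return slow/fast for every counter in the table."""
    return {name: slow / fast for name, (slow, fast) in table.items()}

ratios = slow_to_fast_ratios(counts)
for name, ratio in ratios.items():
    print(f"{name:13s} slow/fast = {ratio:.2f}")
```

The instruction counts and L1 misses are essentially identical on both nodes, while the L2 and especially the L3 miss counts are not, which is why the question focuses on the outer cache levels.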

The code being measured is:

do k = lo(3), hi(3)
  do j = lo(2), hi(2)
    do i = lo(1), hi(1)
      y(i,j,k) = alpha*a(i,j,k)*x(i,j,k) &
        - dhx * (bX(i+1,j,k)*(x(i+1,j,k) - x(i  ,j,k)) &
        &      - bX(i  ,j,k)*(x(i  ,j,k) - x(i-1,j,k))) &
        - dhy * (bY(i,j+1,k)*(x(i,j+1,k) - x(i,j  ,k)) &
        &      - bY(i,j  ,k)*(x(i,j  ,k) - x(i,j-1,k))) &
        - dhz * (bZ(i,j,k+1)*(x(i,j,k+1) - x(i,j,k  )) &
        &      - bZ(i,j,k  )*(x(i,j,k  ) - x(i,j,k-1)))
    end do
  end do
end do

Is there a hardware explanation for this difference in cache performance?

Page__Mike — Fri, 31 Aug 2018 17:41:15 GMT

Additional measurement, slow vs. fast

Inst TLB misses: 325 vs. 444
Data prefetch cache misses: 8111 vs. 8

Page__Mike — Fri, 31 Aug 2018 17:55:01 GMT

Very important:

I am running the code in serial mode with exclusive use of the node.

McCalpinJohn — Fri, 31 Aug 2018 19:21:02 GMT

Questions:

Are the "slow" nodes slow every time you run on them, or just slow for the duration of one job?
Do the nodes have Transparent Huge Pages enabled?
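
For the second question, a minimal sketch of one way to check on a Linux node is shown here. It assumes the kernel exposes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content looks like `always [madvise] never` with the active mode in brackets. The function names are hypothetical helpers, not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Standard sysfs location on Linux kernels built with THP support (assumption:
# the node runs such a kernel; otherwise the file is simply absent).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def parse_thp_mode(text):
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

def current_thp_mode(path=THP_ENABLED):
    """Read the sysfs file; return None if it is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None
```

Calling `current_thp_mode()` on a slow node and on a fast node and comparing the results would show whether the two differ in this setting.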

If the slow nodes are only slow for the duration of a run, then this looks like a standard cache conflict due to unlucky combinations of physical addresses.

Your data is consistent with cache conflicts in either the L2 or L3.
Conflicts in the L2 cache are extremely common when using 4KiB pages.
- They arise because 3 bits of the L2 cache index are translated from contiguous virtual addresses to (pseudo-random) physical addresses. The cache can only hold its full capacity if the addresses being used are mapped to pages whose physical addresses have a completely uniform distribution of the values in these 3 bits.
- No L2 cache index bits are translated when using 2MiB pages, so this type of cache conflict cannot happen (with contiguous addresses) when using 2MiB pages.
- Enabling Transparent Huge Pages is the easiest way to use 2MiB pages, but it is also possible to pre-allocate the 2MiB pages and use them for memory allocated by mmap(), shmget(), or by using libhugetlbfs.
Conflicts in the L3 cache are extremely rare in Haswell systems (due to the high associativity of the L3), but if an L3 conflict occurs, each cache line evicted from the L3 must be evicted from the L2 caches first.
- L3 cache indexing on Haswell is extremely complex, deliberately undocumented, and has only been "reverse-engineered" for a small number of configurations.
In either case, if slowdown is rare (but lasts the duration of the job), the easiest approach is to note that the job is running slow, kill it, and restart it.
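
The "3 bits" figure for the L2 can be checked with a little arithmetic. The sketch below assumes the Haswell L2 geometry of 256 KiB, 8 ways, and 64-byte lines, and assumes that all sizes are powers of two; the function name is made up for illustration.

```python
def translated_index_bits(cache_bytes, ways, line_bytes, page_bytes):
    """Count the set-index bits that lie above the page offset.

    Those bits come from the physical page number, so they are not
    controlled by the (contiguous) virtual address. All arguments are
    assumed to be powers of two.
    """
    sets = cache_bytes // (ways * line_bytes)
    line_offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1               # log2(number of sets)
    page_offset_bits = page_bytes.bit_length() - 1   # log2(page size)
    top_of_index = line_offset_bits + index_bits     # one past the highest index bit
    return max(0, min(index_bits, top_of_index - page_offset_bits))

KIB, MIB = 1024, 1024 * 1024
# Assumed Haswell L2 geometry: 256 KiB, 8-way, 64-byte lines -> 512 sets.
bits_4k = translated_index_bits(256 * KIB, 8, 64, 4 * KIB)
bits_2m = translated_index_bits(256 * KIB, 8, 64, 2 * MIB)
print(bits_4k, bits_2m)
```

With these assumptions the index uses address bits 6 through 14. A 4KiB page fixes only bits 0 through 11, leaving bits 12 through 14 to the physical page number, while a 2MiB page fixes bits 0 through 20 and so covers the whole index.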

jimdempseyatthecove — Sat, 01 Sep 2018 12:32:50 GMT

If virtualization is being used on the server, your "node" might not have exclusive use of the L3.

Jim Dempsey

Page__Mike — Mon, 03 Sep 2018 13:11:00 GMT

A sysadmin is looking into BIOS settings on some slow nodes now. I have asked him to look at the Transparent Huge Pages setting.

We are not using virtualization on these nodes.

Page__Mike — Wed, 05 Sep 2018 17:00:34 GMT

We have determined that BIOS settings varied across the cluster and were causing the performance differences.

A new set will be uniformly installed across the cluster as part of maintenance.

Testing the new settings shows that performance is uniform and even improved.

McCalpinJohn — Wed, 05 Sep 2018 19:42:03 GMT

Been there, done that, got the T-shirt....

(I did not really get a T-shirt....)

Page__Mike — Wed, 05 Sep 2018 19:56:41 GMT

All the t-shirt needs is a pithy slogan.

How does one make BIOS pithy or interesting?