I am running a large application on an academic cluster (Haswell nodes) and have observed performance differences between nodes running the exact same application on the exact same input data. I discovered this by comparing runs over short and long simulated times: the time per iteration varied depending on the node (which perhaps answers the question of why I am repeatedly running the same code with the same data).
Within the cluster I have now identified a set of 'slow' nodes and a smaller set of 'fast' nodes. I have instrumented a triply-nested set of lengthy floating-point calculations with PAPI performance counters and measured the following (slow node vs. fast node):
Total Instructions: 295627 vs. 295627
Level 1 cache misses: 36258 vs. 36340
Level 2 cache misses: 22726 vs. 12015
Level 3 cache misses: 21771 vs. 1454
The code being measured is:
do k = lo(3), hi(3)
   do j = lo(2), hi(2)
      do i = lo(1), hi(1)
         y(i,j,k) = alpha*a(i,j,k)*x(i,j,k) &
            - dhx * (bX(i+1,j,k)*(x(i+1,j,k) - x(i  ,j,k)) &
                   - bX(i  ,j,k)*(x(i  ,j,k) - x(i-1,j,k))) &
            - dhy * (bY(i,j+1,k)*(x(i,j+1,k) - x(i,j  ,k)) &
                   - bY(i,j  ,k)*(x(i,j  ,k) - x(i,j-1,k))) &
            - dhz * (bZ(i,j,k+1)*(x(i,j,k+1) - x(i,j,k  )) &
                   - bZ(i,j,k  )*(x(i,j,k  ) - x(i,j,k-1)))
      end do
   end do
end do
- Are the "slow" nodes slow every time you run on them, or just slow for the duration of one job?
- Do the nodes have Transparent Huge Pages enabled?
If the slow nodes are only slow for the duration of a run, then this looks like a standard cache conflict due to unlucky combinations of physical addresses.
- Your data is consistent with cache conflicts in either the L2 or L3.
- Conflicts in the L2 cache are extremely common when using 4KiB pages.
- They arise because 3 bits of the L2 cache index are translated from contiguous virtual addresses to (pseudo-random) physical addresses. The cache can only hold its full capacity if the addresses being used are mapped to pages whose physical addresses have a completely uniform distribution of the values in these 3 bits.
- No L2 cache index bits are translated when using 2MiB pages, so this type of cache conflict cannot happen (with contiguous addresses) when using 2MiB pages.
- Enabling Transparent Huge Pages is the easiest way to use 2MiB pages, but it is also possible to pre-allocate the 2MiB pages and use them for memory allocated by mmap(), shmget(), or by using libhugetlbfs.
- Conflicts in the L3 cache are extremely rare in Haswell systems (due to the high associativity of the L3), but if an L3 conflict occurs, each cache line evicted from the L3 must be evicted from the L2 caches first.
- L3 cache indexing on Haswell is extremely complex, deliberately undocumented, and has only been "reverse-engineered" for a small number of configurations.
- In either case, if slowdown is rare (but lasts the duration of the job), the easiest approach is to note that the job is running slow, kill it, and restart it.
A sysadmin is looking into BIOS settings on some of the slow nodes now. I have asked him to look at the Transparent Huge Pages setting.
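For reference, the current THP mode can also be read from userspace via sysfs on each node (the bracketed value is the active mode), which avoids a trip into the BIOS:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled
# typically prints something like: always [madvise] never
```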
We are not using virtualization on these nodes.
We have determined that BIOS settings varied across the cluster and were causing the performance differences.
A new set of BIOS settings will be installed uniformly across the cluster as part of maintenance.
Testing the new settings shows that performance is uniform and even improved.