I am running a large application on an academic cluster (Haswell) and have observed performance differences between nodes running the exact same application on the exact same input data. I noticed this by comparing runs over a short simulated time against runs over a long simulated time: the time per iteration varied depending on the node (which also answers the question of why I am repeatedly running the same code on the same data).
Within the cluster I have now identified a set of 'slow' nodes and a smaller set of 'fast' nodes. I have been instrumenting a triply-nested set of lengthy floating-point calculations with PAPI performance counters and have the following, slow node vs. fast node:
Total Instructions: 295627 vs. 295627
Level 1 cache misses: 36258 vs. 36340
Level 2 cache misses: 22726 vs. 12015
Level 3 cache misses: 21771 vs. 1454
The code being measured is:
do k = lo(3), hi(3)
   do j = lo(2), hi(2)
      do i = lo(1), hi(1)
         y(i,j,k) = alpha*a(i,j,k)*x(i,j,k) &
            - dhx * (bX(i+1,j,k)*(x(i+1,j,k) - x(i  ,j,k)) &
                   - bX(i  ,j,k)*(x(i  ,j,k) - x(i-1,j,k))) &
            - dhy * (bY(i,j+1,k)*(x(i,j+1,k) - x(i,j  ,k)) &
                   - bY(i,j  ,k)*(x(i,j  ,k) - x(i,j-1,k))) &
            - dhz * (bZ(i,j,k+1)*(x(i,j,k+1) - x(i,j,k  )) &
                   - bZ(i,j,k  )*(x(i,j,k  ) - x(i,j,k-1)))
      end do
   end do
end do
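For reference, the counters above were collected roughly as follows. This is a minimal sketch against the PAPI 5.x high-level Fortran interface (PAPIF_start_counters / PAPIF_stop_counters; PAPI 6 replaced this API), with the include-file name and error handling simplified. On Haswell the four events may not all fit in the available hardware counters at once, in which case they can be split across two runs or multiplexed.

program count_stencil
  implicit none
  include 'f90papi.h'    ! include name varies by install (f77papi.h / fpapi.h)
  integer, parameter :: nev = 4
  integer :: events(nev), check
  integer(kind=8) :: values(nev)

  ! The four counters reported above.
  events = (/ PAPI_TOT_INS, PAPI_L1_TCM, PAPI_L2_TCM, PAPI_L3_TCM /)

  call PAPIF_start_counters(events, nev, check)
  if (check /= PAPI_OK) stop 'PAPIF_start_counters failed'

  ! ... the triply-nested stencil loop shown above goes here ...

  call PAPIF_stop_counters(values, nev, check)
  if (check /= PAPI_OK) stop 'PAPIF_stop_counters failed'

  print '(a,i12)', 'Total instructions: ', values(1)
  print '(a,i12)', 'L1 cache misses:    ', values(2)
  print '(a,i12)', 'L2 cache misses:    ', values(3)
  print '(a,i12)', 'L3 cache misses:    ', values(4)
end program count_stencil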
If the slow nodes are only slow for the duration of a run, then this looks like a standard cache conflict caused by unlucky combinations of physical addresses: the L2 and L3 caches are physically indexed, and the physical pages backing the arrays are fixed when they are allocated, so a bad draw of page frames keeps the arrays colliding in the same cache sets for the whole run.
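One way to test this hypothesis would be to re-allocate the arrays repeatedly on a single node and count L3 misses for each fresh allocation; if the miss count swings widely between trials on the same node, physical page placement is implicated. A rough sketch (a simplified stencil stands in for the real kernel, and the same PAPI caveats as above apply):

program placement_test
  implicit none
  include 'f90papi.h'    ! include name varies by PAPI install
  integer, parameter :: n = 256, ntrial = 20
  real(kind=8), allocatable :: x(:,:,:), y(:,:,:)
  integer :: events(1), check, t, i, j, k
  integer(kind=8) :: misses(1)

  events(1) = PAPI_L3_TCM

  do t = 1, ntrial
     ! Fresh allocation each trial. Note the allocator may hand the same
     ! physical pages straight back, so holding on to earlier allocations
     ! (or interleaving random-sized padding allocations) may be needed to
     ! force a genuinely new placement.
     allocate(x(0:n+1,0:n+1,0:n+1), y(n,n,n))
     call random_number(x)

     call PAPIF_start_counters(events, 1, check)
     do k = 1, n
        do j = 1, n
           do i = 1, n
              y(i,j,k) = x(i+1,j,k) + x(i-1,j,k) + x(i,j+1,k) &
                       + x(i,j-1,k) + x(i,j,k+1) + x(i,j,k-1)
           end do
        end do
     end do
     call PAPIF_stop_counters(misses, 1, check)

     print '(a,i3,a,i12)', 'trial ', t, '  L3 misses: ', misses(1)
     deallocate(x, y)
  end do
end program placement_test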
A sysadmin is looking into BIOS settings on some slow nodes now. I have also asked him to look at the Transparent Huge Pages (THP) setting (exposed by the kernel via /sys/kernel/mm/transparent_hugepage/enabled), since with 2 MiB huge pages the OS's choice of page frames has much less influence on which cache sets the arrays map to.
We are not using virtualization on these nodes.
We have determined that BIOS settings varied across the cluster and that this variation was causing the performance differences.
A new, uniform set of BIOS settings will be installed across the cluster as part of maintenance.
Testing the new settings shows that performance is now uniform across nodes, and even improved.