Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Cache performance differences between nodes running same application

Page__Mike
Beginner

I am running a large application on an academic cluster (Haswell) and have observed performance differences between nodes running the exact same application on the exact same input data. I noticed this while comparing runs over a short simulated time vs. a long simulated time: the time per iteration varied depending on the node (which is why I keep re-running the same code on the same data).

Within the cluster I have now identified a set of 'slow' nodes and a smaller set of 'fast' nodes. I have been instrumenting a lengthy, triply-nested set of floating-point calculations with PAPI performance monitoring and get the following counts, slow node vs. fast node:

 Total Instructions:      295627 vs. 295627
 Level 1 cache misses:     36258 vs.  36340
 Level 2 cache misses:     22726 vs.  12015
 Level 3 cache misses:     21771 vs.   1454

The code being measured is:

    do       k = lo(3), hi(3)
       do    j = lo(2), hi(2)
          do i = lo(1), hi(1)
             y(i,j,k) = alpha*a(i,j,k)*x(i,j,k) &
                  - dhx * (bX(i+1,j,k)*(x(i+1,j,k) - x(i  ,j,k))  &
                  &      - bX(i  ,j,k)*(x(i  ,j,k) - x(i-1,j,k))) &
                  - dhy * (bY(i,j+1,k)*(x(i,j+1,k) - x(i,j  ,k))  &
                  &      - bY(i,j  ,k)*(x(i,j  ,k) - x(i,j-1,k))) &
                  - dhz * (bZ(i,j,k+1)*(x(i,j,k+1) - x(i,j,k  ))  &
                  &      - bZ(i,j,k  )*(x(i,j,k  ) - x(i,j,k-1)))
          end do
       end do
    end do
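
For reference, counts like the ones above can be gathered by bracketing this loop nest with a PAPI event set. A minimal sketch using the low-level Fortran (PAPIF) interface and the standard preset events follows; the event names, array size, and omitted error checks are illustrative rather than an exact copy of my instrumentation:

    ! Sketch only: the PAPI_* constants come from the Fortran include file shipped with PAPI.
    include 'f90papi.h'

    integer          :: retval, eventset
    integer(kind=8)  :: counts(4)

    retval = PAPI_VER_CURRENT
    call PAPIF_library_init(retval)            ! retval should come back as PAPI_VER_CURRENT

    eventset = PAPI_NULL
    call PAPIF_create_eventset(eventset, retval)
    call PAPIF_add_event(eventset, PAPI_TOT_INS, retval)   ! total instructions
    call PAPIF_add_event(eventset, PAPI_L1_TCM,  retval)   ! L1 total cache misses
    call PAPIF_add_event(eventset, PAPI_L2_TCM,  retval)   ! L2 total cache misses
    call PAPIF_add_event(eventset, PAPI_L3_TCM,  retval)   ! L3 total cache misses

    call PAPIF_start(eventset, retval)
    ! ... the triply-nested i/j/k loop above ...
    call PAPIF_stop(eventset, counts, retval)
    ! counts(1:4) now holds the four totals, in the order the events were added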

 
Is there a hardware explanation for this difference in cache performance?

8 Replies
Page__Mike
Beginner

Additional measurements, slow node vs. fast node:

Inst TLB misses:               325 vs. 444
Data prefetch cache misses:   8111 vs.   8

 

Page__Mike
Beginner

Very important:

I am running the code in serial mode with exclusive use of the node.

McCalpinJohn
Honored Contributor III

Questions:

  1. Are the "slow" nodes slow every time you run on them, or just slow for the duration of one job?
  2. Do the nodes have Transparent Huge Pages enabled?
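
(For question 2: the kernel reports the THP mode in /sys/kernel/mm/transparent_hugepage/enabled, with the active setting shown in brackets, e.g. "[always] madvise never". If it is easier to capture this from inside the job than to log in to each node, a trivial Fortran sketch along these lines works:)

    program check_thp
       implicit none
       character(len=128) :: line
       integer            :: ios
       open(unit=10, file='/sys/kernel/mm/transparent_hugepage/enabled', &
            status='old', action='read', iostat=ios)
       if (ios /= 0) then
          print *, 'THP sysfs entry not present (THP not built into this kernel?)'
       else
          read(10,'(A)') line            ! the bracketed entry is the active mode
          print *, 'transparent_hugepage: ', trim(line)
          close(10)
       end if
    end program check_thp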

If the slow nodes are only slow for the duration of a run, then this looks like a standard cache conflict due to unlucky combinations of physical addresses.

  • Your data is consistent with cache conflicts in either the L2 or L3. 
  • Conflicts in the L2 cache are extremely common when using 4KiB pages.
    • They arise because 3 bits of the L2 cache index come from the translated part of the address when contiguous virtual addresses are mapped to (pseudo-random) physical pages. (On Haswell the 256KiB, 8-way L2 has 512 sets, so its index uses address bits 6-14; with 4KiB pages only bits 0-11 are untranslated, leaving index bits 12-14 dependent on physical page placement.) The cache can only hold its full capacity if the addresses being used are mapped to pages whose physical addresses have a completely uniform distribution of the values in these 3 bits.
    • No L2 cache index bits are translated when using 2MiB pages, so this type of cache conflict cannot happen (with contiguous addresses) when using 2MiB pages. 
    • Enabling Transparent Huge Pages is the easiest way to use 2MiB pages, but it is also possible to pre-allocate the 2MiB pages and use them for memory allocated by mmap(), shmget(), or by using libhugetlbfs (see the mmap() sketch after this list).
  • Conflicts in the L3 cache are extremely rare in Haswell systems (due to the high associativity of the L3), but if an L3 conflict occurs, each cache line evicted from the L3 must be evicted from the L2 caches first.  
    • L3 cache indexing on Haswell is extremely complex, deliberately undocumented, and has only been "reverse-engineered" for a small number of configurations.
  • In either case, if slowdown is rare (but lasts the duration of the job), the easiest approach is to note that the job is running slow, kill it, and restart it.
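
For the pre-allocated 2MiB page route mentioned above, a bare-bones sketch of mapping an anonymous huge-page region from Fortran through the C mmap() call is below. The flag values are the usual Linux/x86-64 ones (normally taken from sys/mman.h), the huge pages must already be reserved (e.g. via /proc/sys/vm/nr_hugepages), and none of this is production code:

    program hugepage_alloc
       use iso_c_binding
       implicit none

       interface
          type(c_ptr) function c_mmap(addr, length, prot, flags, fd, offset) &
               bind(c, name="mmap")
             import :: c_ptr, c_size_t, c_int, c_long
             type(c_ptr),       value :: addr
             integer(c_size_t), value :: length
             integer(c_int),    value :: prot, flags, fd
             integer(c_long),   value :: offset          ! off_t on Linux/x86-64
          end function c_mmap
       end interface

       ! Linux/x86-64 flag values
       integer(c_int), parameter :: PROT_READ = 1, PROT_WRITE = 2
       integer(c_int), parameter :: MAP_PRIVATE = 2, MAP_ANONYMOUS = 32
       integer(c_int), parameter :: MAP_HUGETLB = 262144          ! 0x40000
       integer(c_size_t), parameter :: nbytes = 64_c_size_t * 2097152_c_size_t  ! 64 x 2MiB

       type(c_ptr)             :: p
       integer(c_intptr_t)     :: paddr
       real(c_double), pointer :: buf(:)

       p = c_mmap(c_null_ptr, nbytes, ior(PROT_READ, PROT_WRITE), &
                  ior(ior(MAP_PRIVATE, MAP_ANONYMOUS), MAP_HUGETLB), &
                  -1_c_int, 0_c_long)

       ! mmap() returns (void *)-1 on failure
       paddr = transfer(p, paddr)
       if (paddr == -1_c_intptr_t) stop 'mmap(MAP_HUGETLB) failed - are huge pages reserved?'

       ! View the mapping as a Fortran array that is backed entirely by 2MiB pages
       call c_f_pointer(p, buf, [nbytes / 8_c_size_t])
       buf = 0.0_c_double
    end program hugepage_alloc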

 

jimdempseyatthecove
Honored Contributor III

If virtualization is being used on the server, your "node" might not have exclusive use of the L3.

Jim Dempsey

Page__Mike
Beginner

 

A sysadmin is looking into the BIOS settings on some slow nodes now. I have asked him to check the Transparent Huge Pages setting.

We are not using virtualization on these nodes.

Page__Mike
Beginner

We have determined that BIOS settings varied across the cluster and were causing the performance differences.

A new set of BIOS settings will be installed uniformly across the cluster as part of maintenance.

Testing the new settings shows that performance is uniform and even improved.

McCalpinJohn
Honored Contributor III

Been there, done that, got the T-shirt....

(I did not really get a T-shirt....)

Page__Mike
Beginner

All the t-shirt needs is a pithy slogan.

How does one make BIOS settings pithy or interesting?

 
