I was wondering how the performance of the YASK stencil benchmarks varies with the snoop configuration mode on Haswell or Broadwell processors: Early Snoop vs. Home Snoop vs. Cluster-on-Die?
I have not run these benchmarks, but most stencil operations are bandwidth-limited, so they should benefit from the higher bandwidth of "Home Snoop" relative to "Early Snoop". If the implementation is NUMA-friendly (each thread mostly accesses memory attached to its own NUMA node), then "Cluster-on-Die" should provide an additional benefit.
The local bandwidth difference between "Home Snoop" and "Early Snoop" is not large, but there is a very big difference in remote bandwidth on the systems I have tested (mostly Xeon E5 v3 "Haswell EP"). The attached chart shows results I obtained using the Intel Memory Latency Checker on a 2-socket Xeon E5-2660 v3 system. NOTE that these are REMOTE bandwidth numbers only; the local bandwidth numbers are much, much closer.
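
To illustrate what "NUMA-friendly" can mean for a stencil code, here is a minimal sketch in C with OpenMP. It is not taken from YASK; the function names `alloc_first_touch` and `stencil3` are hypothetical, and it assumes a Linux-style first-touch page placement policy with threads pinned to cores (for example via `OMP_PROC_BIND=true`). The idea is to initialize the arrays with the same static loop schedule that the stencil sweep uses, so that with the same thread count each thread mostly reads and writes pages that were placed on its own NUMA node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialize them in parallel.
 * Under a first-touch policy (an assumption here), each page is placed on
 * the NUMA node of the thread that first writes to it. */
double *alloc_first_touch(size_t n, double value)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* Same static schedule and iteration space as the stencil loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = value;

    return a;
}

/* Hypothetical 3-point averaging stencil; boundary points are copied.
 * The loop covers the same index range as the initialization loop so the
 * static schedule assigns the same indices to the same threads. */
void stencil3(const double *in, double *out, size_t n)
{
    assert(n >= 3);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        if (i == 0 || i == (long)n - 1)
            out[i] = in[i];
        else
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```

Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially, in which case all pages end up on one node and none of the NUMA benefit applies. Only the neighbor reads at the edges of each thread's chunk cross into another thread's pages, which is why a layout like this should suit "Cluster-on-Die" reasonably well.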