We recently deployed several hundred identical servers, each with a Supermicro dual-socket X10DRT-HIBF motherboard, two E5-2670v3 processors, and 8 x ATP (Samsung) 8GB DDR4-2133 Reg ECC 1R 1.2V DIMMs. We are running the same STREAM memory benchmark on all the nodes, and on 6 nodes the results are approximately 12% lower than on the others. The degradation appears on both processors. We have tried swapping out the processors with ones from a server that passed the benchmark, and we did the same with the DIMMs. We have verified that the BIOS is identical, and we are booting from a common OS image over the network.
Our next step is to replace the motherboards in these servers; however, we are troubled that we do not have an explanation for the slower memory performance on these 6 nodes. What other factors could affect these memory tests? Could there be an issue with the QPI link? Does this happen from time to time? Why would replacing the motherboard correct this issue? Any feedback is welcome.
Jeff Friedman - Sales Engineer
There are lots of possibilities, but it is often a challenge to know what information to gather....
If these are running Linux, then I may be able to offer some suggestions. Windows is SEP (http://hitchhikers.wikia.com/wiki/Somebody_Else's_Problem_field).
You may want to supplement the STREAM tests with runs using the "Intel Memory Latency Checker" (https://software.intel.com/en-us/articles/intelr-memory-latency-checker). The Intel code is a bit more self-contained than STREAM, and may avoid environment mis-configuration issues.
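Whichever benchmark is used, it can help to make the node-to-node comparison explicit before swapping hardware. The following is only a minimal sketch of that idea, assuming the per-node bandwidth figures have already been collected into a dictionary. The function name `flag_slow_nodes`, the node names, the 10% threshold, and the bandwidth numbers are all hypothetical illustrations, not measured data from this cluster.

```python
from statistics import median


def flag_slow_nodes(results, threshold=0.10):
    """Return {node: fractional shortfall} for nodes whose bandwidth is
    more than `threshold` below the median of all nodes.

    `results` maps a node name to a bandwidth figure (for example the
    STREAM Triad rate in MB/s). The median is used as the baseline so
    that a handful of slow nodes does not drag the reference value down.
    """
    baseline = median(results.values())
    slow = {}
    for node, bandwidth in results.items():
        shortfall = (baseline - bandwidth) / baseline
        if shortfall > threshold:
            slow[node] = round(shortfall, 3)
    return slow


# Hypothetical Triad results in MB/s (illustrative values only).
example = {
    "node001": 100000.0,
    "node002": 100500.0,
    "node003": 88000.0,
    "node004": 99500.0,
    "node005": 100200.0,
}

print(flag_slow_nodes(example))
```

With these made-up numbers, only `node003` is reported, since it sits about 12% below the median. Running the same comparison on the output of a second tool would show whether the same nodes are flagged by both.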
Thanks for the suggestions. Due to time constraints on getting this cluster up and running, we replaced the motherboards on the slow nodes, and the bandwidth tests then all passed. I will keep the latency-checker tool in mind for future issues, should we have any. For these most recent incidents, I guess I'll chalk it up to physics :)