Author: Zane Ball
The 4th Gen Intel Xeon processor (formerly codenamed Sapphire Rapids) is a big generational leap in technology, packed with powerful new CPU cores and built-in acceleration for critical workloads. While these leaps in performance will always be the headline, I want to dive a little deeper into what many datacenter operators care about even more than performance: reliability.
Our newest processor is designed and manufactured to be a rock-solid foundation for even the largest scale computing applications with the toughest reliability requirements. In developing this processor, we focused on three critical improvement areas: memory subsystem reliability, system stability, and in-fleet management capabilities.
With the introduction of DDR5 technology, we fully updated our Memory Reliability/Availability/Serviceability (RAS) architecture. With a new, higher-performance Error Correction Code (ECC) capability and new RAS features such as Permanent Fault Detection (PFD), we can handle memory errors significantly better than on past platforms. Just as important, we engineered the memory interface to enable 75 fully validated memory configurations, including 2 DIMMs per channel operation, which allows more cost-effective, larger-capacity configurations. Intel’s industry leadership in extensive margin testing, customer enablement tools, and partnerships with leading DRAM vendors ensure an exacting standard for memory reliability.
System stability goals for the design were raised dramatically compared with prior generations. We demonstrated a whopping 200,000 resets without a single failure at our internal scale testing facility prior to production release. The new 4th Gen Intel Xeon Scalable processors also mark the debut of a next-generation manufacturing test platform that increases initial silicon quality, which can be especially important in large-scale installations.
Finally, we provided new ways to manage the reliability of the fleet for months and years after installation. This includes advanced in-field testing capabilities to maintain reliability over time with minimal service interruptions. With this processor we piloted Intel Platform Monitoring Technology (Intel PMT), a newly developed telemetry framework for both in-band and out-of-band manageability, in our internal scale testing facility to collect gigabytes of telemetry data throughout validation. The platform also allows most typical firmware updates to be delivered without a system reset. We also updated debug tools to diagnose and resolve issues remotely, even rare or sporadic problems that exist at the statistical margins of a large fleet.
I’m thrilled with the results of the 4th Gen Intel Xeon Scalable processor design. I believe we have built the highest quality, most reliable datacenter platform in our history.
About the author
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center Platform Engineering & Architecture (DPEA) group. DPEA owns end-to-end engineering for Intel’s data center business and is responsible for designing and validating the latest data center platforms and enabling Intel’s customers to ramp and deploy platforms at scale.
Prior to his data center role, Ball was Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group. Ball has also served as a VP of the Client Computing Group including roles as GM of the desktop client business and as GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He holds six patents in high-speed electrical design. You can connect with him on LinkedIn and Twitter.
Very informative! Thank you for sharing.