Applying ITIL Principles to Data Center Facilities Risk Management

AengusNolan · ‎08-04-2023

Nearly a decade ago, Intel IT adapted the change, incident and problem management modules of the Information Technology Infrastructure Library (ITIL) framework for use in tracking data center issues. As far as we know, this is an unusual application of ITIL, which IT departments (including ours) generally use to manage software-related services.

Same Game, Different Configuration Items

In the ITIL world, the things you track are called Configuration Items (CIs), and ITIL provides a framework for formalizing the tracking process. Typically, CIs are software applications, client devices and anything else required to deliver a service such as the IT help desk or virtual desktop infrastructure. But in the data center facilities world, the individual units or components that we track are mechanical and electrical equipment such as computer room air conditioning (CRAC) units, generators, automatic transfer switches on the data center’s uninterruptible power supply (UPS) systems, servers, cables and so on.

ITIL wasn’t really designed for electrical/mechanical hardware, and we didn’t have a blueprint – we were starting from scratch and it took considerable effort. One of the first things we had to decide was exactly what the CIs were going to be. The UPS, generator, chilled water system and CRAC units were obvious CI choices, but how deep into the systems should we go?

For example, the CRAC units are fed chilled water by pipework. If the pipes burst, that’s a single point of failure, but would that make the pipework a good choice for tracking as a CI? What about the pumps that drive the water to the CRAC unit? At first, we included them in our CI tracking. However, there are thousands of pumps, and there is enough redundancy that the failure of a single pump poses little risk. So, eventually we stopped tracking the pumps as CIs. As a best practice, we focus on single points of failure and on equipment whose scheduled maintenance introduces a significant level of risk in the data center.
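
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not Intel's tooling: the `Equipment` class, its two flags and the sample values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    name: str
    single_point_of_failure: bool      # hypothetical flag: no redundant path exists
    risky_scheduled_maintenance: bool  # hypothetical flag: the maintenance itself adds notable risk


def should_track_as_ci(item: Equipment) -> bool:
    """Track equipment that is a single point of failure or has risky scheduled maintenance."""
    return item.single_point_of_failure or item.risky_scheduled_maintenance


# Sample values are illustrative only, not real assessments.
candidates = [
    Equipment("UPS", single_point_of_failure=True, risky_scheduled_maintenance=True),
    Equipment("Generator", single_point_of_failure=False, risky_scheduled_maintenance=True),
    Equipment("Chilled water pump", single_point_of_failure=False, risky_scheduled_maintenance=False),
]

tracked = [item.name for item in candidates if should_track_as_ci(item)]
print(tracked)
```

In practice the two flags would come from an engineering assessment of each system; the point of the sketch is only that a redundant pump with low-risk maintenance drops out of the tracked list.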

Our Tracking System in Action

To make our use of ITIL a bit more tangible, let’s look at some examples of how we use our tracking system to identify and mitigate risk in data center facilities.

Change tracking. We have a system that lists the CIs for each of Intel’s 54 data centers around the world. One of the data points for each CI is the frequency of scheduled maintenance (considered a “change” in ITIL terminology) for that CI. One of our reports compares the projected number of changes to the changes that actually occurred. For example, perhaps we should have seen 40 maintenance change events but only 20 occurred. This type of mismatch alerts us to a potential risk, although there may be a good reason for it (such as a remote data center scheduling fewer maintenance events due to logistics). We can investigate further to ensure that the data center manager is actively communicating with the Corporate Services division (which is responsible for electrical and mechanical equipment).
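
A report of this kind might be sketched as follows. This is a minimal example under stated assumptions: the `ConfigurationItem` fields, the `maintenance_gaps` function, the 80% threshold and the sample data are all hypothetical and are not taken from our actual system.

```python
from dataclasses import dataclass


@dataclass
class ConfigurationItem:
    site: str
    name: str
    expected_changes: int  # scheduled maintenance events projected for the period
    actual_changes: int    # change records actually logged for the period


def maintenance_gaps(cis, min_ratio=0.8):
    """Return (ci, ratio) pairs whose actual/expected ratio is below min_ratio.

    The 0.8 default is an arbitrary threshold chosen for illustration.
    """
    gaps = []
    for ci in cis:
        if ci.expected_changes == 0:
            continue  # nothing was scheduled, so there is nothing to compare
        ratio = ci.actual_changes / ci.expected_changes
        if ratio < min_ratio:
            gaps.append((ci, ratio))
    # Largest shortfall first
    return sorted(gaps, key=lambda pair: pair[1])


# Hypothetical sample data
inventory = [
    ConfigurationItem("Site A", "UPS-1", expected_changes=40, actual_changes=20),
    ConfigurationItem("Site A", "CRAC-3", expected_changes=12, actual_changes=12),
    ConfigurationItem("Site B", "Generator-2", expected_changes=4, actual_changes=3),
]

for ci, ratio in maintenance_gaps(inventory):
    print(f"{ci.site} {ci.name}: {ci.actual_changes}/{ci.expected_changes} ({ratio:.0%})")
```

A flagged CI is a prompt for a conversation, not a verdict: a remote site may legitimately have fewer maintenance events than projected.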

Global learnings and controlled change. On occasion, a vendor issues a recall for a certain part (such as an electrical component that has been known to overheat and trip). This part may have been in the field for 8 to 10 years, and we need to find all instances of it in all of Intel’s data centers. We work closely with Corporate Services, which maintains a database of record that tracks when systems are due for maintenance and how old the systems are. That database helps us locate instances of the recalled part. As another example, a breaker in our electrical distribution network at one site failed unexpectedly. In our proactive model of risk management, we don’t just assume this is an isolated incident. Such events undergo an After Action Review (AAR), and we investigate whether that make and model of breaker is in use in any other Intel data centers and, if so, work to replace it. In this case, to replace the breaker we had to use our A/B redundancy and bring down one side of the data center power supply while keeping everything online. In this way, we control the partial outage, do it on our schedule, and prevent an unplanned outage in the future.
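
The lookup for a recalled or failed part across sites could look something like the sketch below. The `InstalledPart` record, the `find_affected` function, and the makes, models and site names are hypothetical; the real database of record belongs to Corporate Services and its schema is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InstalledPart:
    site: str
    ci_name: str
    make: str
    model: str
    install_year: int


def find_affected(parts, make, model):
    """Group installed parts matching the given make and model by site.

    Matching is case-insensitive because inventory data is often entered inconsistently.
    """
    by_site = defaultdict(list)
    for part in parts:
        if part.make.lower() == make.lower() and part.model.lower() == model.lower():
            by_site[part.site].append(part)
    return dict(by_site)


# Hypothetical sample data
parts_inventory = [
    InstalledPart("Site A", "Main switchboard", "ExampleCo", "BRK-100", 2014),
    InstalledPart("Site A", "CRAC-3", "OtherCo", "FAN-7", 2019),
    InstalledPart("Site B", "Distribution panel 2", "ExampleCo", "brk-100", 2015),
]

affected = find_affected(parts_inventory, "ExampleCo", "BRK-100")
for site, parts in sorted(affected.items()):
    for part in parts:
        print(f"{site}: {part.ci_name} (installed {part.install_year})")
```

Grouping the results by site makes it easier to plan one controlled replacement window per data center.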

Acceptable or unavoidable risk. Sometimes, an identified risk simply cannot be removed, or perhaps it would cost too much to remove compared to the level of risk it introduces. For example, it used to be standard practice that every Intel data center had a big, red emergency power-off (EPO) button. It could be used to turn off the entire power supply, such as when a fire crew needed to shut the electricity off for safety’s sake. In some cases, the EPO button is no longer necessary. For example, local codes may have changed, or our support model may have changed so that the data center is now staffed 24/7 with someone who can escort the fire crew to the breaker room, where the IT load can be safely turned off. These decades-old EPO buttons are difficult to adequately secure beyond a glass cover and good signage, and we cannot easily test them without significant disruption to our data center users. This makes them a risk item. However, it would be difficult to remove them from a location that requires the data center to be online 24/7, and it may not be cost-effective to do so.

In Summary

Even though we have used ITIL for data center management for years, we are now more proactive in our tracking and response to incidents. Instead of waiting for an incident to happen and then reacting, we now track an incident while it’s ongoing, and we also track what happens after the incident. We look for correlations between that incident and others and perform a deep dive into what happened, who responded and what went right or wrong.

Read the IT@Intel white paper, “Data Center Facilities Risk Management,” for more information on how we use ITIL, along with detailed data center audits, to reach proactive, data-driven risk decisions.