Applying Reliability Engineering to the Manufacturing IT Environment

PatrickEnnis · 08-16-2023

As part of Intel’s Manufacturing IT team, I have been involved in a concerted effort to make factory automation more resilient by using reliability engineering (RE) to maintain service levels during unexpected failures.

Intel currently operates many fabrication facilities (fabs) around the world. As the company ramps up its Integrated Device Manufacturing (IDM) 2.0 strategy, it is substantially scaling up its global manufacturing capabilities and capacity. With even more at stake than in the past, unplanned factory downtime is simply not acceptable. In fact, the goal is to increase factory availability to 99.99% (four nines). To achieve this goal, we sought a disciplined, principles-based approach to system reliability and resilience.
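To put the four nines target in perspective, here is a minimal sketch of the arithmetic. It assumes availability is measured over a full 365-day calendar year and that every minute counts equally; the article does not say how factory availability is actually measured, so the function name and the measurement period are illustrative assumptions only.

```python
# Illustrative arithmetic only: the measurement period (a 365-day year)
# is an assumption, not something stated in the article.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime allowed in a period at a given availability."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare two, three and four nines side by side.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four nines leaves a budget of roughly 52.6 minutes of downtime per year, which is why each additional nine is so much harder to reach than the last.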

Bringing Reliability Engineering to the Factory

Manufacturing IT has many teams. Some, like mine, are operations-centric. Other teams are focused on infrastructure or applications. While factory resilience and uptime have always been goals, many reliability improvements to our systems were achieved reactively, with teams focused on preventing repeat incidents. Starting in 2019, we explored how to significantly move the needle on availability by avoiding incidents, mitigating them or reducing their impact. A major “aha!” moment occurred when we created an RE discipline that had an integrated view of all the services maintained by Manufacturing IT, across operations, infrastructure and applications.

We examined external and internal best practices for RE and realized that the principles being used, such as pursuing a proactive improvement approach to operations — expecting things to fail and responding resiliently — were directly applicable to our systems. In particular, the emergence of the Reliability Engineer role was a critical factor.

We approached the Manufacturing IT Directors with our idea of adding a similar Reliability Engineer role to our headcount. We demonstrated how Reliability Engineers play a critical role in identifying common failure modes, developing standards, designing solutions to lower the risk of failure, and advocating for the importance of resilience alongside new feature delivery. Convinced of the efficacy of such an approach, the Directors approved the formation of a small team of dedicated Reliability Engineers.

An Overview of Our Resiliency Maturity Model

To understand exposure to failure, the Reliability Engineers analyzed common failure modes across manufacturing operations, utilizing the Failure Mode and Effects Analysis (FMEA) methodology to anticipate potential issues and failures. Examples of common failure modes include “database purger/archiving failures leading to performance impact” and “inadequate margin to tolerate typical hardware outages.” The Reliability Engineers also identified systems that were most likely to cause factory impact due to risk from these shared failure modes. This data helped inform a Resiliency Maturity Model (RMM), which scores each common failure mode on a scale from 1 to 5 based on a system’s resilience to that failure mode. This structured approach enabled us to not just fix isolated examples of applications that were causing the most problems, but to instead broaden our impact and develop a reliability mindset.
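As a rough illustration of how a maturity model like this can be put to work, the sketch below records a 1 to 5 score for each system and failure mode pair and lists the weakest pairs first. The system names, the scores, the threshold and the assumption that 1 means least resilient are all hypothetical; the article does not describe the RMM's actual scoring rules or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One RMM-style score: how resilient a system is to one failure mode."""
    system: str
    failure_mode: str
    score: int  # Assumption: 1 = least resilient, 5 = most resilient.

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be between 1 and 5, got {self.score}")


def weakest_first(assessments, threshold=3):
    """Return assessments scoring below the threshold, lowest score first."""
    gaps = [a for a in assessments if a.score < threshold]
    return sorted(gaps, key=lambda a: (a.score, a.system, a.failure_mode))


# Hypothetical systems and scores; the failure mode names come from the article.
ASSESSMENTS = [
    Assessment("system-a", "database purger/archiving failures", 2),
    Assessment("system-a", "inadequate margin for hardware outages", 4),
    Assessment("system-b", "database purger/archiving failures", 5),
    Assessment("system-b", "inadequate margin for hardware outages", 1),
]

if __name__ == "__main__":
    for gap in weakest_first(ASSESSMENTS):
        print(f"{gap.system}: {gap.failure_mode} (score {gap.score})")
```

Scoring the same failure modes across every system is what makes the results comparable, so improvement work can be ranked across the whole portfolio rather than application by application.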

Using the RMM and working closely with the Technology Development organization, we identified over 200 resilience improvement projects and added them to our development roadmap for the next two years. We are developing a better understanding of where apps are working, where they are constrained and where they are at risk of failure. We are also achieving efficiency gains that help us scale these apps in support of IDM 2.0. But the RMM is not a one-and-done effort for our Reliability Engineers. Our team also continues system modernization through increased observability and better methods of deployment using continuous integration/continuous delivery (CI/CD) with automation.

Taking Reliability Engineering to the Next Level

Our results with the RMM are solid proof that RE and a structured approach to resilience and reliability improvement are effective. Still, there is more work to be done by our team. Next, we are targeting change resilience and observability.

Change implementation on our Manufacturing IT systems is a persistent source of factory automation system failures. This is due to the large number of changes needed to deliver new features to the factories, along with the increasing complexity of our systems and the integrated nature of all the core applications. Reliability Engineers are taking the lead on improving resilience to change failure, with CI/CD as a key pillar. CI/CD brings efficiencies to our change implementation while solidifying change quality. Because CI/CD automates our change installs, the human toil of change is significantly reduced or eliminated. That in turn allows us to scale up our factories without needing a proportional increase in staff to support high-quality Manufacturing IT change installs.
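To make the idea of an automated change install concrete, here is a minimal sketch of a pipeline that runs its steps in order and stops at the first failure. The step names, the `run_change` function and the rollback hook are hypothetical; the article does not describe the actual CI/CD tooling or say whether automated rollback is part of it.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_change(steps: List[Step],
               rollback: Callable[[], None]) -> Tuple[bool, List[str]]:
    """Run steps in order; on the first failure, call rollback and stop."""
    log: List[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            rollback()  # Assumption: the change can be reverted automatically.
            log.append("rollback: done")
            return False, log
    return True, log


if __name__ == "__main__":
    # Hypothetical steps; real ones would call build, deploy and test tooling.
    steps: List[Step] = [
        ("pre-install health check", lambda: True),
        ("install change", lambda: True),
        ("post-install verification", lambda: False),  # simulated failure
    ]
    succeeded, log = run_change(steps, rollback=lambda: None)
    print("\n".join(log))
    print("change succeeded" if succeeded else "change reverted")
```

Encoding the checks in the pipeline means every change goes through the same gates regardless of who submits it, which is where both the quality and the staffing benefits come from.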

In the case of observability, what sort of dashboards, monitoring and traceability do the development and operations teams have? Reliability Engineers are working with the Technology Development organization on developing an Observability Maturity Model (OMM), which is similar to the RMM in that it’s a structured approach for assessing our observability coverage. It can help us pinpoint systems where we have opportunities to improve observability. This structured approach gives us a standard level of observability across all our critical systems and helps ensure that we adopt the best data analytics and observability tools and get the most value from them. Observability directly supports our resiliency work. It allows us to quickly detect failures, find root causes and expedite resolution, which further drives our four nines (99.99%) availability target.
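The link between faster detection and availability can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to restore, including detection time. The figures below are hypothetical and are not Intel data; the article does not state which availability model the team uses.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and to restore."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target: float) -> float:
    """Largest MTTR that still meets the target availability (0 < target < 1)."""
    return mtbf_hours * (1 - target) / target


if __name__ == "__main__":
    mtbf = 720.0  # Hypothetical: one incident every 30 days on average.
    # Shorter detection and diagnosis shrink MTTR, which raises availability.
    for mttr in (4.0, 1.0, 0.25):
        print(f"MTTR {mttr:>5.2f} h -> availability {availability(mtbf, mttr):.5%}")
    limit_minutes = max_mttr_hours(mtbf, 0.9999) * 60
    print(f"MTTR limit for four nines: {limit_minutes:.2f} minutes")
```

With failure frequency held constant, every minute removed from detection and diagnosis counts directly toward the availability target, which is why observability and resilience work reinforce each other.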

Want to Learn More?

Our adoption of RE has helped us transform from firefighting individual issues to creating a culture of resilience that can help ensure that Intel’s manufacturing processes can withstand and recover from failures. To learn more about our resiliency efforts, read the IT@Intel white paper, “Reliability Engineering Helps Intel Cut IT Manufacturing Systems Downtime in Half.”