
Deploying AI Everywhere at Netflix

MaxTerry
Employee

At Intel Innovation 2023, Amer Ather, Cloud and Studio Performance Engineer at Netflix, joined Vrushabh Sanghavi, Intel Sr. Software Engineer and End-to-End AI Manager, to discuss the value and performance acceleration that the combination of Intel® Xeon® processors and Intel® Software brings to the entire AI lifecycle.


Ather spoke of the optimization work happening at Netflix on video encoding and downsampling using the Intel® oneAPI Deep Neural Network Library (oneDNN), how AI inference is used at Netflix for video delivery and recommendations, why Intel® Xeon® CPUs are the platform of choice for cost-effectiveness and flexibility across a wide range of services, and how the collaboration between Intel and Netflix on profiling and architectural analysis helps break through performance bottlenecks.

This session dug into the technical details of how the Netflix Performance Engineering Team approaches the challenge of continuously improving the viewer experience while lowering cloud and streaming costs, and the techniques it uses to get the full benefit of Intel hardware capabilities through Intel software optimizations.

The latest Intel software will help you realize the best performance on Intel hardware–from CPUs with built-in acceleration to multiarchitecture systems of CPUs, GPUs, FPGAs, and other accelerators. For example, developers can get over a 10x performance boost to AI applications with just a single-line code change when using Intel’s extensions for several industry-standard frameworks ranging from TensorFlow and PyTorch to Scikit-learn and Pandas.
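To make the "single-line code change" concrete, the sketch below shows the pattern with Intel Extension for Scikit-learn. The guarded import is an illustrative convention of this post, not from the session; when the extension is absent, the code falls back to stock scikit-learn unchanged.

```python
# Illustrative sketch: enabling Intel Extension for Scikit-learn.
# A single patch_sklearn() call reroutes supported scikit-learn
# estimators to oneDAL-accelerated implementations; everything
# downstream is ordinary, unchanged scikit-learn code.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()          # the "single-line code change"
    ACCELERATED = True
except ImportError:
    ACCELERATED = False      # stock scikit-learn is used instead

print("Intel acceleration active:", ACCELERATED)
```

The same drop-in idea applies to the TensorFlow and PyTorch extensions: application code stays framework-native while the optimized kernels do the work underneath.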

Watch a replay of this and other Intel Innovation sessions here! Some key takeaways:

The importance of understanding the full end-to-end AI pipeline to extract maximum performance.

Intel’s vision is to bring AI everywhere by addressing end-to-end application performance, rather than the performance of select DL or ML kernels in isolation. AI comprises a vast and complex set of workloads, including data preparation, data processing, classical machine learning, training, fine-tuning, inference, and managing and moving structured and unstructured data. Successful deployment requires a holistic, system-level approach.

Too often, AI deployments fail due to a lack of focus on the full end-to-end AI pipeline and a lack of awareness of the tools available to bridge the gaps in the pipeline and free developer resources from the costly, proprietary platforms the business has become locked into.

While Generative AI, LLMs, and GPU scarcity have dominated recent news, most AI workloads, such as data processing and analysis, are general-purpose workloads ideally suited for CPUs. In addition, 4th Gen Intel® Xeon® Scalable processors deliver special hardware support in the form of Intel® Accelerator Engines such as Intel® Advanced Matrix Extensions (Intel® AMX) and optimized instruction set architectures such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) to accelerate AI workflows with varying characteristics. Together with Intel’s end-to-end AI software suite, Intel Xeon processors cover the full end-to-end pipeline: data engineering; model creation, optimization, and tuning; and deployment.
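On Linux, you can verify whether a host exposes these instruction sets by inspecting its CPU feature flags. The helper below is a hypothetical utility (not part of any Intel tool) that parses `/proc/cpuinfo`-style text; the sample flags line is illustrative of a 4th Gen Xeon.

```python
def isa_features(cpuinfo_text: str) -> set:
    """Return the ISA feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith(("flags", "Features")):
            return set(line.split(":", 1)[1].split())
    return set()

# Sample line resembling a Sapphire Rapids (4th Gen Xeon) flags entry.
SAMPLE = "flags\t\t: fpu sse2 avx2 avx512f avx512_vnni amx_tile amx_int8"
feats = isa_features(SAMPLE)
print("AVX-512:", "avx512f" in feats,
      "| VNNI:", "avx512_vnni" in feats,
      "| AMX:", "amx_tile" in feats)
```

On a real machine, pass in `open("/proc/cpuinfo").read()` to check the actual host.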

Netflix provides many examples of how enterprise developers can take advantage of the full Intel AI software portfolio, Intel Xeon CPU instruction sets, and analysis tools such as VTune™ Profiler to optimize a wide range of applications. 

[Figure: end-to-end AI pipeline]


Optimizing Video Encoding for Improved Viewer Experience

As Ather noted, “In today’s digital world, performance is more important than ever. Users want applications to be robust and responsive. Netflix subscribers expect a rich and optimal video experience from the Netflix streaming service.” 

“The Netflix app is hosted on a variety of devices, each with its own unique capabilities, access profile, and requirements. These devices are operated under varying network conditions. Performance optimization and end-to-end reliability are critical for delivering quality content to our members across 190 countries. An optimized user interface, personalized recommendations, efficient streaming, and a vast catalog of engaging content define the Netflix streaming experience.”

Ather detailed how downscaling via neural networks is the first step in Netflix’s machine learning journey to improve video encoding quality. Downscaling is typically performed by traditional filters, such as Lanczos. However, data-driven, neural-network-based approaches to downsampling have shown great potential for improving encoded video quality.
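For context on the traditional baseline: a Lanczos downscaler is a windowed-sinc convolution. The 1-D sketch below is purely illustrative (production pipelines use FFmpeg's vectorized implementations, not Python), showing the Lanczos kernel and an edge-clamped downscale by an integer factor.

```python
import math

def lanczos_kernel(x: float, a: int = 3) -> float:
    """Lanczos windowed-sinc kernel: sinc(x) * sinc(x/a) for |x| < a."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def lanczos_downscale(samples, factor=2, a=3):
    """Downscale a 1-D signal by an integer factor with a Lanczos-a filter."""
    out = []
    for i in range(len(samples) // factor):
        center = i * factor + (factor - 1) / 2.0   # source-space position
        base = int(math.floor(center))
        acc = wsum = 0.0
        for j in range(base - a + 1, base + a + 1):
            w = lanczos_kernel(center - j, a)
            s = samples[min(max(j, 0), len(samples) - 1)]  # clamp at edges
            acc += w * s
            wsum += w
        out.append(acc / wsum)                      # normalize the taps
    return out
```

A learned downsampler replaces this fixed kernel with convolution weights trained against encode-quality metrics.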

“Our whole encoding pipeline runs on Intel Xeon processors in the cloud. With the help of Intel engineering, FFmpeg, VMAF (Video Multimethod Assessment Fusion), and other relevant encoding libraries are vectorized to take full advantage of the instruction-level parallelism provided by AVX-512, the speedup of matrix multiplication (a common operation in neural networks) provided by VNNI, and the more recent AMX acceleration built into the Sapphire Rapids platform (4th Gen Intel Xeon Scalable processors).”

Downsampler, the convolutional neural network that Netflix recently started using on a limited scale, consists of 10 convolutional layers designed specifically for downsampling for adaptive video streaming. The first five layers are “preprocessing blocks” and the last five are “downscaling blocks.” The preprocessing blocks learn pertinent information before feeding it to the downscaling blocks, which perform the actual downscaling.

According to Ather, Downsampler A/B testing reported better encoded video quality across various encoders and upsamplers, and the results scored well across several quality metrics; in human testing, 77% of viewers favored Downsampler’s video quality. “We saw an up to 2x performance boost for various encoding jobs with oneDNN and VNNI/AVX-512 enabled,” he noted, adding that performance improvements in the form of reduced CPU hours or frames-per-second (fps) speedups to encode a title mean huge savings in cloud infrastructure cost. Ather has elaborated on this in a previous blog post.
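The cost argument is simple arithmetic: encode time scales inversely with fps, so a 2x fps speedup halves the CPU-hours billed per title. The numbers below are hypothetical, chosen only to illustrate the relationship.

```python
def encode_cpu_hours(frames: int, fps: float) -> float:
    """CPU-hours for one encoding job processing `frames` at `fps`."""
    return frames / fps / 3600.0

# Hypothetical 2-hour title at a 24 fps source rate = 172,800 frames.
frames = 2 * 3600 * 24
baseline  = encode_cpu_hours(frames, fps=5.0)    # unoptimized encoder
optimized = encode_cpu_hours(frames, fps=10.0)   # with a 2x fps speedup
print(f"baseline {baseline:.1f} h, optimized {optimized:.1f} h, "
      f"saved {100 * (1 - optimized / baseline):.0f}%")
```

Multiplied across a catalog of titles, encoders, and resolutions, halving per-title CPU-hours compounds into the infrastructure savings Ather describes.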

[Figure: encoding performance]

Optimizing AI Inference for Improved Content Recommendations

A second goal at Netflix is to anticipate viewer preferences. Netflix uses machine learning extensively to recommend relevant titles to global audiences and customize member home pages. Every aspect of the home page is an evidence-driven, A/B-tested experience backed by Machine Learning.

One example is the Adaptive Row Ordering Service, which creates personalized row ordering to make it easy for viewers to discover relevant content. Models select new content to suggest to viewers based on viewing history, country, and language, among other factors. Another example is Evidence Service, used for ranking and selecting assets – such as artwork, synopses, and video clips – to serve with program information on home pages based on user behaviors.

Ather noted that “at Netflix, inference services are primarily hosted on Intel Xeon processors, as it’s more practical and cost-effective than GPUs.” He explained that Netflix services use performant Java-based inference with little penalty for JNI calls into TensorFlow, and have a pure Java implementation of XGBoost for inference. “Feature encoding and generation is a good proportion of our end-to-end pipeline, all written in Java. Thus, offloading inference tasks to GPU will increase cost with minor latency wins.”
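This reasoning is Amdahl's law applied to cost: if inference is only a modest slice of a pipeline dominated by Java feature encoding, even a large accelerator speedup yields a small end-to-end win. The fractions below are hypothetical, used only to illustrate the shape of the trade-off.

```python
def end_to_end_speedup(accel_fraction: float, accel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of a pipeline is accelerated."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical split: 80% feature encoding/generation, 20% model inference.
# Even a 10x inference speedup from a GPU barely moves end-to-end latency,
# while the accelerator's cost applies to the whole deployment.
print(f"{end_to_end_speedup(0.2, 10.0):.2f}x end-to-end")
```

Under those assumed fractions the overall gain is only about 1.2x, which is the "minor latency wins" Ather weighs against the added GPU cost.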

Using Intel Profiling Tools to Identify and Resolve Performance Bottlenecks

The Netflix Performance Engineering team’s charter is to help Netflix reduce cloud and streaming costs while continuing to improve the viewer experience. To accomplish this goal, Ather explained, “Netflix is working closely with Intel as our strategic technology partner in optimizing our cloud infrastructure and software stacks.”

As part of that collaboration, Netflix has successfully used Intel profiling tools – VTune™ Profiler, Intel® PerfSpect, and Intel® Process Watch – to uncover regressions and boost the performance of production workloads. 

“Intel's rich profiling tools…offer powerful insights into system performance. We have used these tools’ versatile capabilities to uncover regressions and boost performance in test and production environments.”

Ather particularly noted VTune’s ability to program the performance monitoring units (PMUs) on Intel processors to trace hardware events, uncover difficult-to-debug issues, and successfully root-cause them. A Netflix blog highlighted one such success story, describing how VTune’s powerful observability helped Netflix isolate the cause of a regression seen while migrating and consolidating critical microservices onto large cloud instances.

Ather noted that VTune makes it easy to examine how the code executes and identify areas that can be optimized to improve CPU and memory usage.

Watch a replay of the whole session here!

Learn more about how you can maximize performance on the latest Intel hardware and take advantage of diverse accelerator architectures with Intel® software developer tools powered by oneAPI.

Additional Resources

Intel Optimization at Netflix. Netflix Performance Engineering work… | by Amer Ather | Medium

Seeing through hardware counters: a journey to threefold performance increase | by Netflix Technology Blog | Netflix TechBlog

AI and Machine Learning (intel.com)


About the Author
Technology marketing leader with a passion for connecting customers to Intel value!