Intel’s MLPerf Results Show Robust CPU-Based Training Performance For a Range of Workloads

MaryT_Intel · ‎07-28-2020

AT A GLANCE

The Intel® Xeon® Scalable platform is the foundation for AI, and its new Intel® Deep Learning Boost extensions accelerate deep learning training and inference.
For training, 3rd Gen Intel® Xeon® Scalable processors are well suited to large datasets and models, to intermittent or lower-priority batch jobs that use spare cycles on shared infrastructure, and to workloads such as transfer learning, high-definition computer vision, and recommender engines.
Benchmark results such as MLPerf help us provide practical guidance to our customers as to what performance they can expect for various applications and scenarios.

Our Intel software engineers work to ensure that Intel’s hardware innovations translate to practical improvements for customers in AI. In addition to optimizing widely used software to take full advantage of Intel hardware, we also measure our performance against industry-standard benchmarks, such as the Machine Learning (ML) Performance (MLPerf) benchmark suite.

Benchmark results help us provide practical guidance to our customers as to what performance they can expect for various applications and scenarios. They also provide useful information to our design and optimization teams. However, it’s important to note that, while they offer a glimpse of performance for a few workloads and scenarios, generalized benchmarks are just one factor to consider when deciding on the best infrastructure for your enterprise’s unique AI needs.

With the recent launch of 3rd generation Intel® Xeon® Scalable processors, we were excited to assess the new product family’s performance on the MLPerf benchmark suite.

Bfloat16: Raising the Bar for Training on CPUs

Intel Xeon Scalable processors are the industry standard for classic machine learning and deep learning inference, and many customers also want to use their general-purpose infrastructure for deep learning training. To meet our customers’ evolving needs, Intel enhances each generation to add value for diverse workloads, including artificial intelligence (AI).

The 2nd generation Intel Xeon Scalable processors began to address the needs of the AI ecosystem with Intel Deep Learning Boost, which includes Vector Neural Network Instructions (VNNI) aimed at improving inference performance using the int8 numerical precision.

The 3rd generation Intel Xeon Scalable processors, which launched in June 2020, evolve Intel Deep Learning Boost by adding built-in support for bfloat16 (commonly known as the brain floating-point format) as well as enhancements to VNNI. This makes them the first general-purpose data center processors to feature built-in acceleration for both deep learning training and inference.

Bfloat16 is a number-encoding format with the same dynamic range as IEEE FP32. It delivers greater throughput for both training and inference workloads without sacrificing accuracy or requiring extensive parameter tuning. As such, it benefits deep learning workloads with high compute intensity, including vision, natural language processing (NLP), reinforcement learning (RL), and more.
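To make the dynamic-range point concrete, here is a minimal Python sketch of how a bfloat16 value relates to FP32. Bfloat16 keeps the FP32 sign bit and all 8 exponent bits but only 7 mantissa bits, so it is effectively the upper 16 bits of an FP32 value. The helper names below are illustrative only; they do not come from any Intel library, and the sketch assumes finite inputs (NaN handling is omitted). It emulates the format in software and says nothing about how the hardware instructions are implemented.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Round x to bfloat16 (round-to-nearest-even) and return its 16-bit pattern.

    Assumption: x is finite. NaN inputs are not handled in this sketch.
    """
    bits = float32_bits(x)
    # Add 0x7FFF plus the lowest kept bit so that ties round to even,
    # then drop the 16 low-order mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float via FP32."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def roundtrip(x: float) -> float:
    """Convert x to bfloat16 and back, exposing the precision that is lost."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


if __name__ == "__main__":
    # A value near the top of the FP32 range stays finite in bfloat16,
    # while a small increment on 1.0 is rounded away.
    for value in (1.0, 1.001, 3.0e38):
        print(value, hex(to_bfloat16_bits(value)), roundtrip(value))
```

The trade-off is visible in the round trip: very large magnitudes remain representable because the exponent field is unchanged, while fine-grained differences between nearby values are lost because only 7 mantissa bits remain.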

MLPerf Results: Leadership in CPU Training

MLPerf is an industry-standard benchmark suite for ML/DL inference and training. MLPerf Training measures how fast a system can train models on a given dataset to a specified level of accuracy or quality. Vendors submit results once or twice a year; the results are verified and published on a regular cadence that alternates with MLPerf Inference results.

Reflecting the broad range of AI workloads, Intel submitted results for MLPerf Training Release v0.7 in June 2020 for three training topologies. Results in each case demonstrated that Intel continues to raise the bar for training on general purpose CPUs.

MiniGo is a representative benchmark in MLPerf for RL, a fast-growing area of ML with applications in robotics, traffic control systems, finance, and games. Intel’s MLPerf submission this year measured 409 minutes[1] to train MiniGo on eight nodes of a 4-socket system with 3rd Gen Intel® Xeon® Platinum processors (28 cores, 2.70 GHz, pre-production) and 6 UPI. These results show you can train MiniGo overnight with your CPU.
DLRM, the deep learning recommendation model benchmark, is designed to balance memory capacity, memory bandwidth, interconnect bandwidth, and compute/floating-point performance, all of which are important for large-scale recommendation systems with 1 terabyte of advertising click logs. Intel’s MLPerf submission trained DLRM with PyTorch in 116.62 minutes[2] on a single node of a 4-socket system with 3rd Gen Intel® Xeon® Platinum processors (28 cores, 2.70 GHz, pre-production) and 6 UPI, in 73.77 minutes[3] on two nodes of the same 4-socket system, in 71.55 minutes[4] on one node of the 8-socket Intel Xeon Platinum 8380H (2.90 GHz) system, and in 45.04 minutes[4] on four nodes of the same 4-socket system. These results show you can train DLRM in under an hour on four CPU nodes.
ResNet-50 v1.5, which uses a 50-layer convolutional neural network (CNN), is a widely used benchmark for image classification and other CNN-based applications. Intel’s MLPerf Training submission measured 1145.82 minutes[5] to train ResNet-50 on a single node of the 8-socket Intel Xeon Platinum 8380H (2.90 GHz) system with TensorFlow, and 1104.53 minutes[6] on a single node of the same 8-socket system with MXNet. These results show you can train ResNet-50 v1.5 in a day on your CPU.
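The "overnight", "under an hour", and "in a day" readings follow from simple arithmetic on the times quoted above. The short Python sketch below converts those times to hours and derives the DLRM speedup when moving from one to two and four nodes of the 4-socket system. The function names are illustrative, and the speedup and scaling-efficiency figures are our own derived ratios, not metrics reported by MLPerf.

```python
# Training times in minutes, copied from the MLPerf Training v0.7 results quoted above.
MINIGO_MINUTES = 409.0               # eight nodes, 4-socket system
RESNET50_TF_MINUTES = 1145.82        # one node, 8-socket system, TensorFlow
RESNET50_MXNET_MINUTES = 1104.53     # one node, 8-socket system, MXNet

# DLRM on the 4-socket system, keyed by node count.
DLRM_MINUTES = {1: 116.62, 2: 73.77, 4: 45.04}


def minutes_to_hours(minutes: float) -> float:
    """Convert a time-to-train in minutes to hours."""
    return minutes / 60.0


def speedup(baseline_minutes: float, minutes: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


def scaling_efficiency(nodes: int, baseline_minutes: float, minutes: float) -> float:
    """Speedup divided by node count; 1.0 would be perfect linear scaling."""
    return speedup(baseline_minutes, minutes) / nodes


if __name__ == "__main__":
    print(f"MiniGo: {minutes_to_hours(MINIGO_MINUTES):.2f} h")
    print(f"ResNet-50 (TensorFlow): {minutes_to_hours(RESNET50_TF_MINUTES):.2f} h")
    print(f"ResNet-50 (MXNet): {minutes_to_hours(RESNET50_MXNET_MINUTES):.2f} h")

    single_node = DLRM_MINUTES[1]
    for nodes, minutes in sorted(DLRM_MINUTES.items()):
        s = speedup(single_node, minutes)
        e = scaling_efficiency(nodes, single_node, minutes)
        print(f"DLRM on {nodes} node(s): {minutes:.2f} min, "
              f"speedup {s:.2f}x, efficiency {e:.0%}")
```

Comparisons like this are only meaningful within one system configuration, which is why the 8-socket DLRM result is left out of the scaling calculation.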

Infrastructure Implications

GPUs have their place in dedicated deep learning training, and Intel is developing a family of GPUs based on the Xe architecture. However, most machine learning and deep learning inference is still best served by CPUs, and for many organizations, 3rd generation Intel Xeon Scalable processors now provide a practical, efficient, and performant platform for deep learning training too, without the need to invest in new compute. Training on Intel Xeon Scalable processors lets enterprises avoid the cost and complexity of introducing special-purpose hardware for AI training. It also lets them maintain a common data pipeline by performing data preparation, model training, and inference on the same familiar Xeon Scalable technology. Many organizations are using Intel Xeon Scalable processors and Analytics Zoo to build an end-to-end analytics and AI dataflow that is more efficient than a GPU-based alternative.

To decide whether training on general-purpose infrastructure is appropriate for your organization, we recommend you consider three aspects of your training requirements:

Dataset size. Datasets with large images and large models (e.g., satellite imaging, oil and gas imaging, text and handwriting generation, music generation, language translation via neural machine translation, or image captioning) can benefit from the enormous memory capacity of 3rd generation Intel Xeon Scalable processors, including up to 4.5 TB per socket with Intel Optane™ persistent memory.
Workload characteristics. Reinforcement learning, transfer learning, recommender engines, and high-resolution computer vision with CNNs perform very well on CPUs.
Demand/frequency. Training workloads that do not require dedicated, 24/7 training resources can often run overnight or intermittently on shared infrastructure. This may also help improve overall utilization if special-purpose hardware would sit idle part of the time. For example, the MiniGo result was under 7 hours of training, which could easily be done overnight.

If you’re not already using Intel Xeon Scalable processors for model training, you can get started by training, testing, and optimizing your model on the CPU you know using an Intel-optimized framework such as TensorFlow, PyTorch, or MXNet. Learn more about the MLPerf Training v0.7 June 2020 results at https://mlperf.org/training-results-0-7/ and watch for future MLPerf Training reports. We plan to continue optimizing for more topologies and workloads over time.

Learn more about Intel’s AI technology portfolio here: https://www.intel.com/ai

Product and Performance Information

[1] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

[2] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

[3] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

[4] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

[5] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

[6] MLPerf v0.7 Training Closed; Retrieved from https://mlperf.org/training-results-0-7/ 29 July 2020, entry 0.7-XYZ. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results may not reflect all publicly available security updates.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product user and reference guides for more information regarding the specific instruction sets covered by this notice.

© Intel Corporation. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.