
2nd Generation Intel® Xeon® Scalable CPUs Outperform NVIDIA GPUs on NCF Deep Learning Inference

MaryT_Intel, Employee

Recommender systems are some of the most complex and prevalent commercial AI applications deployed by internet companies today. One of the biggest challenges in using these systems is collaborative filtering – making predictions about the interests of one person based on the tastes and preferences of similar users. A novel model called Neural Collaborative Filtering (NCF) leverages deep learning to model user-item interactions for better recommendation performance, and the MLPerf organization uses NCF as a key benchmark.
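
For readers unfamiliar with the architecture, below is a minimal sketch of a NeuMF-style NCF model in MXNet Gluon: user and item embeddings feed a generalized matrix factorization (GMF) branch and an MLP branch, and the two are fused into a single interaction score. Layer sizes and names here are illustrative, not the exact benchmark configuration.

from mxnet.gluon import nn

class NeuMF(nn.HybridBlock):
    """Minimal NeuMF-style NCF model: a GMF branch plus an MLP branch."""
    def __init__(self, num_users, num_items, factors=64, **kwargs):
        super(NeuMF, self).__init__(**kwargs)
        with self.name_scope():
            # GMF branch: element-wise product of user and item embeddings
            self.gmf_user = nn.Embedding(num_users, factors)
            self.gmf_item = nn.Embedding(num_items, factors)
            # MLP branch: concatenated embeddings through dense layers
            self.mlp_user = nn.Embedding(num_users, factors)
            self.mlp_item = nn.Embedding(num_items, factors)
            self.mlp = nn.HybridSequential()
            self.mlp.add(nn.Dense(128, activation='relu'),
                         nn.Dense(64, activation='relu'))
            # final layer fuses both branches into one interaction score
            self.predict = nn.Dense(1)

    def hybrid_forward(self, F, user, item):
        gmf = self.gmf_user(user) * self.gmf_item(item)
        mlp = self.mlp(F.concat(self.mlp_user(user), self.mlp_item(item), dim=1))
        return F.sigmoid(self.predict(F.concat(gmf, mlp, dim=1)))

Calling model.hybridize() on such a block compiles it into a symbolic graph, which is what lets DNNL fuse and accelerate the operators at inference time.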

Through hardware advances, software tool development, and framework optimizations, Intel has achieved tremendous deep learning performance improvements on CPUs in recent years. Thanks to the Intel Deep Learning Boost (Intel DL Boost) feature found in 2nd generation Intel Xeon Scalable processors, we demonstrated leadership NCF model inference performance of more than 64 million requests per second at 1.22 millisecond (ms) latency on a dual-socket Intel Xeon Platinum 9282 processor-based system, outperforming the GPU results on NCF published by NVIDIA on January 16th, 2020. [1]
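
The INT8 gains come from Intel DL Boost's Vector Neural Network Instructions (VNNI), which multiply 8-bit activations by 8-bit weights and accumulate into 32-bit integers in a single instruction. The NumPy lines below are only a scalar illustration of that u8 × s8 → s32 arithmetic, not an invocation of the instruction itself:

import numpy as np

# Quantized operands, as an INT8 inference kernel would see them.
acts = np.random.randint(0, 256, size=64).astype(np.uint8)    # unsigned 8-bit activations
wts = np.random.randint(-128, 128, size=64).astype(np.int8)   # signed 8-bit weights
# VNNI (vpdpbusd) performs this multiply-accumulate into 32-bit lanes in one instruction.
acc = np.dot(acts.astype(np.int32), wts.astype(np.int32))
print(acc)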

Model | Platform                      | Performance                                               | Precision | Dataset
NCF   | Intel Xeon Platinum 9282 CPU  | Throughput: 64.54 million requests/sec; Latency: 1.22 ms  | INT8      | MovieLens 20 Million
NCF   | NVIDIA V100 Tensor Core GPU   | Throughput: 61.94 million requests/sec; Latency: 20 ms    | Mixed     | MovieLens 20 Million
NCF   | NVIDIA T4 Tensor Core GPU     | Throughput: 55.34 million requests/sec; Latency: 1.8 ms   | INT8      | Synthetic

Figure 1: 2nd Gen Intel Xeon Scalable processor performance on the NCF model compared to NVIDIA GPUs. See [1] for configuration details.

Instructions to Reproduce the Results:

Step 1: Install the Intel Math Kernel Library (Intel MKL) through the YUM or APT repository.

Step 2: Build MXNet with Intel MKL and activate the runtime environment.

The Intel Deep Neural Network Library (Intel DNNL), formerly known as Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), is enabled by default. See the article, Install MXNet with Intel MKL-DNN, for details.

# fetch MXNet at the exact commit used for this benchmark
git clone https://github.com/apache/incubator-mxnet
cd ./incubator-mxnet && git checkout dfa3d07
git submodule update --init --recursive
# build against Intel MKL (Intel DNNL is enabled by default)
make -j USE_BLAS=mkl USE_INTEL_PATH=/opt/intel
# set the Intel runtime environment and expose the Python bindings
source /opt/intel/bin/compilervars.sh intel64
export PYTHONPATH=/workspace/incubator-mxnet/python/
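
Before benchmarking, it is worth confirming that the build actually picked up MKL and DNNL. Assuming MXNet 1.5 or later, where the runtime feature API is available, this is a quick check:

from mxnet.runtime import Features

# Confirm the freshly built MXNet was compiled with MKL BLAS and MKL-DNN/DNNL.
features = Features()
print('BLAS_MKL enabled:', features.is_enabled('BLAS_MKL'))
print('MKLDNN enabled:', features.is_enabled('MKLDNN'))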

Step 3: Launch NCF (see the README for details). You can run a quick benchmark with the following commands:

# go to the NCF example directory
cd /workspace/incubator-mxnet/example/neural_collaborative_filtering/
# install the required Python libraries
pip install numpy pandas scipy tqdm
# prepare the ml-20m dataset
python convert.py
# download the pre-trained models (see the repo README for the link),
# then fuse/optimize the model graph
python model_optimizer.py
# run INT8 calibration on the ml-20m dataset
python ncf.py --prefix=./model/ml-20m/neumf-opt --calibration
# benchmark the quantized model
bash benchmark.sh -p model/ml-20m/neumf-opt-quantized
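
For context on what the calibration step produces: MXNet exposes a contrib API that converts an FP32 model to INT8. The sketch below shows the general shape of that conversion; the checkpoint prefix and epoch are illustrative, and the actual ncf.py script handles the calibration data and layer exclusions itself.

import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load the optimized FP32 checkpoint (prefix/epoch are illustrative).
sym, arg_params, aux_params = mx.model.load_checkpoint('./model/ml-20m/neumf-opt', 0)
# Convert to INT8; calib_mode='none' skips offline calibration (min/max
# computed at runtime), whereas the benchmark calibrates on ml-20m samples.
qsym, qarg_params, qaux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.cpu(), calib_mode='none', quantized_dtype='int8')
mx.model.save_checkpoint('./model/ml-20m/neumf-opt-quantized', 0,
                         qsym, qarg_params, qaux_params)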

Summary

As shown above, Intel Xeon Scalable processors are highly effective for NCF model inference. Next, we will extend this acceleration to broader recommender system models such as DLRM, and demonstrate training efficiency with mixed precision, combining single precision (float32) and bfloat16, using the new extensions being added to Intel DL Boost in next-generation Intel Xeon Scalable processors, due out later this year.

[1] Configuration Details

Intel Xeon Platinum 9282 Processor: Tested by Intel as of 01/17/2020. DL Inference: Platform: Intel S2900WK 2S Intel Xeon Platinum 9282 (56 cores per socket), HT ON, turbo ON, Total Memory 384 GB (24 slots/ 16 GB/ 2933 MHz), microcode: 0x500002c, BIOS: PLYXCRB1.86B.0576.D18.1902140627, Ubuntu 18.04.2 LTS (GNU/Linux 5.4.0-050400-generic x86_64), Deep Learning Framework: Apache MXNet version: https://github.com/apache/incubator-mxnet Commit id: dfa3d07, GCC 7.4.0 for build, DNNL version: v1.1.2 (commit hash: cb2cc7a), model: https://github.com/apache/incubator-mxnet/tree/dfa3d07/example/neural_collaborative_filtering, BS=700, Real Data: MovieLens-20M, 112 software instances/2 sockets, Datatype: INT8; throughput: 64.54 million samples/s.

NVIDIA performance and configuration details taken from https://developer.nvidia.com/deep-learning-performance-training-inference on 01/16/2020. Batch latency of the NVIDIA V100 is derived from its published throughput: 1,048,576 requests/batch × 1000 / 61,941,700 requests/sec ≈ 16.9 ms, rounded up to 20 ms.
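
That derivation in code form, using only the figures quoted above:

# Recompute the V100 batch latency from NVIDIA's published throughput.
batch_size = 1_048_576         # requests per batch
throughput = 61_941_700        # requests per second
latency_ms = batch_size / throughput * 1000
print(f'{latency_ms:.2f} ms')  # ~16.93 ms, reported as ~20 ms in Figure 1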

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Refer to http://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available security updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About the Author
Mary is the Community Manager for this site. She likes to bike and do college and career coaching for high school students in her spare time.