Numenta and Intel Accelerate Inference 20x on Large Language Models with Intel® Xeon® CPU Max Series

Intel_AI_Community · 04-06-2023

Posted on behalf of Authors:

Lawrence Spracklen, Numenta; Christy Maver, Numenta;

Nick Payton, Intel; Vikram Saletore, Intel

Summary

Natural language processing (NLP) has exploded with the evolution of transformers. But running these language models efficiently in production has been challenging, if not impossible, on a CPU, whether the workload is short text snippets (such as text messages or chats) with low latency requirements or long documents with high throughput requirements.

In prior work, Numenta showed how their custom-trained language models could run on 4th Gen Intel® Xeon® Scalable processors with <10 ms latency and achieve a 100x throughput speedup vs. current generation AMD Milan CPU implementations for BERT inference on short text sequences (1). In this project, Numenta showcases how their custom-trained large language models can run 20x faster on large documents (long sequence lengths) on Intel® Xeon® CPU Max Series processors, which have high bandwidth memory located on the processor, vs. current generation AMD Milan CPU implementations (2). In both cases, Numenta demonstrates the capacity to dramatically reduce the overall cost of running language models in production on Intel, unlocking entirely new NLP capabilities for customers.

Read and learn more about this Numenta project at Intel Customer Spotlight.

(1) For more, see: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/

Numenta: BERT-Large: Sequence Length 64, Batch Size 1, throughput optimized

3rd Gen Intel® Xeon® Scalable: Tested by Numenta as of 11/28/2022. 1-node, 2x Intel® Xeon® 8375C on AWS m6i.32xlarge, 512 GB DDR4-3200, Ubuntu 20.04 Kernel 5.15, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 64, Batch Size 1.

Intel® Xeon® 8480+: Tested by Numenta as of 11/28/2022. 1-node, pre-production platform with 2x Intel® Xeon® 8480+, 512 GB DDR5-4800, Ubuntu 22.04 Kernel 5.17, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 64, Batch Size 1.

(2) For more, see: https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html

Numenta BERT-Large

AMD Milan: Tested by Numenta as of 11/28/2022. 1-node, 2x AMD EPYC 7R13 on AWS m6a.48xlarge, 768 GB DDR4-3200, Ubuntu 20.04 Kernel 5.15, OpenVINO 2022.3, BERT-Large, Sequence Length 512, Batch Size 1.

Intel® Xeon® 8480+: Tested by Numenta as of 11/28/2022. 1-node, 2x Intel® Xeon® 8480+, 512 GB DDR5-4800, Ubuntu 22.04 Kernel 5.17, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 512, Batch Size 1.

Intel® Xeon® Max 9468: Tested by Numenta as of 11/30/2022. 1-node, 2x Intel® Xeon® Max 9468, 128 GB HBM2e 3200 MT/s, Ubuntu 22.04 Kernel 5.15, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 512, Batch Size 1.