Authors: YeHur Cheong, Rahul Unnikrishnan Nair
Embedded LLM, an Intel® Liftoff member, is the creator of JamAI Base - a collaborative, AI-native spreadsheet where each cell acts as an intelligent agent, enabling users to build complex AI pipelines with simplicity and speed.
With a mission to provide the foundational “power grid” software for the knowledge economy, Embedded LLM is also behind TokenVisor, a tool for advanced model orchestration.
As part of their ongoing work, the team recently benchmarked Intel® Gaudi® 2 against NVIDIA A100, uncovering meaningful performance and efficiency gains that have significant implications for enterprise AI deployments.
Deploying on Intel® Tiber™ AI Cloud & Invitation to Startups
Before we dive deep into the Intel® Gaudi® 2 software stack and its performance optimizations for Large Language Models (LLMs), it's important to set the stage. All the benchmarking, testing, and development work described in this post was performed using the Intel® Tiber™ AI Cloud.
Intel® Tiber™ AI Cloud is a managed cloud platform specifically designed to provide developers and AI startups with scalable and cost-effective access to Intel's advanced AI hardware portfolio.
This includes Intel® Gaudi® 2 (and Gaudi® 3) accelerators, Intel® Data Center GPU Max Series, and the latest Intel® Xeon® Scalable processors.
For startups focused on building and deploying compute-intensive AI models, Intel® Tiber™ AI Cloud removes the significant barrier of upfront hardware investment, while providing an environment optimized for performance.
The AI landscape is continuously being reshaped by the remarkable capabilities of Large Language Models (LLMs). These models are increasingly being deployed in applications that demand sophisticated reasoning, pushing the boundaries of what AI can achieve.
However, the sheer computational power required by these advanced LLMs, particularly when processing the extensive input sequences necessary for complex reasoning, presents significant challenges for their widespread adoption and efficient deployment.
Startups and enterprises alike are seeking innovative solutions to overcome these performance and efficiency hurdles.
In this context, the Habana Gaudi 2 AI accelerator emerges as a compelling alternative, purpose-built to address the intricate computational demands of modern AI workloads, including the intensive needs of reasoning LLMs.
For startups within programs like Intel® Liftoff, the pursuit of high performance coupled with cost-effectiveness is paramount. Intel® Gaudi® 2 is architected from the ground up to accelerate deep learning tasks, offering a unique set of features that make it a strong contender for those looking to deploy cutting-edge AI models.
Its design philosophy centers around providing efficient computation and high memory bandwidth, crucial elements for handling the complexities of reasoning in LLMs.
Optimizing the Engine: How Software and Hardware Power Gaudi 2 LLM Performance
Achieving state-of-the-art performance for LLMs on Habana Gaudi 2, especially in demanding long-sequence scenarios benchmarked with frameworks like vLLM, requires more than just powerful silicon.
It relies on a seamless interplay between the SynapseAI™ software stack and the purpose-built hardware architecture. Understanding how these two layers work together reveals the key to unlocking Gaudi 2's potential.
The SynapseAI™ Software Stack: Translating and Optimizing Execution
At the heart of Gaudi 2's software ecosystem lies SynapseAI™, Habana's SDK designed specifically for deep learning. It acts as the crucial bridge between popular frameworks like PyTorch (used by vLLM) and the underlying Gaudi 2 hardware. SynapseAI™ doesn't just translate operations; it actively compiles and optimizes the computational graph for maximum efficiency on the HPUs. Key optimization techniques include:
Graph Compilation & Recipe Caching: SynapseAI™ features a graph compiler that analyzes the sequence of operations defined in the PyTorch model [Source: Habana Docs]. It optimizes this graph and compiles computational kernels tailored for the HPUs. To accelerate subsequent runs and initialization times, SynapseAI employs "recipe caching." Compiled graph segments or kernels ("recipes") can be stored (configurable via environment variables like PT_HPU_RECIPE_CACHE_CONFIG), allowing them to be quickly loaded later, bypassing redundant compilation work [Source: Habana Docs].
HPU Graphs for Reduced Overhead: HPU Graphs are a powerful SynapseAI feature for minimizing host-CPU overhead during inference [Source: Habana Docs]. Using the HPU Graph capture utilities in the Habana PyTorch bridge (for example, wrapping an inference module so its forward pass is captured once and then replayed), the computational graph for one or more iterations can be replayed directly on the device. This bypasses the Python interpreter overhead for each operation within the captured graph, making execution significantly more device-bound and improving throughput, especially for the iterative nature of LLM token generation (a minimal sketch follows this list).
The Critical Warmup Phase: Initial inference iterations on Gaudi 2 often exhibit higher latency. This "warmup" period is essential: it is when SynapseAI performs its upfront optimizations.
During warmup, initial graph compilation occurs, populating the recipe cache. HPU Graphs (if used) are captured. Memory buffers, including space for the crucial KV cache used heavily in LLMs, are allocated and optimized [Source: Habana Tutorials]. This investment in warmup translates directly into lower, consistent latency and higher sustained throughput for the subsequent inference workload, which is critical for real-world deployment.
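To make the pieces above concrete, here is a minimal sketch of preparing an inference module on Gaudi. It assumes the habana_frameworks PyTorch bridge is installed; the PT_HPU_RECIPE_CACHE_CONFIG value format and the wrap_in_hpu_graph helper follow Habana's documentation but may differ between SynapseAI releases, so treat this as illustrative rather than canonical.

```python
import os

# Persist compiled graph "recipes" across runs (the exact "path,flag,size"
# value layout is an assumption - check the Habana docs for your release).
os.environ.setdefault("PT_HPU_RECIPE_CACHE_CONFIG", "/tmp/hpu_recipe_cache,false,4096")

import torch
import torch.nn as nn
import habana_frameworks.torch.hpu  # registers the "hpu" device with PyTorch
from habana_frameworks.torch.hpu import wrap_in_hpu_graph  # HPU Graph capture/replay helper

# Stand-in module for illustration; in practice this would be the LLM's decoder.
model = nn.Sequential(nn.Embedding(32000, 1024), nn.Linear(1024, 32000)).eval().to("hpu")

# Capture the forward pass once and replay it on-device on later calls,
# avoiding per-op Python overhead on every decode step.
model = wrap_in_hpu_graph(model)

# Warmup: run the shapes we expect in production so graphs are compiled
# and recipes cached before real traffic arrives.
with torch.no_grad():
    for seq_len in (128, 1024, 4096):
        tokens = torch.randint(0, 32000, (1, seq_len), device="hpu")
        _ = model(tokens)
```

After this warmup, requests that hit the already-compiled shapes avoid recompilation entirely, which is exactly the steady-state behavior the benchmarks below rely on.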
Frameworks like vLLM running on Gaudi 2 benefit directly from these SynapseAI optimizations. While vLLM manages high-level scheduling and memory (like the KV cache), its underlying PyTorch operations are efficiently executed on the HPUs thanks to SynapseAI's graph compilation, HPU Graph utilization, and caching mechanisms.
The Gaudi 2 Hardware Foundation: Purpose-Built for AI
These sophisticated software optimizations run on a hardware architecture specifically designed to meet the demands of large-scale deep learning:
Compute Engines: Flexibility and Acceleration
Tensor Processor Cores (TPCs): These highly programmable VLIW SIMD processors handle the diverse range of operations found in modern LLMs, offering flexibility beyond just matrix math. Their programmability is key for efficiently executing complex attention mechanisms or custom operations.
Matrix Math Engine (MME): This dedicated accelerator focuses solely on dense matrix multiplications, the computational core of LLMs, providing significant speedups for these intensive operations.
Memory Hierarchy: Capacity, Bandwidth, and Locality
High Bandwidth Memory (HBM2E): A substantial 96GB capacity accommodates large models and the extensive KV caches inherent to long sequences (see the sizing sketch after this list), while roughly 2.45 TB/s of bandwidth ensures the compute engines aren't starved for data.
On-Chip SRAM: 48MB of fast, low-latency SRAM keeps frequently accessed data close to the TPCs and MME, minimizing latency and improving data locality.
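To see why 96GB of HBM matters for long-context reasoning workloads, here is a back-of-the-envelope KV cache calculation. The layer and head counts are the published Llama 3.x 70B configuration; the sequence length mirrors the long-context benchmark later in this post (1,000 input + 3,000 output tokens) and is otherwise just an illustration.

```python
# Rough KV cache footprint per sequence for a Llama-3.x-70B-class model.
num_layers   = 80      # decoder layers
num_kv_heads = 8       # grouped-query attention KV heads
head_dim     = 128     # per-head dimension
seq_len      = 4000    # ~1,000 input + 3,000 output tokens

def kv_cache_bytes(bytes_per_value: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dim, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

print(f"BF16 KV cache: {kv_cache_bytes(2) / 1e9:.2f} GB per sequence")  # ~1.3 GB
print(f"FP8  KV cache: {kv_cache_bytes(1) / 1e9:.2f} GB per sequence")  # ~0.66 GB
```

Dozens of such sequences in flight, on top of each card's share of the model weights, add up quickly, which is why both the capacity and the bandwidth feeding the MME and TPCs matter.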
Integrated Networking for Scalability
RDMA over Converged Ethernet (RoCE v2): Built-in high-speed networking (24 x 100 Gbps ports) enables efficient scaling across multiple Gaudi 2 accelerators using standard Ethernet, crucial for very large models or high-throughput requirements.
Synergy for Performance
In conclusion, Gaudi 2's LLM performance, particularly evident in benchmarks leveraging frameworks like vLLM, arises from the powerful synergy between the SynapseAI™ software stack and the underlying hardware.
SynapseAI™ optimizes execution through graph compilation, HPU Graphs, and intelligent caching, primed by a warmup phase.
This software runs efficiently on an architecture featuring flexible TPCs, accelerated MMEs, a high-capacity/high-bandwidth memory system, and integrated scaling capabilities. Together, they form a potent combination for tackling today's most demanding AI inference workloads.
To put their infrastructure to the test, Embedded LLM ran a series of rigorous benchmarks focused on long-context LLM inference - an area where hardware efficiency and architectural nuance matter deeply. Their findings shed light on how thoughtful software-hardware co-optimization can shift the balance of power in AI infrastructure decisions.
Here, we'll explore how Intel® Gaudi® 2, an AI accelerator designed for deep learning workloads, is emerging as a compelling and cost-competitive alternative to traditional GPU solutions like the NVIDIA A100. We'll demonstrate the significant performance gains achieved through a combination of firmware enhancements, advanced quantization techniques, cutting-edge vLLM optimizations, and an Intel® Gaudi® 2-specific custom warmup procedure. Finally, we'll pit Gaudi2 against the A100 in a real-world long context inference scenario, revealing its potential to disrupt the LLM inference landscape.
Here’s what we'll be covering:
- Performance improvements from the Intel® Gaudi® Software Suite v1.18 to v1.19 update.
- The impact of FP8 quantization for faster inference.
- Optimizations enabled by newer versions of the vLLM framework.
- A custom warmup technique that unlocks Intel® Gaudi® 2's potential for long sequences.
- A direct performance comparison with the NVIDIA A100 in a long context setting.
The Power of Software and Optimization: v1.18 vs. v1.19
The move from Intel® Gaudi® Software Suite version v1.18 to v1.19 brings substantial performance gains to Gaudi2, enabling faster and more efficient LLM inference. Embedded LLM benchmarks using Llama3.3-70B-Instruct and the ShareGPT dataset show improvements across key metrics, as illustrated in Figure 1.
They focused on improvements in throughput (requests per second) and reductions in generation latency (median time per output token, or TPOT); a short sketch of how these metrics are computed follows Figure 1.
- Request Throughput (Req/s): The v1.19 software suite significantly improves request throughput. At a request rate of 8, the v1.19 software suite boosts throughput by 20.62%, from 3.54 requests per second to 4.27 requests per second. This means Gaudi2 can process more requests concurrently, improving overall system utilization.
- Time Per Output Token (Median TPOT): The Median TPOT also sees improvements. The median time to generate each subsequent token decreased by up to ~47%, from 85.06 ms to 44.87 ms. This contributes to faster overall generation times.
These combined improvements highlight the effectiveness of the v1.19 Intel® Gaudi® Software Suite in optimizing Intel® Gaudi® 2 for demanding LLM inference workloads.
Figure 1: The v1.19 firmware update delivers a significant improvement in request throughput (requests per second) and time per output token across various request rates, highlighting the performance improvements gained through software optimization.
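For reference, the two metrics shown in Figure 1 reduce to simple arithmetic over per-request timings. The sketch below follows the definitions commonly used by serving benchmarks (TPOT excludes the first token, which is counted under TTFT); the data structure and names are ours, not part of any particular tool.

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start_s: float          # when the request was sent
    first_token_s: float    # when the first output token arrived
    end_s: float            # when the last output token arrived
    output_tokens: int      # number of generated tokens

def request_throughput(timings: list[RequestTiming]) -> float:
    """Completed requests per second over the whole benchmark window."""
    window = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    return len(timings) / window

def median_tpot_ms(timings: list[RequestTiming]) -> float:
    """Median time per output token, excluding the first token (that's TTFT)."""
    per_request = [
        (t.end_s - t.first_token_s) / max(t.output_tokens - 1, 1) * 1000.0
        for t in timings
    ]
    return statistics.median(per_request)
```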
Quantization: FP8 Unleashes Further Potential on Gaudi 2
LLMs are traditionally stored and processed using higher-precision formats like BF16. Quantization reduces the numerical precision of weights and activations to lower-bit formats, making computation faster and more memory efficient. One emerging standard is FP8 (8-bit floating point), which strikes an excellent balance between model accuracy and system performance.
Crucially, the Intel® Gaudi® 2 accelerator provides native hardware support for FP8 computations, handling both the E4M3 and E5M2 formats directly within its Tensor Processor Cores (TPCs) and Matrix Math Engine (MME). This native capability, distinct from software emulation, is fundamental to its performance advantage. The hardware potential is unlocked and optimized by the Habana SynapseAI™ software stack, whose graph compiler translates and refines the model's operations for efficient FP8 execution. Preparing models for FP8 deployment typically leverages tools such as Intel® Neural Compressor (INC), which integrates with SynapseAI for precise calibration and quantization. Furthermore, to maintain high numerical accuracy during these lower-precision computations, Gaudi 2 typically uses internal FP32 accumulation within its MME.
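As a quick numerical illustration of "FP8 inputs with FP32 accumulation", the CPU-only PyTorch snippet below (requires PyTorch 2.1+ for the float8 dtypes) quantizes two tensors to E4M3 with per-tensor scales and performs the matmul in FP32. It only emulates the arithmetic idea; on Gaudi 2 the MME does the FP8 multiply with FP32 accumulation in hardware, and scales come from calibration tooling rather than being computed ad hoc like this.

```python
import torch

x = torch.randn(128, 256)
w = torch.randn(256, 512)

# Per-tensor scales so values fit within the E4M3 representable range (~448).
fp8_max = torch.finfo(torch.float8_e4m3fn).max
x_scale = x.abs().max() / fp8_max
w_scale = w.abs().max() / fp8_max

# Quantize to 1-byte FP8 (half the memory of BF16, a quarter of FP32).
x_fp8 = (x / x_scale).to(torch.float8_e4m3fn)
w_fp8 = (w / w_scale).to(torch.float8_e4m3fn)

# Dequantize and accumulate in FP32, mimicking "FP8 inputs, FP32 accumulation".
y_fp8 = (x_fp8.to(torch.float32) * x_scale) @ (w_fp8.to(torch.float32) * w_scale)
y_ref = x @ w

print("bytes per element:", x_fp8.element_size(), "vs BF16:", x.to(torch.bfloat16).element_size())
print("max abs error vs FP32 matmul:", (y_fp8 - y_ref).abs().max().item())
```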
Leveraging these integrated hardware and software capabilities for FP8 quantization on Gaudi 2 yields substantial latency reductions, resulting in a more responsive and interactive user experience. As depicted in Figure 2, quantizing Llama3-70B-Instruct to FP8 within the v1.19 software suite significantly improves both the time to first token (TTFT) and the median time per output token (Median TPOT). Specifically:
- Reduced Time to First Token (TTFT): TTFT measures initial responsiveness, and FP8 improves it by up to ~67% at a request rate of 16 (a measurement sketch follows Figure 2).
- Reduced Time Per Output Token (Median TPOT): Median TPOT also decreases significantly with FP8 quantization; at a request rate of 4, it dropped by over 30% compared to BF16.
These improvements clearly show how Gaudi 2's optimized FP8 implementation plays a crucial role in reducing inference latency and enhancing user interaction.
Figure 2: FP8 quantization significantly reduces both Time to First Token (TTFT) and Median Time Per Output Token (TPOT), leading to a more responsive LLM inference experience on Gaudi2.
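TTFT itself is easy to measure against any OpenAI-compatible vLLM endpoint: time the gap between sending a streaming request and receiving the first content chunk. The endpoint URL and model name below are placeholders for your own deployment.

```python
import time
from openai import OpenAI

# Placeholder endpoint/model: point this at your vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain FP8 quantization briefly."}],
    stream=True,
    max_tokens=256,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()

print(f"TTFT: {(first_token_at - start) * 1000:.1f} ms")
```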
Further optimization is possible by fine-tuning vLLM's deployment arguments. For example, enabling multistep scheduling specifically targets the decoding stage, yielding an additional Median TPOT reduction of up to 19% and even faster text generation on Gaudi 2, as Figure 3 shows (a launch sketch follows the figure).
Figure 3: This chart illustrates the impact of multistep scheduling on Median TPOT when using FP8 quantization on Gaudi2. The data demonstrates that enabling multistep scheduling yields a further reduction in generation latency compared to FP8 alone.
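As a concrete example of the vLLM knob involved, recent vLLM releases expose multistep scheduling through the num_scheduler_steps engine argument (also available as --num-scheduler-steps on the server CLI). The value below is illustrative, the optimal step count is workload-dependent, and the exact Gaudi-specific launch flags are documented in the Gaudi vLLM fork, so treat this as a sketch rather than the configuration used in the benchmark.

```python
from vllm import LLM, SamplingParams

# Offline-inference sketch: schedule several decode steps per engine iteration
# to cut scheduler/host overhead between generated tokens.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,      # spread the 70B model across 8 accelerators
    num_scheduler_steps=8,       # multistep scheduling; tune for your workload
)

outputs = llm.generate(
    ["Summarize the benefits of FP8 inference in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```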
To appreciate the combined impact of these optimizations, let's compare the best optimized setup (v1.19 with FP8 and multistep scheduling) against the baseline v1.18 configuration with custom warmup. We'll focus on Median Time Per Output Token (TPOT) as a key indicator of end-to-end generation performance; comparing these settings shows how much was gained at each step. Figure 4 shows the results.
Figure 4: Comparing Median TPOT between the original vLLM-v1.18 configuration and the optimized vLLM-v1.19 FP8 with multistep scheduling setup highlights the cumulative gains from software, quantization, and vLLM optimizations.
Gaudi 2's Secret Sauce: Custom Warmup for Long Context Inference (Beneficial for SOTA reasoning models)
Gaudi 2 benefits from a specialized warmup procedure that optimizes memory allocation and kernel loading for sustained high performance. This is particularly beneficial for long-sequence inference, which is central to deploying reasoning LLMs, now the de facto state-of-the-art models (a generic sketch of the idea follows Figure 5).
Figure 5: Gaudi2's optimized warmup procedure for long context inference achieves higher request throughput and reduced latency (TTFT and Median TPOT) compared to the A100.
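Embedded LLM has not published the exact warmup recipe here, but the general idea can be sketched: before serving real traffic, drive the engine with synthetic requests at the batch sizes and (long) sequence lengths expected in production, so SynapseAI compiles and caches graphs for those shapes and KV cache buffers are pre-allocated. The prompt construction and shapes below are purely illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=8)

# Illustrative warmup sweep: exercise the prompt/decode shapes expected in
# production (here, ~1,000-token prompts with up to 3,000 generated tokens)
# so graph compilation and buffer allocation happen before real traffic.
warmup_prompt = "lorem ipsum " * 500            # roughly 1,000 tokens of filler
for batch_size in (1, 4, 16):
    prompts = [warmup_prompt] * batch_size
    llm.generate(prompts, SamplingParams(max_tokens=3000, ignore_eos=True))

# After the sweep, latency on matching shapes is stable and low.
```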
Is Gaudi2 a Viable Solution for LLM Inference?
The data speaks for itself. Through a combination of firmware enhancements, FP8 quantization, optimized vLLM configurations, and a custom warmup procedure, Intel® Gaudi® 2 demonstrates impressive performance for LLM inference, rivaling and in some cases surpassing the NVIDIA A100, especially when dealing with long context.
Specifically, the long context benchmark reveals that Intel® Gaudi® 2, with its optimized warmup configuration, achieves:
- Higher Throughput: Gaudi 2 delivers more output tokens per second at a request rate of 16 for long sequences (1,000 input / 3,000 output tokens).
- Competitive Latency: Gaudi 2 shows Time to First Token (TTFT) similar to the A100 at the same request rate of 16.
- Faster Generation: Gaudi 2 generates tokens roughly 34% faster, as measured by Time Per Output Token (TPOT).
The data presented positions Intel® Gaudi® 2 as a strong competitor to the A100 for LLM inference. Its effective utilization of FP8 quantization and tailored software enhancements translates to competitive throughput and latency, offering a compelling, cost-efficient path to high-performance LLM deployment.
If you represent an AI startup interested in exploring the capabilities of Intel® Gaudi® accelerators and leveraging the optimized environment of Intel® Tiber™ AI Cloud for your own projects, we encourage you to connect with the Intel® Liftoff for AI Startups program.
This program is designed to support startups like yours with resources, technical expertise, and access to platforms like Tiber AI Cloud.
To maximize AI performance, you need the right hardware, but you also need the right ecosystem to scale. AI startups don't have to figure it all out alone. At Intel® Liftoff, you're surrounded by mentors, cutting-edge technical resources, and fellow founders working to solve the toughest challenges in AI.
Related resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads
Embedded LLM evaluates Intel Xeon CPUs for embedding models:
https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Beyond-GPUs-Why-JamAI-Base-Moved-Embedding-Models-to-Intel-Xeon/post/1650850