Running the AI Factory: How Enterprises Operationalize AI Placement at Scale

rahulawasthy · ‎06-01-2026

In Parts 1 and 2, we established the factory floor metaphor and mapped it to AI infrastructure:

Inference is the function
GPUs are the high-speed robotic line
CPUs are flexible production equipment

Part 3 now moves from metaphor to operations. When an enterprise runs an AI workload, placement discipline, not hardware upgrades, determines whether costs scale or spiral out of control. We then turn to the workload that makes this question even sharper: agentic AI.

Job-Type and Production Requirements dictate placement

Consider a large enterprise running a daily AI workload that classifies internal and external communications: customer emails, chat transcripts, support tickets, and sales notes. The value of these classifications shows up in downstream workflows such as routing and escalation, compliance checks, analytics and reporting, search and retrieval, and training and coaching. Running this job in production dictates SLAs and metrics. Translated into operational terms, the job looks like this:

Volume: 500,000 documents per day
Input size: Variable, but predictable
Output: Structured tags and classifications
SLA: 1-2 Hours
Arrival pattern: Bursty intake, steady processing window

The requirements above point to metrics for throughput, predictability, cost efficiency, and operational stability. The workload does not need sub-second latency, interactive streaming, or per-request optimization.
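To turn these requirements into a capacity target, a back-of-the-envelope calculation can help. The sketch below assumes the per-document token counts given in the note under References (roughly 2,000 input tokens and 150 output tokens). The processing window is left as a parameter because the SLA can be read either as a per-document turnaround or as a batch window. The function and field names are illustrative, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class ThroughputTarget:
    docs_per_second: float
    input_tokens_per_second: float
    output_tokens_per_second: float


def required_throughput(docs_per_day: int,
                        window_hours: float,
                        input_tokens_per_doc: int = 2_000,  # assumption from the References note
                        output_tokens_per_doc: int = 150    # assumption from the References note
                        ) -> ThroughputTarget:
    """Sustained rates needed to process one day's documents within window_hours."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    docs_per_second = docs_per_day / (window_hours * 3600)
    return ThroughputTarget(
        docs_per_second=docs_per_second,
        input_tokens_per_second=docs_per_second * input_tokens_per_doc,
        output_tokens_per_second=docs_per_second * output_tokens_per_doc,
    )


if __name__ == "__main__":
    # Two readings of the requirements: work spread over the full day,
    # or the whole day's volume compressed into a 2-hour window.
    for hours in (24, 2):
        t = required_throughput(500_000, hours)
        print(f"{hours:>2} h window: {t.docs_per_second:.1f} docs/s, "
              f"{t.output_tokens_per_second:.0f} output tokens/s")
```

The two windows differ by a factor of twelve in required throughput, so pinning down what the SLA actually measures is worth doing before sizing any hardware.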

The Default Mistake and Outcome

Many enterprises seem to be routing this work directly to GPUs. Not because the workload requires it, but because the accelerators exist. The outcome is predictable:

Premium equipment doing low-urgency work
Utilization spikes followed by idle time
Rising operational complexity
Cost per unit that escalates over time

How to AVOID making the default mistake

Let's evaluate the same workload across two production paths. Path A is a GPU-first approach: one H100-class server completes the batch work in minutes and idles the rest of the time. Path B uses three Xeon servers plus on-demand GPU capacity as needed.

Figure 2: Comparing two systems for the data-classification workload (1)

The punchline is simple: pay for the GPU only while the GPU is working. Renting GPUs in the cloud achieves that; if an enterprise decides to own them, the economics dictate careful workload placement to minimize GPU idle time. The argument goes further when inference can be served on CPUs within the SLA. Recent MLPerf Inference benchmarks show Intel® Xeon® processors reaching 450+ tokens/s of throughput on batchable models such as Llama-class 8B variants, well within the performance envelope required for high-volume, latency-tolerant production runs. GPUs retain an advantage in peak token generation rates for interactive workloads, but the gap narrows substantially under batching and relaxed latency budgets.
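The idle-time argument can be made concrete with a simple measure: the effective cost of one hour of useful work is the hourly cost divided by utilization. The sketch below is a minimal illustration under that assumption. The function names are hypothetical, and any prices passed in are placeholders chosen by the reader, not figures from Figure 2 or from any price list.

```python
def cost_per_busy_hour(hourly_cost: float, utilization: float) -> float:
    """Effective cost of one hour of useful work on a resource.

    utilization is the fraction of paid hours spent on useful work (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


def daily_cost_always_on(hourly_cost: float) -> float:
    """An owned or reserved resource is paid for all 24 hours of the day."""
    return hourly_cost * 24


def daily_cost_on_demand(hourly_cost: float, busy_hours: float) -> float:
    """An on-demand resource is paid for only while it is working."""
    if not 0 <= busy_hours <= 24:
        raise ValueError("busy_hours must be between 0 and 24")
    return hourly_cost * busy_hours


if __name__ == "__main__":
    placeholder_rate = 10.0  # hypothetical hourly cost, for illustration only
    # A server that is busy a quarter of the time costs four times its
    # list rate per hour of useful work.
    print(cost_per_busy_hour(placeholder_rate, 0.25))
```

Under this model, an accelerator that finishes its batch in minutes and then idles has a low utilization and therefore a high cost per busy hour, which is the pattern Path A describes.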

Also notable here is a fundamental capacity-planning problem that standard silicon benchmarking does not fully address. Benchmarks that rely on fixed-length, stateless synthetic prompts at uniform concurrency fail to reflect the heterogeneous, multi-turn nature of real developer-assistant traffic (source: VMware SPOC paper).

Agentic Workloads change the conversation further

The data classification example above is a clean case: a batch workload with a somewhat flexible SLA. Agentic workloads are different. They are not single-pass jobs; they are iterative systems in which execution alternates between compute types.

An agent doesn't just generate a response. It runs a loop. The LLM thinks, calls a tool, waits for the result, thinks again, calls another tool, and waits again. Code execution, web search, database queries, file editing, API calls — each step alternating between GPU-bound generation and CPU/IO-bound tool execution. This isn't an architectural detail. It's the dominant shape of the workload.
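The loop described above can be sketched in a few lines. This is a toy simulation, not a real agent framework: `fake_llm`, the `TOOLS` table, and all durations are hypothetical stand-ins, and time is accumulated as simulated seconds rather than measured. It only shows how time splits between generation and tool execution in a synchronous loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class PhaseTimer:
    """Accumulates simulated seconds per phase of the agent loop."""
    seconds: Dict[str, float] = field(
        default_factory=lambda: {"generate": 0.0, "tool": 0.0})

    def add(self, phase: str, duration: float) -> None:
        self.seconds[phase] += duration

    def tool_fraction(self) -> float:
        total = sum(self.seconds.values())
        return self.seconds["tool"] / total if total else 0.0


def fake_llm(history: List[str]) -> Tuple[Optional[str], float]:
    """Hypothetical model: requests a tool on the first two turns, then answers.

    Returns (tool_name or None for a final answer, simulated generation seconds).
    """
    plan = ["run_tests", "query_db", None]
    return plan[min(len(history), len(plan) - 1)], 2.0


# Hypothetical tools: each returns (result, simulated execution seconds).
TOOLS: Dict[str, Callable[[], Tuple[str, float]]] = {
    "run_tests": lambda: ("3 passed", 4.0),
    "query_db": lambda: ("42 rows", 2.0),
}


def run_agent(max_steps: int = 10) -> PhaseTimer:
    timer = PhaseTimer()
    history: List[str] = []
    for _ in range(max_steps):
        tool_name, gen_s = fake_llm(history)   # GPU-bound generation
        timer.add("generate", gen_s)
        if tool_name is None:                  # model produced its final answer
            break
        result, tool_s = TOOLS[tool_name]()    # CPU/IO-bound tool execution
        timer.add("tool", tool_s)
        history.append(f"{tool_name} -> {result}")
    return timer


if __name__ == "__main__":
    t = run_agent()
    print(t.seconds, f"tool fraction = {t.tool_fraction():.2f}")
```

In a real deployment the same accounting would come from tracing spans around model calls and tool calls; the structure of the loop is what matters here.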

Figure 3: Workflow of LLM agent.

Researchers from Microsoft Research, Shanghai Jiao Tong University, and Stevens Institute of Technology measured the shape of agent workloads directly in production traces across coding, deep research, and scientific benchmarks. Their finding, published in Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution (Sui et al., arXiv:2603.18897, March 2026):

"Tool execution accounts for 35% to 61% of the total request time. This execution model forces LLMs to hold expensive memory resources, yet it still delivers long end-to-end latency."

Broken down by agent type in the paper's measurements:

Coding agents: ~60% of request time in tool execution
Deep research agents: ~50% of request time in tool execution
Scientific agents: ~36% of request time in tool execution

Viewed through a workload placement lens, this means that between one-third and nearly two-thirds of the wall-clock time an agent spends producing an answer goes to tool execution: running scripts, querying databases, fetching URLs, parsing JSON, editing files, and orchestrating control flow. That work is overwhelmingly CPU- and IO-bound. GPU utilization can drop significantly during these tool execution phases, particularly in synchronous or poorly pipelined agent loops.

The Question Every Agentic AI Budget Has to Answer

Enterprise AI discussions tend to treat CPUs and GPUs as interchangeable — as if the only question were which one runs inference. Agentic workloads change that framing. In an agent loop:

The GPU runs token generation.
The CPU runs everything between token generation: the tool calls, the orchestration, the data movement, the validation, the control flow, the retries.

Parts 1 and 2 of this blog series asked which jobs belong on the GPU-heavy system and which don't. Agentic workloads force a harder question. Below are typical CPU-to-GPU pairings offered by cloud providers, ratios perhaps tuned for a time when the GPU was the bottleneck and the CPU was, let's say, plumbing.

Figure 4: Cloud Instances for AI Applications, Source: AWS, GCP and Azure

Agent workloads don't behave that way. When 35% to 61% of request time is tool execution, under-provisioning the CPU side doesn't save money. It idles your most expensive hardware. So the question for anyone planning agentic infrastructure in the next budget cycle is this: if tool execution is half your wall-clock time, what CPU-to-GPU ratios meet your SLAs at the lowest cost?
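One way to start reasoning about that question is a simple interleaving model. If each agent loop keeps the GPU busy only for the generation share of its wall-clock time, other loops can use the GPU while it waits on tools. The sketch below assumes ideal interleaving with no batching, queueing, or memory limits, which is a strong simplification. The function names and the notion of a single "GPU slot" are illustrative, and the tool-time fractions are the approximate per-agent figures quoted earlier.

```python
def agents_to_saturate_gpu(tool_fraction: float) -> float:
    """Concurrent agent loops needed so generation work fills one GPU slot.

    Each loop keeps the GPU busy for (1 - tool_fraction) of its wall-clock
    time, so roughly 1 / (1 - tool_fraction) loops are needed to fill it.
    """
    if not 0 <= tool_fraction < 1:
        raise ValueError("tool_fraction must be in [0, 1)")
    return 1.0 / (1.0 - tool_fraction)


def concurrent_tool_calls(tool_fraction: float) -> float:
    """Average tool executions in flight on the CPU side at that concurrency."""
    return agents_to_saturate_gpu(tool_fraction) * tool_fraction


if __name__ == "__main__":
    # Approximate tool-time shares by agent type, as quoted from the paper.
    shares = {"coding": 0.60, "deep research": 0.50, "scientific": 0.36}
    for name, f in shares.items():
        print(f"{name}: {agents_to_saturate_gpu(f):.2f} loops per GPU slot, "
              f"{concurrent_tool_calls(f):.2f} tool calls in flight")
```

Even this rough model shows the direction of the effect: the larger the tool-time share, the more CPU-side work must run in parallel to keep one GPU slot busy. Real sizing would also need measured tool CPU cost per call, batching behavior, and memory held during waits.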

References:

The workload assumptions in this example are intentionally illustrative and derived from common enterprise classification and enrichment patterns rather than private benchmarking or lab validation. The scenario models a latency-tolerant enterprise workflow where a single inference request may include a customer email thread, support case history, CRM notes, ticket metadata, compliance context, and prompt instructions combined into an approximately 2,000-token input. The expected output is intentionally small (~150 tokens) and focused on structured classifications such as routing tags, sentiment, escalation indicators, summaries, confidence scoring, or JSON-formatted metadata. This creates an input-heavy, output-light workload profile representative of enterprise tagging, enrichment, and routing systems rather than long-form generative assistant interactions. The throughput, utilization, and cost examples are based on public benchmark data, public cloud infrastructure configurations, and conservative operational assumptions intended.

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Tom_Tom · ‎06-01-2026

Agentic AI exposes an interesting inefficiency. During tool execution, the bottleneck is often orchestration, data movement, and control flow rather than token generation. If that trend continues, infrastructure discussions may need to evolve beyond "How many GPUs?" and toward "How is work being coordinated across CPUs, GPUs, and specialized accelerators?" The economics could look very different.