Authors:
Ehssan Khan
Benjamin Consolvo
With the launch of the C4 series, Google Cloud now offers access to Intel® Xeon® 6 with P-cores (code-named Granite Rapids) CPUs. These processors are well-suited for a variety of workloads, including agentic AI systems—particularly those that rely on small and medium language models with fewer than 20 billion parameters.
In this walkthrough, we’ll show how to build a basic agentic AI solution using CPUs on Google Cloud, combining LangChain for agents and orchestration, self-hosted Model Context Protocol (MCP) servers, and vLLM for efficient model serving.
Selecting a Google Cloud Virtual Machine
To launch a CPU virtual machine, you’ll need a Google Cloud account with billing enabled. Visit cloud.google.com, navigate to Compute Engine, enable the API, and click Create Instance.
For this guide, we chose the c4-highmem-48-lssd (x86, Intel) configuration with 24 physical cores and 372 GB of memory. If your workload requires more compute or memory, you can scale up to a larger configuration.
Figure 1: Creating a Virtual Machine (VM) on Google Cloud
By default, the VM’s boot disk is set to just 10 GB, which is insufficient even for basic installation files. Before creating the VM, navigate to the OS and storage section and increase the disk size to 500 GB to ensure adequate space for dependencies and model files.
Once the VM is created, you can connect via SSH using the command provided in the Google Cloud Console. The SSH command should look something like this:
gcloud compute ssh --zone "us-central1-a" "instance-20250729-191625" --project "dcg-xeon-ai-intel"
You can use any terminal or the Cloud Shell Editor, which provides a familiar VS Code-like interface directly in your browser. Google Gemini is integrated into the console and can assist with code generation and debugging based on your terminal output.
Figure 2: Google Cloud Shell Editor with Gemini
As a quick sanity check that we are indeed running on Xeon 6, we can run `lscpu`. The output should include a model name like the following:
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) 6985P-C CPU @ 2.30GHz
We can also check that we have enough mounted storage:
df -h
The output should show a ~500 GB root drive (/dev/nvme2n1p1) after the boot disk resize:
Filesystem       Size  Used Avail Use% Mounted on
udev              61G     0   61G   0% /dev
tmpfs             13G  688K   13G   1% /run
/dev/nvme2n1p1   492G  2.6G  470G   1% /
tmpfs             61G     0   61G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
/dev/nvme2n1p15  124M   12M  113M  10% /boot/efi
tmpfs             13G     0   13G   0% /run/user/1000
vLLM Model Serving
To enable agentic AI workflows, we first need to serve a language model that agents can interact with. vLLM is a high-performance library for LLM model serving and inference, designed for scalability and efficiency.
For installation on x86 architecture, we follow the instructions on the x86 CPU – vLLM documentation page, specifically the Build wheel from source section. While the official guide is comprehensive, we’ve added a few additional steps here that are necessary when building on a fresh Google Cloud VM instance with Debian 12. For more details beyond this walkthrough, feel free to explore the vLLM GitHub repo and their official documentation. The vLLM parameters we use are outlined below.
Table 1: vLLM parameters for this walkthrough

| Parameter | Value |
| --- | --- |
| Model | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Tensor parallel size (`-tp`) | 1 |
| Data type (`--dtype`) | bfloat16 |
| Block size (`--block-size`) | 128 |
| Port (`--port`) | 8089 |
| Distributed executor backend | mp |
| Tool calling | `--enable-auto-tool-choice`, `--tool-call-parser pythonic` |
| KV cache space (`VLLM_CPU_KVCACHE_SPACE`) | 40 GB |
1. Create and activate a new virtual environment.
sudo apt install python3.11-venv -y
python3 -m venv llms
source llms/bin/activate
2. Install GCC and other system dependencies.
sudo apt update -y
sudo apt install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
3. Clone the vLLM project.
git clone --branch v0.9.2 https://github.com/vllm-project/vllm.git vllm_0.9.2
cd vllm_0.9.2
4. Install Python packages for vLLM CPU backend.
pip install --upgrade pip
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install transformers==4.53.1
Install transformers v4.53.1, as later versions are not compatible with vLLM v0.9.2.
5. Build vLLM.
VLLM_TARGET_DEVICE=cpu python setup.py install
6. Set environment variables.
To get the best performance out of the CPU, set the following vLLM environment variables; you can read more about the related runtime environment variables in the vLLM documentation.
unset VLLM_ATTENTION_BACKEND
export VLLM_USE_V1=1
export VLLM_RPC_TIMEOUT=1000000
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
export VLLM_CPU_OMP_THREADS_BIND=auto
export VLLM_CPU_KVCACHE_SPACE=40
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"
Before continuing, make sure to open a new terminal window, activate your virtual environment, and reconfigure the necessary environment variables. If this step is skipped, vLLM may throw an error indicating that it cannot locate a required library.
7. Log in with the Hugging Face CLI.
For this guide, we use the Llama-3.1-8B-Instruct model hosted on Hugging Face, though you’re free to use any model compatible with vLLM. If you choose a Llama model, note that access is gated by Meta. You’ll need to request access via the model card on Hugging Face and authenticate your Hugging Face account using the Hugging Face Hub CLI.
pip install -U "huggingface_hub[cli]"
hf auth login
Once you are approved for the gated model, fetch your access token from the Hugging Face website and enter it at the command-line prompt.
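If you prefer to authenticate from a script rather than the interactive prompt, a minimal sketch using the `huggingface_hub` Python API is shown below. It assumes you have already exported your access token as an environment variable named HF_TOKEN (that variable name is our choice for this example).

# hf_login.py: optional programmatic alternative to `hf auth login`
import os

from huggingface_hub import login

# Assumes the token was exported beforehand, e.g. export HF_TOKEN=hf_xxx
login(token=os.environ["HF_TOKEN"])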
8. Serve the vLLM model.
Use the following command to serve an OpenAI-compatible vLLM model:
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct -tp=1 --trust-remote-code --block-size 128 --port 8089 --dtype bfloat16 --distributed-executor-backend mp --enable-auto-tool-choice --tool-call-parser pythonic
The end of the output should indicate that the model is running.
INFO:     Started server process [22501]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
In another terminal, you can verify that the model is being hosted by using a curl command:
curl http://localhost:8089/v1/models
This command should return a JSON response showing the models that are being served:
{ "object": "list", "data": [ { "id": "meta-llama/Meta-Llama-3.1-8B-Instruct", "object": "model", "created": 1753888467, "owned_by": "vllm", "root": "meta-llama/Meta-Llama-3.1-8B-Instruct", "parent": null, "max_model_len": 131072, "permission": [ { "id": "modelperm-a758d416a44948b7b268b9be78f661d9", "object": "model_permission", "created": 1753888467, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] } |
9. Benchmark the vLLM model.
Open a new terminal window, activate the virtual environment, and execute the following command from the `benchmarks` directory of the cloned vLLM repository to benchmark the model.
python3 benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 2 --request-rate inf --seed 2048 --ignore-eos --port 8089
Learn more about the benchmarking suite by visiting the benchmark_serving.py file on the vLLM GitHub.
MCP Server Setup
To demonstrate the use of AI agents, we will set up a couple of Model Context Protocol (MCP) servers, hosted directly on the Intel Xeon 6 CPU.
Before starting up the servers, we need to install the required libraries for MCP and LangChain:
pip install -r requirements.txt
where the `requirements.txt` file contains the following libraries:
mcp
langchain>=0.0.267
langgraph
langchain-mcp-adapters
langchain_openai
Once the packages are installed, we can run each MCP server in its own terminal window. Just make sure to activate the virtual environment each time you open a new terminal.
Here is a simple MCP server with two mathematics tools: addition and multiplication. Run the MCP math server in its own terminal window with:
python3 mcp_math.py
The output should just be blank if successful. The contents of the `mcp_math.py` script are as follows:
# mcp_math.py
from mcp.server.fastmcp import FastMCP

# Create the MCP server
mcp = FastMCP("Math")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

if __name__ == "__main__":
    # Serve the tools over stdio
    mcp.run(transport="stdio")
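Before wiring the server into LangChain, you can optionally exercise it directly with the MCP Python SDK. The sketch below is a quick test under a few assumptions: it uses the stdio transport, so the client launches `mcp_math.py` itself (no separate terminal is needed for this check), and the tool names `add` and `multiply` match the script above.

# test_math_server.py: quick check of the math MCP server over stdio
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # With stdio, the client spawns the server script as a subprocess
    params = StdioServerParameters(command="python3", args=["mcp_math.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [tool.name for tool in tools.tools])
            result = await session.call_tool("add", {"a": 2, "b": 3})
            print("add(2, 3) ->", result.content)

if __name__ == "__main__":
    asyncio.run(main())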
We can add another tool for the agents to use: an MCP weather server that fetches the weather forecast for a given location. It calls the US National Weather Service API (https://api.weather.gov). Run the MCP weather server in its own terminal window with:
python3 mcp_weather.py
The output should just be blank if successful. The contents of the `mcp_weather.py` file are as follows:
# mcp_weather.py
from typing import Any

import httpx  # installed as a dependency of the mcp package
from mcp.server.fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("Weather")

# Constants
NWS_API_BASE = "https://api.weather.gov"
USER_AGENT = "weather-app/1.0"

async def make_nws_request(url: str) -> dict[str, Any] | None:
    """Make a request to the National Weather Service API with basic error handling."""
    headers = {"User-Agent": USER_AGENT, "Accept": "application/geo+json"}
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(url, headers=headers, timeout=30.0)
            response.raise_for_status()
            return response.json()
        except Exception:
            return None

def format_alert(feature: dict) -> str:
    """Format an alert feature into a readable string."""
    props = feature["properties"]
    return (
        f"Event: {props.get('event', 'Unknown')}\n"
        f"Area: {props.get('areaDesc', 'Unknown')}\n"
        f"Severity: {props.get('severity', 'Unknown')}\n"
        f"Description: {props.get('description', 'No description available')}"
    )

@mcp.tool()
async def get_alerts(state: str) -> str:
    """Get active weather alerts for a US state.

    Args:
        state: Two-letter US state code (e.g. CA, NY)
    """
    data = await make_nws_request(f"{NWS_API_BASE}/alerts/active/area/{state}")

    if not data or "features" not in data:
        return "Unable to fetch alerts or no alerts found."

    if not data["features"]:
        return "No active alerts for this state."

    alerts = [format_alert(feature) for feature in data["features"]]
    return "\n---\n".join(alerts)

@mcp.tool()
async def get_forecast(latitude: float, longitude: float) -> str:
    """Get the weather forecast for a location.

    Args:
        latitude: Latitude of the location
        longitude: Longitude of the location
    """
    points_data = await make_nws_request(f"{NWS_API_BASE}/points/{latitude},{longitude}")

    if not points_data:
        return "Unable to fetch forecast data for this location."

    # Get the forecast URL from the points response
    forecast_url = points_data["properties"]["forecast"]
    forecast_data = await make_nws_request(forecast_url)

    if not forecast_data:
        return "Unable to fetch detailed forecast."

    # Format the periods into a readable forecast
    forecasts = []
    for period in forecast_data["properties"]["periods"][:5]:
        forecasts.append(
            f"{period['name']}: {period['temperature']}°{period['temperatureUnit']}, "
            f"{period['detailedForecast']}"
        )

    return "\n---\n".join(forecasts)

if __name__ == "__main__":
    # Serve the tools over stdio
    mcp.run(transport="stdio")
LangChain AI Agents
The LangChain agent can now be constructed to use the model served with vLLM together with the tools exposed by the MCP servers. Run the agent with:
python3 client.py
where the contents of the `client.py` file are as follows:
# client.py
import asyncio
import json
import os

from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Connection details for the vLLM OpenAI-compatible server started earlier
llm = "meta-llama/Meta-Llama-3.1-8B-Instruct"
base_url = "http://localhost:8089/v1"
api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")  # vLLM does not enforce an API key by default

def to_serializable(obj):
    """Convert LangChain message objects into plain data structures for printing."""
    if isinstance(obj, dict):
        return {k: to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_serializable(v) for v in obj]
    return getattr(obj, "content", str(obj))

async def main():
    # Point the agent's chat model at the locally served vLLM endpoint
    model = ChatOpenAI(model=llm, base_url=base_url, api_key=api_key)

    # Register both MCP servers; with the stdio transport, the client launches
    # the server scripts itself as subprocesses
    mcp_client = MultiServerMCPClient(
        {
            "math": {"command": "python3", "args": ["mcp_math.py"], "transport": "stdio"},
            "weather": {"command": "python3", "args": ["mcp_weather.py"], "transport": "stdio"},
        }
    )
    tools = await mcp_client.get_tools()

    # Build a ReAct-style agent that can call the MCP tools
    agent = create_react_agent(model, tools)

    weather_response = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "What is the weather forecast for latitude 37.77, longitude -122.42?"}]}
    )
    print(json.dumps(to_serializable(weather_response), indent=2))

if __name__ == "__main__":
    asyncio.run(main())
We’ve now walked through how to run a complete agentic AI workflow entirely on CPUs in Google Cloud. These steps include:
- How to set up vLLM model serving;
- How to host MCP servers; and
- How to use LangChain agents in conjunction with the served vLLM model (meta-llama/Meta-Llama-3.1-8B-Instruct) and the running MCP servers.
If you’d like to dive deeper into the Intel Xeon 6 product line, you can explore the product brief. Join us in the Intel DevHub Discord server to exchange ideas with other developers or ask questions directly to Intel engineers.
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.