Authors:
Ehssan Khan
Benjamin Consolvo
With the launch of the C4 series, Google Cloud now offers access to Intel® Xeon® 6 with P-cores (code-named Granite Rapids) CPUs. These processors are well-suited for a variety of workloads, including agentic AI systems—particularly those that rely on small and medium language models with fewer than 20 billion parameters.
In this walkthrough, we’ll show how to build a basic agentic AI solution using CPUs on Google Cloud, combining LangChain for agents and orchestration, self-hosted Model Context Protocol (MCP) servers, and vLLM for efficient model serving.
Selecting a Google Cloud Virtual Machine
To launch a CPU virtual machine, you’ll need a Google Cloud account with billing enabled. Visit cloud.google.com, navigate to Compute Engine, enable the API, and click Create Instance.
For this guide, we chose the c4-highmem-48-lssd (x86, Intel) configuration with 24 physical cores and 372 GB of memory. If your workload requires more compute or memory, you can scale up to a larger configuration.
Figure 1: Creating a Virtual Machine (VM) on Google Cloud
By default, the VM’s boot disk is set to just 10 GB, which is insufficient even for basic installation files. Before creating the VM, navigate to the OS and storage section and increase the disk size to 500 GB to ensure adequate space for dependencies and model files.
Once the VM is created, you can connect via SSH using the command provided in the Google Cloud Console. The SSH command should look something like this:
gcloud compute ssh --zone "us-central1-a" "instance-20250729-191625" --project "dcg-xeon-ai-intel"
You can use any terminal or the Cloud Shell Editor, which provides a familiar VS Code-like interface directly in your browser. Google Gemini is integrated into the console and can assist with code generation and debugging based on your terminal output.
Figure 2: Google Cloud Shell Editor with Gemini
As a quick sanity check that we are indeed running on Xeon 6, we can run `lscpu`. The output should include a model name like the following:
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) 6985P-C CPU @ 2.30GHz
We can also check that we have enough mounted storage:
df -h
The output should show a ~500 GB root drive (/dev/nvme2n1p1) after the boot disk resize:
Filesystem       Size  Used Avail Use% Mounted on
udev              61G     0   61G   0% /dev
tmpfs             13G  688K   13G   1% /run
/dev/nvme2n1p1   492G  2.6G  470G   1% /
tmpfs             61G     0   61G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
/dev/nvme2n1p15  124M   12M  113M  10% /boot/efi
tmpfs             13G     0   13G   0% /run/user/1000
vLLM Model Serving
To enable agentic AI workflows, we first need to serve a language model that agents can interact with. vLLM is a high-performance library for LLM model serving and inference, designed for scalability and efficiency.
For installation on x86 architecture, we follow the instructions on the x86 CPU – vLLM documentation page, specifically the Build wheel from source section. While the official guide is comprehensive, we’ve added a few additional steps here that are necessary when building on a fresh Google Cloud VM instance with Debian 12. For more details beyond this walkthrough, feel free to explore the vLLM GitHub repo and their official documentation. The vLLM parameters we use are outlined below.
Table 1: vLLM parameters for this walkthrough

| Parameter | Value |
| --- | --- |
| Model | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Tensor parallel size (`-tp`) | 1 |
| Data type (`--dtype`) | bfloat16 |
| Block size (`--block-size`) | 128 |
| Port (`--port`) | 8089 |
| Distributed executor backend | mp |
| Tool calling | `--enable-auto-tool-choice`, `--tool-call-parser pythonic` |
| KV cache space (`VLLM_CPU_KVCACHE_SPACE`) | 40 GB |
1. Create and activate a new virtual environment.
sudo apt install python3.11-venv -y
python3 -m venv llms
source llms/bin/activate
2. Install GCC and other system dependencies.
sudo apt update -y
sudo apt install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
3. Clone the vLLM project.
git clone --branch v0.9.2 https://github.com/vllm-project/vllm.git vllm_0.9.2
cd vllm_0.9.2
4. Install Python packages for vLLM CPU backend.
pip install --upgrade pip
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install transformers==4.53.1
Install transformers v4.53.1, as later versions are not compatible with vLLM v0.9.2.
5. Build vLLM.
VLLM_TARGET_DEVICE=cpu python setup.py install
6. Set environment variables.
To get the best performance out of the CPU, set the following vLLM environment variables; you can read more about the related runtime environment variables in the vLLM documentation.
unset VLLM_ATTENTION_BACKEND
export VLLM_USE_V1=1
export VLLM_RPC_TIMEOUT=1000000
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
export VLLM_CPU_OMP_THREADS_BIND=auto
export VLLM_CPU_KVCACHE_SPACE=40
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"
Before continuing, make sure to open a new terminal window, activate your virtual environment, and reconfigure the necessary environment variables. If this step is skipped, vLLM may throw an error indicating that it cannot locate a required library.
7. Log in with the Hugging Face CLI.
For this guide, we use the Llama-3.1-8B-Instruct model hosted on Hugging Face, though you’re free to use any model compatible with vLLM. If you choose a Llama model, note that access is gated by Meta. You’ll need to request access via the model card on Hugging Face and authenticate your Hugging Face account using the Hugging Face Hub CLI.
pip install -U "huggingface_hub[cli]"
hf auth login
Once you are approved for the gated model, fetch your access token from the Hugging Face website and enter it at the command-line prompt.
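If you prefer to authenticate from a script rather than the interactive prompt, a minimal sketch using the `huggingface_hub` Python API is shown below. It assumes you have already exported your access token as an environment variable named HF_TOKEN (that variable name is our choice for this example).

# hf_login.py: optional programmatic alternative to `hf auth login`
import os

from huggingface_hub import login

# Assumes the token was exported beforehand, e.g. export HF_TOKEN=hf_xxx
login(token=os.environ["HF_TOKEN"])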
8. Serve the vLLM model.
Use the following command to serve an OpenAI-compatible vLLM model:
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct -tp=1 --trust-remote-code --block-size 128 --port 8089 --dtype bfloat16 --distributed-executor-backend mp --enable-auto-tool-choice --tool-call-parser pythonic
The end of the output should indicate that the model is running.
INFO:     Started server process [22501]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
In another terminal, you can verify that the model is being hosted by using a curl command:
curl http://localhost:8089/v1/models
This command should return a JSON response showing the models that are being served:
{ "object": "list", "data": [ { "id": "meta-llama/Meta-Llama-3.1-8B-Instruct", "object": "model", "created": 1753888467, "owned_by": "vllm", "root": "meta-llama/Meta-Llama-3.1-8B-Instruct", "parent": null, "max_model_len": 131072, "permission": [ { "id": "modelperm-a758d416a44948b7b268b9be78f661d9", "object": "model_permission", "created": 1753888467, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] } |
9. Benchmark the vLLM model.
Open a new terminal window, activate the virtual environment, and execute the following command from the `benchmarks` directory of the cloned vLLM repository to benchmark the model.
python3 benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name random --random-input-len 128 --random-output-len 128 --num-prompts 2 --request-rate inf --seed 2048 --ignore-eos --port 8089
Learn more about the benchmarking suite by visiting the benchmark_serving.py file on the vLLM GitHub.
MCP Server Setup
To demonstrate the use of AI agents, we will set up a couple of Model Context Protocol (MCP) servers, hosted directly on the Intel Xeon 6 CPU.
Before starting up the servers, we need to install the required libraries for MCP and LangChain:
pip install -r requirements.txt
where the `requirements.txt` file contains the following libraries:
mcp
langchain>=0.0.267
langgraph
langchain-mcp-adapters
langchain_openai
Once the packages are installed, we can run each MCP server in its own terminal window. Just make sure to activate the virtual environment each time you open a new terminal.
Here is a simple MCP server with two mathematics tools: addition and multiplication. Run the MCP math server in its own terminal window with:
python3 mcp_math.py
The output should just be blank if successful. The contents of the `mcp_math.py` script are as follows:
# mcp_math.py
from mcp.server.fastmcp import FastMCP

# Create the MCP server
mcp = FastMCP("Math")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

if __name__ == "__main__":
    # Serve the tools over stdio
    mcp.run(transport="stdio")
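Before wiring the server into LangChain, you can optionally exercise it directly with the MCP Python SDK. The sketch below is a quick test under a few assumptions: it uses the stdio transport, so the client launches `mcp_math.py` itself (no separate terminal is needed for this check), and the tool names `add` and `multiply` match the script above.

# test_math_server.py: quick check of the math MCP server over stdio
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # With stdio, the client spawns the server script as a subprocess
    params = StdioServerParameters(command="python3", args=["mcp_math.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [tool.name for tool in tools.tools])
            result = await session.call_tool("add", {"a": 2, "b": 3})
            print("add(2, 3) ->", result.content)

if __name__ == "__main__":
    asyncio.run(main())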
We can add another tool for the agents to use: an MCP weather server that fetches the weather forecast for a given location. It calls the US National Weather Service API (https://api.weather.gov). Run the MCP weather server in its own terminal window with:
python3 mcp_weather.py
The output should just be blank if successful. The contents of the `mcp_weather.py` file are as follows:
# mcp_weather.py
from typing import Any

import httpx  # installed as a dependency of the mcp package
from mcp.server.fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("Weather")

# Constants
NWS_API_BASE = "https://api.weather.gov"
USER_AGENT = "weather-app/1.0"

async def make_nws_request(url: str) -> dict[str, Any] | None:
    """Make a request to the National Weather Service API with basic error handling."""
    headers = {"User-Agent": USER_AGENT, "Accept": "application/geo+json"}
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(url, headers=headers, timeout=30.0)
            response.raise_for_status()
            return response.json()
        except Exception:
            return None

def format_alert(feature: dict) -> str:
    """Format an alert feature into a readable string."""
    props = feature["properties"]
    return (
        f"Event: {props.get('event', 'Unknown')}\n"
        f"Area: {props.get('areaDesc', 'Unknown')}\n"
        f"Severity: {props.get('severity', 'Unknown')}\n"
        f"Description: {props.get('description', 'No description available')}"
    )

@mcp.tool()
async def get_alerts(state: str) -> str:
    """Get active weather alerts for a US state.

    Args:
        state: Two-letter US state code (e.g. CA, NY)
    """
    data = await make_nws_request(f"{NWS_API_BASE}/alerts/active/area/{state}")

    if not data or "features" not in data:
        return "Unable to fetch alerts or no alerts found."

    if not data["features"]:
        return "No active alerts for this state."

    alerts = [format_alert(feature) for feature in data["features"]]
    return "\n---\n".join(alerts)

@mcp.tool()
async def get_forecast(latitude: float, longitude: float) -> str:
    """Get the weather forecast for a location.

    Args:
        latitude: Latitude of the location
        longitude: Longitude of the location
    """
    points_data = await make_nws_request(f"{NWS_API_BASE}/points/{latitude},{longitude}")

    if not points_data:
        return "Unable to fetch forecast data for this location."

    # Get the forecast URL from the points response
    forecast_url = points_data["properties"]["forecast"]
    forecast_data = await make_nws_request(forecast_url)

    if not forecast_data:
        return "Unable to fetch detailed forecast."

    # Format the periods into a readable forecast
    forecasts = []
    for period in forecast_data["properties"]["periods"][:5]:
        forecasts.append(
            f"{period['name']}: {period['temperature']}°{period['temperatureUnit']}, "
            f"{period['detailedForecast']}"
        )

    return "\n---\n".join(forecasts)

if __name__ == "__main__":
    # Serve the tools over stdio
    mcp.run(transport="stdio")
LangChain AI Agents
The LangChain agent can now be constructed to use the model served with vLLM together with the tools exposed by the MCP servers. Run the agent with:
python3 client.py
where the contents of the `client.py` file are as follows:
# client.py
import asyncio
import json
import os

from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Connection details for the vLLM OpenAI-compatible server started earlier
llm = "meta-llama/Meta-Llama-3.1-8B-Instruct"
base_url = "http://localhost:8089/v1"
api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")  # vLLM does not enforce an API key by default

def to_serializable(obj):
    """Convert LangChain message objects into plain data structures for printing."""
    if isinstance(obj, dict):
        return {k: to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_serializable(v) for v in obj]
    return getattr(obj, "content", str(obj))

async def main():
    # Point the agent's chat model at the locally served vLLM endpoint
    model = ChatOpenAI(model=llm, base_url=base_url, api_key=api_key)

    # Register both MCP servers; with the stdio transport, the client launches
    # the server scripts itself as subprocesses
    mcp_client = MultiServerMCPClient(
        {
            "math": {"command": "python3", "args": ["mcp_math.py"], "transport": "stdio"},
            "weather": {"command": "python3", "args": ["mcp_weather.py"], "transport": "stdio"},
        }
    )
    tools = await mcp_client.get_tools()

    # Build a ReAct-style agent that can call the MCP tools
    agent = create_react_agent(model, tools)

    weather_response = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "What is the weather forecast for latitude 37.77, longitude -122.42?"}]}
    )
    print(json.dumps(to_serializable(weather_response), indent=2))

if __name__ == "__main__":
    asyncio.run(main())
We’ve now walked through how to run a complete agentic AI workflow entirely on CPUs in Google Cloud. These steps include:
- How to set up vLLM model serving;
- How to host MCP servers; and
- How to use LangChain agents in conjunction with the served vLLM model (meta-llama/Meta-Llama-3.1-8B-Instruct) and the running MCP servers.
If you’d like to dive deeper into the Intel Xeon 6 product line, you can explore the product brief. Join us in the Intel DevHub Discord server to exchange ideas with other developers or ask questions directly to Intel engineers.
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.