Authors:
Rahul Unnikrishnan Nair, Head of Engineering, Intel Liftoff,
Sri Raj Aryan Karumuri, Sr Solutions Engineer, Intel Liftoff
Beyond Monolithic LLMs
Generative AI is evolving fast, and so are the conversations around it. While large language models like GPT-4, Claude, and Llama have taken center stage, they come with real trade-offs: high computational costs, latency, and deployment challenges.
What if there was a more efficient approach? One that leverages smaller, specialized models working together through a standardized protocol to achieve comparable or even superior results for specific tasks?
This blog explores how the Model Context Protocol (MCP) and Intel accelerators (Intel Max Series GPUs) enable the creation of efficient, modular AI agents without relying on heavyweight frameworks. We’ll dive into a practical example: a multi-modal recipe generation system that analyzes food images, identifies ingredients, searches for relevant recipes, and generates customized cooking instructions.
Image the Recipe generator agent sees
Generated Recipe
From LLMs to Agent Frameworks: The Evolution of AI Systems
The Rise of Large Language Models
Large Language Models (LLMs) have revolutionized AI by demonstrating remarkable capabilities across diverse tasks. These models, trained on vast corpora of text data, can generate human-like text, answer questions, translate languages, and even write code. Their key strength lies in their generality - a single model can handle a wide range of tasks without task-specific training.
However, this generality comes at a cost:
- Computational Demands: Running state-of-the-art LLMs requires significant computational resources.
- Latency Issues: Large models can introduce higher inference latency.
- Deployment Complexity: Deploying massive models in production environments presents challenges.
- Black Box Nature: Understanding exactly how these models arrive at specific outputs can be difficult.
The Emergence of Agent Frameworks
To address some of these limitations and extend LLM capabilities, agent frameworks like LangChain, AutoGPT, and others emerged. These frameworks enable LLMs to:
- Access External Tools: Connect to databases, APIs, and other external systems
- Maintain Context: Preserve information across multiple interactions
- Follow Multi-Step Reasoning: Break complex tasks into manageable steps
- Adapt Dynamically: Change strategies based on intermediate results
Agent frameworks typically implement, or help developers implement, patterns like ReAct (Reasoning + Acting), which combines:
- Reasoning: Step-by-step thinking through problems
- Acting: Taking concrete actions based on reasoning
(A minimal code sketch of this loop follows the diagram.)
Diagram 1: ReAct Pattern
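Here is that sketch: a minimal, framework-agnostic ReAct-style loop. The complete and tools callables are placeholders you would wire to your own model and tool registry; treat this as an illustration of the control flow rather than any particular framework's API.
from typing import Callable, Dict

def react_loop(task: str,
               complete: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    """Alternate model reasoning with tool calls until the model produces an answer."""
    context = f"Task: {task}"
    for _ in range(max_steps):
        # Reasoning: the model thinks step by step and proposes an action or a final answer
        step = complete(
            context
            + "\nThink step by step, then reply with either "
            + "'ACTION: <tool> | <input>' or 'FINISH: <answer>'."
        )
        if "FINISH:" in step:
            return step.split("FINISH:", 1)[1].strip()
        if "ACTION:" not in step:
            context += f"\n{step}"  # no action proposed; keep the thought and continue
            continue
        # Acting: run the requested tool and feed the observation back into the context
        action = step.split("ACTION:", 1)[1].strip()
        parts = action.split("|", 1)
        tool_name = parts[0].strip()
        tool_input = parts[1].strip() if len(parts) > 1 else ""
        observation = tools.get(tool_name, lambda _: "Unknown tool")(tool_input)
        context += f"\n{step}\nObservation: {observation}"
    return "No final answer within the step budget."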
While agent frameworks significantly extend LLM capabilities, they often:
- Remain tightly coupled to specific LLM providers
- Lack standardization in how tools and context are provided
- Can be complex to configure and maintain
- May introduce additional latency through their orchestration layers
By no means should these frameworks be avoided; they often provide a faster path to developing complex agents. The challenge typically emerges as these systems evolve, potentially becoming monolithic and bound by the specific constraints of the framework.
The Model Context Protocol (MCP): A New Paradigm
What is MCP?
The Model Context Protocol (MCP) is an open protocol that standardizes how applications provide context to AI models. Think of MCP as the “USB-C port for AI” - just as USB-C provides a standardized way to connect devices to various peripherals, MCP provides a standardized way to connect AI models to different data sources and tools.
MCP was initially developed by Anthropic for their Claude AI assistant but has since been released as an open protocol that any AI system can implement. The protocol defines how AI systems can:
- Access Resources: Standardized ways to retrieve information
- Use Tools: Execute functions and receive structured results
- Follow Prompts: Use templates for common interaction patterns
- Sample Text: Generate text completions with specific parameters
Key Components of MCP
MCP consists of several core components:
- MCP Servers: Lightweight programs that expose specific capabilities through the standardized protocol
- MCP Clients: Applications that connect to MCP servers and use their capabilities
- Transport Layer: Defines how messages are exchanged (typically via Server-Sent Events or stdio)
- Message Types: Standardized formats for requests, responses, and notifications
Diagram 2: MCP
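For intuition, here is a simplified sketch of what calling a tool looks like on the wire, written as Python dictionaries that mirror the JSON-RPC 2.0 messages MCP uses. The field names follow the protocol at a high level, and the get_weather tool is a hypothetical example (we will build one with FastMCP shortly); consult the MCP specification for the authoritative schema.
# Sketch of a "tools/call" request and its result, as a client and server might
# exchange them over SSE or stdio (illustrative, not a normative payload)
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",  # a tool the server exposes
        "arguments": {"location": "Lisbon", "unit": "celsius"},
    },
}
tool_call_result = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "Weather in Lisbon: 22°C, Partly Cloudy"}],
    },
}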
Why MCP Matters
MCP offers several significant advantages over traditional agent frameworks:
- Standardization: A common protocol for all AI systems to interact with tools and data
- Isolation: Clear separation between AI models and the tools they use
- Security: Tools run in isolated environments with explicit permissions
- Modularity: Easy to add, remove, or update individual components
- Interoperability: Switch between different AI providers without changing tools
- Efficiency: Use specialized models for specific tasks rather than one large model for everything
FastMCP: A Pythonic Implementation of MCP
While the MCP protocol can be implemented directly, frameworks like FastMCP make it much easier to build MCP servers and clients. FastMCP is a high-level, Pythonic framework inspired by FastAPI that simplifies MCP implementation.
Key Features of FastMCP
FastMCP provides:
- Simple Tool Creation: Create tools with Python function decorators
- Resource Management: Easily expose data as resources
- Prompt Templates: Define reusable interaction patterns
- Client Library: Connect to and use MCP servers
- Server Composition: Combine multiple servers into unified interfaces
- Async Support: Built on modern async Python
Here’s a simple example of creating an MCP tool with FastMCP:
from fastmcp import FastMCP
server = FastMCP("WeatherServer")
@server.tool()
def get_weather(location: str, unit: str = "celsius") -> str:
"""
Get the current weather for a location.
Args:
location: City or location name
unit: Temperature unit (celsius or fahrenheit)
Returns:
Current weather information
"""
# Implementation details here
return f"Weather in {location}: 22°{unit[0].upper()}, Partly Cloudy"
if __name__ == "__main__":
server.run()
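FastMCP also ships a client library for calling such a server. A minimal sketch, assuming the server above is started with an SSE transport and reachable at the URL shown (this is the same connection pattern our recipe orchestrator uses later):
import asyncio
from fastmcp import Client

async def main():
    # Connect over SSE and invoke the get_weather tool defined above
    client = Client("http://localhost:8000/sse")
    async with client:
        result = await client.call_tool("get_weather", {"location": "Lisbon"})
        print(result)

if __name__ == "__main__":
    asyncio.run(main())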
Building a Multi-Modal Recipe Agent with MCP
To keep the focus on the Model Context Protocol's application and ensure this blog post remains manageable in length, the following code snippets are illustrative. They will demonstrate the core interactions and structure, with some internal implementation details simplified or stubbed. A full production codebase would naturally be more extensive. This system will:
- Analyze food images to identify ingredients
- Search for relevant recipes
- Generate customized cooking instructions
System Architecture Overview
Our multi-modal recipe agent consists of three specialized MCP servers and a client orchestrator:
Diagram 3: Recipe Generator Agent
Each component has a specific role:
- Vision Server: Identifies food items in images using a specialized vision model
- Search Server: Searches for recipes based on identified ingredients
- LLM Server: Generates customized recipes based on ingredients and search results
- Client Orchestrator: Coordinates the workflow between servers
Component 1: Vision Server
The Vision Server is responsible for analyzing food images and identifying ingredients. It uses Moondream2, one of the best lightweight vision models we have worked with, prompted here to detect food items.
Diagram 4: Vision Server
Implementation Details
The Vision Server exposes a single tool called identify_food_items that takes an image path and returns a list of identified food items.
from fastmcp import FastMCP
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
server = FastMCP("VisionServer", host="0.0.0.0", port=8000)
model_name = "vikhyatk/moondream2"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(model_name)
@server.tool()
def identify_food_items(image_path: str) -> str:
"""
Identify food items in an image.
Args:
image_path: Path to the image file
Returns:
String with detected food items
"""
# Load and process the image
image = Image.open(image_path).convert("RGB")
# Generate prompt for food detection
prompt = "What food items can you see in this image? Return a JSON array of strings."
# Process image and prompt
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate response
outputs = model.generate(**inputs, max_new_tokens=512)
food_items = processor.decode(outputs[0], skip_special_tokens=True)
return food_items
if __name__ == "__main__":
server.run("sse")
The Vision Server is optimized for a specific task - food item detection - and doesn’t need the full capabilities of a general-purpose LLM.
Component 2: Search Server
The Search Server is responsible for finding recipes based on the identified ingredients. It uses DuckDuckGo search to find relevant recipes.
Diagram 5: Search Server
Implementation Details
The Search Server exposes a search_recipes tool that takes a list of ingredients and returns relevant recipe information.
from fastmcp import FastMCP
import json
import logging
from langchain_community.tools import DuckDuckGoSearchRun
server = FastMCP("SearchServer", host="0.0.0.0", port=8002)
search_tool = DuckDuckGoSearchRun()
@server.tool()
def search_recipes(ingredients: str) -> str:
"""
Search for recipes based on provided ingredients.
Args:
ingredients: List of ingredients to search recipes for
Returns:
JSON string with recipe information
"""
# Search for recipes
query = f"recipes with {ingredients} easy homemade"
    search_results = _run_search(query)
# Extract recipe names from search results
recipes = []
if search_results and len(search_results) > 100:
for line in search_results.split("\n"):
recipes.append(line.strip())
recipes = recipes[:5] if len(recipes) > 5 else recipes
# Return formatted results
result = {
"ingredients": ingredients,
"recipes": recipes,
"full_results": search_results[:1000] if search_results else "",
}
return json.dumps(result)
def _run_search(query: str) -> str:
    """Plain helper shared by both tools, so search_recipes does not call the
    tool-decorated search_web directly (the decorator may wrap the function)."""
    try:
        return search_tool.invoke(query)
    except Exception:
        return "No search results found."
@server.tool()
def search_web(query: str) -> str:
    """
    Search the web for information using DuckDuckGo.
    Args:
        query: The search query
    Returns:
        Search results as text
    """
    return _run_search(query)
if __name__ == "__main__":
server.run("sse")
The Search Server demonstrates how MCP can integrate with existing tools and libraries like LangChain’s DuckDuckGoSearchRun.
Component 3: LLM Server
The LLM Server generates customized recipes based on the identified ingredients and search results. It uses a smaller, more efficient language model (Qwen2.5-3B-Instruct) that’s specialized for text generation.
Diagram 6: LLM server
Implementation Details
The LLM Server exposes a generate_recipe tool that takes ingredients and search results and returns a customized recipe.
from fastmcp import FastMCP
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from typing import Any
server = FastMCP("LLMServer", host="0.0.0.0", port=8001)
# Initialize the model and tokenizer
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
@server.tool()
def generate_recipe(ingredients: Any, search_results: str = "", max_new_tokens: int = 512) -> str:
"""
Generate a recipe based on provided ingredients and optional search results.
Args:
ingredients: List of ingredients
search_results: Optional search results to inform recipe generation
        max_new_tokens: Maximum number of new tokens to generate
Returns:
Generated recipe text
"""
# Prepare the prompt
prompt_parts = [f"Based on these ingredients: {ingredients}"]
if search_results and len(search_results) > 10:
prompt_parts.append(f"And considering these recipe ideas: {search_results}")
prompt_parts.append("Create a detailed recipe with the following format:")
prompt_parts.append("Recipe name: [Creative name]")
prompt_parts.append("Brief description: [Short description]")
prompt_parts.append("Ingredients: [List of ingredients with quantities]")
prompt_parts.append("Simple step-by-step instructions: [Numbered steps]")
prompt_parts.append("Cooking time: [Time in minutes]")
prompt_parts.append("Servings: [Number of servings]")
prompt = "\n".join(prompt_parts)
# Generate the recipe
response = generator(
prompt,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
# Extract and format the generated text
generated_text = response[0]["generated_text"]
recipe_text = generated_text[len(prompt):].strip()
return recipe_text
if __name__ == "__main__":
server.run("sse")
The LLM Server, similar to our ingredients detector, uses a smaller, efficient language model (3B parameters) that’s fine-tuned specifically for instruction following and text generation. This specialized model provides excellent recipe generation capabilities without the computational overhead of a massive general-purpose LLM.
Component 4: Client Orchestrator
The Client Orchestrator coordinates the workflow between the three specialized servers. It handles user input, manages the sequence of operations, and presents the final output.
Diagram 7: Client Orchestrator
Implementation Details
The Client Orchestrator uses the MCP client library to connect to and interact with the three specialized servers.
import argparse
import asyncio
import json
import logging
import os
from typing import Any, Dict
from fastmcp import Client
class MultiModalAgent:
"""Multi-Modal Agent for analyzing food images and suggesting recipes"""
def __init__(self):
# Server configurations
self.vision_server_url = os.environ.get("VISION_SERVER_URL", "http://localhost:8000")
self.llm_server_url = os.environ.get("LLM_SERVER_URL", "http://localhost:8001")
self.search_server_url = os.environ.get("SEARCH_SERVER_URL", "http://localhost:8002")
# Retry configuration
self.max_retries = 3
self.retry_delay = 5 # seconds
async def suggest_recipe(self, image_path: str) -> Dict[str, Any]:
"""Analyze a food image and suggest recipes"""
print("🧠 Initializing Multi-Modal Recipe Agent...")
print(f" ️ Analyzing food image file: {image_path}")
# Step 1: Identify food items in the image
food_items = await self._identify_food_items(image_path)
print(" Detected food items")
# Step 2: Search for recipes based on the identified ingredients
recipe_search = await self._get_recipe_suggestions(food_items)
print(" Found recipe ideas")
# Step 3: Generate a customized recipe
recipe = await self._generate_recipe(food_items, recipe_search)
print(" Recipe Suggestion:")
return {
"food_items": food_items,
"recipe": recipe
}
async def _call_mcp_tool(self, server_url: str, tool_name: str, params: Dict[str, Any]) -> Any:
"""Call an MCP tool with simple retry logic"""
for attempt in range(self.max_retries + 1):
try:
client = Client(f"{server_url}/sse")
async with client:
result = await asyncio.wait_for(
client.call_tool(tool_name, params), timeout=60.0
)
return result
except Exception as e:
if attempt < self.max_retries:
await asyncio.sleep(self.retry_delay)
return None
async def _identify_food_items(self, image_path: str) -> str:
"""Identify food items in image using the Vision Server"""
result = await self._call_mcp_tool(
self.vision_server_url, "identify_food_items", {"image_path": image_path}
)
return self._extract_text_from_response(result)
async def _get_recipe_suggestions(self, food_items: str) -> str:
"""Get recipe suggestions using Search Server"""
search_result = await self._call_mcp_tool(
self.search_server_url, "search_recipes", {"ingredients": food_items}
)
return self._extract_text_from_response(search_result)
async def _generate_recipe(self, ingredients: str, search_results: str) -> str:
"""Generate a recipe using LLM Server"""
llm_params = {
"ingredients": str(ingredients),
"search_results": str(search_results),
"max_tokens": 1000,
}
recipe_result = await self._call_mcp_tool(
self.llm_server_url, "generate_recipe", llm_params
)
return self._extract_text_from_response(recipe_result)
def _extract_text_from_response(self, response: Any) -> str:
"""Helper method to extract text content from MCP responses"""
return str(response)
async def main():
"""Main function to run the multi-modal agent for recipe suggestions"""
parser = argparse.ArgumentParser(
description="Multi-Modal Agent for Recipe Suggestions"
)
parser.add_argument(
"--image", type=str, required=True, help="Path to food image file"
)
args = parser.parse_args()
agent = MultiModalAgent()
result = await agent.suggest_recipe(args.image)
print(result["recipe"])
if __name__ == "__main__":
asyncio.run(main())
The Client Orchestrator demonstrates how MCP enables the composition of specialized services into a cohesive workflow. Each server focuses on a specific task, and the orchestrator manages the flow of information between them.
While traditional agent frameworks like those using the ReAct pattern rely on a model with explicit reasoning steps, the MCP-based multi-modal agent above takes a different approach. It distributes intelligence across specialized components, each optimized for its specific modality or task, while still maintaining the core capability of processing and integrating multiple types of data (images and text) into a cohesive output.
Why This Architecture Matters: The Power of Specialization
The multi-modal recipe agent demonstrates several key advantages of the MCP approach:
1. Efficiency Through Specialization
Each component in our system is optimized for a specific task:
- Vision Server: Uses a fast vision model for food item detection
- Search Server: Focuses on web search and result extraction
- LLM Server: Uses a small, efficient language model for text generation
This specialization allows us to achieve excellent results with significantly lower computational requirements than using a single massive model for everything.
2. Isolation and Security
Each server runs in its own isolated environment with clearly defined inputs and outputs. This isolation provides several benefits:
- Security: Each component has only the permissions it needs
- Reliability: Issues in one component don’t affect others
- Maintainability: Components can be updated independently
3. Flexibility and Interoperability
The MCP architecture makes it easy to:
- Swap Components: Replace any server with an alternative implementation
- Add Capabilities: Extend the system with new servers and tools
- Scale Independently: Allocate resources based on each component’s needs
4. Reduced Latency
By using specialized models and efficient communication patterns, the MCP approach can achieve lower end-to-end latency than monolithic systems. Each component does exactly what it needs to do, without the overhead of a massive general-purpose model.
Containerization and Deployment
One of the key advantages of our MCP-based architecture is the ease of containerization and deployment. Each server can be packaged as a separate Docker container, allowing for independent scaling, updates, and resource allocation.
Docker Containerization
For our multi-modal recipe agent, we created separate Docker containers for each server and the client orchestrator:
Diagram 8: Container Orchestration
Here’s an example Dockerfile for the Vision Server:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Create cache directory with proper permissions
RUN mkdir -p /app/.cache && \
chmod -R 777 /app/.cache
COPY servers/vision_server.py .
CMD ["python", "vision_server.py"]
Using Docker Compose, we can easily orchestrate the deployment of all services:
services:
vision-server:
build:
context: .
dockerfile: docker/Dockerfile.vision
ports:
- "8000:8000"
volumes:
- ./data:/app/data
search-server:
build:
context: .
dockerfile: docker/Dockerfile.search
ports:
- "8002:8002"
llm-server:
build:
context: .
dockerfile: docker/Dockerfile.llm
ports:
- "8001:8001"
client:
build:
context: .
dockerfile: docker/Dockerfile.client
depends_on:
- vision-server
- search-server
- llm-server
volumes:
- ./data:/app/data
This containerized approach provides several benefits:
- Isolation: Each service runs in its own container with only the dependencies it needs. This isolation is especially critical because MCP servers can also execute arbitrary code; therefore, under no circumstances should these services be run directly on the host system.
- Portability: The entire system can be deployed on any platform that supports Docker
- Scalability: Individual services can be scaled independently based on demand
- Versioning: Each service can be versioned and updated independently
Optimizing for Intel GPUs
One additional advantage of our modular approach is the ability to optimize each component for specific hardware. In our case, we can leverage Intel® Data Center GPU Max 1100 for efficient inference across all components.
Intel Extension for PyTorch (IPEX)
Intel Extension for PyTorch (IPEX) is a library that extends PyTorch with optimizations for Intel hardware. It can significantly improve the performance of PyTorch models on Intel CPUs and GPUs.
Here’s how we can modify our LLM Server to use IPEX:
import intel_extension_for_pytorch as ipex
from fastmcp import FastMCP
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from typing import Any
server = FastMCP("LLMServer", host="0.0.0.0", port=8001)
# Initialize the model and tokenizer
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Optimize the model with IPEX
model = ipex.llm.optimize(model)  # optionally pass a dtype (e.g., torch.bfloat16)
# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
Similar optimizations can be applied to the Vision Server and any other components that use PyTorch models.
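As a concrete example, here is a minimal sketch of device placement for the Vision Server's model, assuming IPEX is installed with XPU support and an Intel GPU is visible to PyTorch; the dtype and optimization settings are illustrative and should be tuned per model.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoProcessor, AutoModelForImageTextToText

model_name = "vikhyatk/moondream2"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(model_name)

# Use the Intel GPU ("xpu") when available, otherwise fall back to CPU
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
model = model.to(device).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

# Inside identify_food_items, the processed inputs must live on the same device:
# inputs = {k: v.to(device) for k, v in inputs.items()}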
Lessons Learned: Insights from Building a Multi-Modal MCP Agent
Developing our multi-modal recipe agent provided valuable insights that can benefit others building similar systems:
1. Simplicity Over Complexity
One of our key learnings was the value of simplicity. Initially, we considered using complex agent frameworks like LangChain for the search functionality. However, we found that direct implementation using the DuckDuckGo search library provided better control and reduced dependencies.
# Before: Using LangChain's wrapper
from langchain.agents import Tool
from langchain_community.tools import DuckDuckGoSearchRun
search_tool = DuckDuckGoSearchRun()
# After: Direct implementation with the library
from duckduckgo_search import DDGS
ddgs = DDGS()
results = ddgs.text(query, max_results=10)
The lesson: Always question whether you need the full complexity of a framework or if a simpler, more direct approach would suffice.
2. Clear Interfaces Simplify Development
The MCP protocol enforces clear interfaces between components, which significantly simplified development and testing. Each server could be developed and tested independently, with well-defined inputs and outputs.
This approach allowed us to:
- Develop components in parallel
- Test components in isolation (see the sketch below)
- Replace implementations without affecting other parts of the system
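For example, FastMCP's client can connect to a server object in memory, which makes it easy to exercise a tool without starting an SSE endpoint or a container. A minimal sketch, assuming the search server module is importable under the path shown (the module path is an assumption about project layout):
import asyncio
from fastmcp import Client
from servers.search_server import server as search_server  # assumed module path

async def test_search_recipes():
    # In-memory transport: the client talks directly to the server instance
    async with Client(search_server) as client:
        result = await client.call_tool("search_recipes", {"ingredients": "tomato, basil"})
        print(result)

if __name__ == "__main__":
    asyncio.run(test_search_recipes())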
3. User Experience Matters
Even in a technical system, user experience considerations are important. We found that adding simple visual cues and clear status messages significantly improved the user’s understanding of the system’s operation.
4. Containerization Simplifies Deployment and Provides Process-Level Isolation
Using Docker containers for each component made deployment and testing much simpler.
We could easily:
- Test different configurations
- Deploy to different environments
- Scale individual components based on demand
The containerized approach also ensured consistency between development and production environments, reducing the “it works on my machine” problem.
The Future of Gen AI is Modular
The multi-modal recipe agent we’ve explored demonstrates a powerful alternative to monolithic LLM-based systems. By leveraging the Model Context Protocol (MCP) and specialized components, we can create AI systems that are:
- More Efficient: Using the right tool for each job
- More Secure: Isolating components and limiting permissions
- More Flexible: Easily swapping or upgrading components
- More Maintainable: Clearly defined interfaces and responsibilities
- More Resilient: Handling failures gracefully through proper error management
- More User-Friendly: Providing clear feedback and intuitive interactions
This approach represents a shift from the “one massive model for everything” paradigm to a more modular, specialized architecture. As AI continues to evolve, we expect to see more systems adopt this approach, combining the strengths of different models and tools through standardized protocols like MCP.
The future of AI is about building smarter systems, ones made of specialized parts that work together with purpose. MCP provides the standardized “connective tissue” that makes this possible, opening up new possibilities for efficient, powerful AI applications that can run on a variety of hardware, from data center servers to edge devices.
This modularity extends to diverse architectural strategies. While our recipe agent showcases an MCP-native orchestration for optimal control and leanness, MCP can also serve as a robust foundation for tools within hybrid architectures. In such scenarios, agent development frameworks could manage high-level planning, reasoning (e.g., using ReAct-like patterns), and conversational flow, while relying on MCP for standardized, secure, and efficient access to a rich ecosystem of specialized AI models, data sources, and traditional tools. This allows teams to leverage the strengths of both approaches – sophisticated agentic control from frameworks and a cleanly defined, interoperable service layer via MCP.
By using this modular approach, we can create AI systems that are not only more capable but also more accessible, efficient, and adaptable to specific needs. The multi-modal recipe agent is just one example of what’s possible when we break free from the constraints of monolithic models and embrace the power of specialized, interconnected components.
Try it Yourself on Intel® Tiber™ AI Cloud: You can run this code and explore the performance of the Intel Data Center GPU Max 1100 directly. Intel Tiber™ AI Cloud offers:
- Free JupyterLab Environment: Get hands-on access to a Max 1100 GPU for training and experimentation by creating an account at cloud.intel.com and launching a GPU-accelerated notebook from the "Training" section.
- Virtual Machines & Bare Metal: Access single Max 1100 GPU VMs (starting at $0.39/hr/card) or powerful multi-GPU systems connected via high-speed bridges. PoC credits are available for qualifying AI startups via the Intel® Liftoff program. Find more details on the Intel Tiber™ AI Cloud Pricing Page.
References
- Model Context Protocol (MCP): https://modelcontextprotocol.io/
- FastMCP: https://gofastmcp.com/
- Anthropic Claude: https://www.anthropic.com/claude
- Intel Extension for PyTorch: https://github.com/intel/intel-extension-for-pytorch
- ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
- DuckDuckGo Search Python Library: https://github.com/deedy5/duckduckgo_search
- Moondream2 Vision Model: https://huggingface.co/vikhyatk/moondream2
- Qwen2.5-3B-Instruct Model: https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
Related resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads