
Building Efficient Agentic RAG System with SmolAgents and Intel GPU Acceleration


Authors: Sri Raj Aryan Karumuri, Sr. Solutions Engineer at Intel® Liftoff, and Rahul Unnikrishnan Nair, Engineering Lead, Applied AI at Intel® Liftoff.

 

Beyond Lookup: Toward Agentic Retrieval 

Retrieval-Augmented Generation (RAG) has become a popular strategy to ground language models with external knowledge, especially from domain-specific sources like PDFs or websites. Traditional RAG systems, however, often rely on monolithic pipelines—pairing large vector stores with powerful LLMs and simple query-response flows. While effective, these systems can be rigid, computationally expensive, and poorly suited for tasks requiring tool coordination or fallback strategies. 

 

What if RAG could be smarter, lighter, and more adaptable? 

In this blog, we explore how SmolAgents, a lightweight agent framework, combined with hybrid retrieval (FAISS + BM25) and locally accelerated LLMs (powered by Intel GPUs and IPEX), can create modular RAG systems that are efficient, explainable, and context-aware. We demonstrate this through a practical use case: a document-grounded QA system that intelligently chooses between document search and web search. 

 

Overview: 

Retrieval-Augmented Generation (RAG) is a cornerstone of modern intelligent systems. But integrating it into agents that make autonomous tool choices, while keeping compute costs low and answers grounded in local documents, is non-trivial. This blog post showcases how to build a hybrid-retrieval RAG pipeline using:  

  • SmolAgents – a lightweight, open-source agent framework 
  • Qwen2.5-3B – a compact, multilingual instruction-tuned LLM 
  • Intel Extension for PyTorch (IPEX) – for GPU-optimized local inference 
  • FAISS + BM25 – for dual semantic and lexical retrieval 
  • LangChain – to manage document parsing, chunking, and indexing 

These agents can reason about tool usage—first consulting document-grounded retrieval and falling back to web search only when necessary—ensuring both efficiency and trustworthiness. 

 

Semantic Meets Reasoning: How Hybrid Retrieval and Agentic Logic Work Together

 

1. Hybrid Retrieval: FAISS + BM25 for Deeper Document Grounding 

At the core of this system lies a hybrid retrieval strategy that blends semantic similarity with exact keyword matching to extract the most relevant document chunks. 

  • FAISS (Semantic Search): Utilizes dense embeddings from 'Qwen3-Embedding-0.6B' to find semantically aligned passages—even when the exact terms don’t match. This is useful when users ask conceptually phrased questions. 
  • BM25 (Syntactic Search): A robust lexical retrieval algorithm that ranks document chunks based on how often query terms appear and how unique those terms are. Ideal for precision-focused, term-heavy queries. 

Together, these systems ensure high recall (catching meaning) and precision (capturing keywords), giving the LLM both breadth and accuracy of context. 
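
To make the interplay concrete, here is a minimal sketch of how the two result sets can be fused and deduplicated. It previews the same pattern the HybridRetrieverTool in Step 4 implements; the hybrid_search name is illustrative, and faiss_db, bm25, and documents refer to the objects built in Step 3.

# Minimal sketch of hybrid fusion: semantic hits from FAISS plus lexical hits from BM25,
# deduplicated by chunk text (see Step 4 for the full tool implementation).
def hybrid_search(query, faiss_db, bm25, documents, k_semantic=3, k_lexical=5):
    semantic_hits = faiss_db.similarity_search(query, k=k_semantic)   # meaning-based matches
    scores = bm25.get_scores(query.split())                           # keyword-based scores
    top_lexical = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k_lexical]
    lexical_hits = [documents[i] for i in top_lexical]
    # Keep each unique chunk only once so the LLM context is not padded with duplicates
    return list({doc.page_content: doc for doc in semantic_hits + lexical_hits}.values())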

 

2. Agentic Workflow: SmolAgents Make Intelligent Tool Decisions 

Unlike static pipelines that blindly apply one retrieval method, SmolAgents introduce agentic reasoning—allowing the system to choose tools dynamically based on the query context. 

Here's how it works: 

  • The CodeAgent receives a query and evaluates whether it's answerable from the provided document chunks. 
  • It always attempts retrieval first via the 'HybridRetrieverTool'. 
  • If the document context is insufficient or irrelevant, the agent may fall back to web search using 'DuckDuckGo'. 
  • All steps and tool outputs are logged, promoting transparency and explainability in agent decisions. 

This modular, decision-driven approach allows the system to be both resource-efficient and trustworthy—prioritizing local knowledge before relying on external sources. 
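
In code terms, the policy the agent is prompted to follow is roughly the sketch below. Note that is_relevant and generate_answer are illustrative placeholders: in the real system the CodeAgent's LLM makes these judgments itself, guided by the system prompt defined in Step 6.

# Illustrative sketch of the fallback policy (the actual decision is made by the agent's LLM)
def answer(query):
    context = hybrid_retriever_tool.forward(query)    # always consult the local PDF first
    if is_relevant(context, query):                   # placeholder: the LLM judges relevance
        return generate_answer(query, context)        # placeholder: grounded generation
    web_context = search_tool.forward(query)          # fall back to DuckDuckGo only if needed
    return generate_answer(query, web_context)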

 

Technical Architecture 

The architecture of our Agentic RAG system integrates various components working together to ensure efficient retrieval and generation. At a high level, the system comprises the following components: 

1. Document Ingestion & Chunking: 

  • PDF Loader: Uses LangChain’s 'PyPDFLoader' to ingest source documents. 
  • Text Splitter: The 'RecursiveCharacterTextSplitter' efficiently segments content into manageable chunks for embedding. 

2. Hybrid Retrieval: 

  • FAISS (Semantic Search): Utilizes dense embeddings (from the 'Qwen3-Embedding-0.6B' model) to fetch semantically similar content. 
  • BM25 (Syntactic Search): Complements FAISS by scoring chunks based on exact keyword matches. 
  • The results of both retrieval methods are combined and deduplicated before being handed off. 

3. Agentic Decision Making: 

  • SmolAgents Framework: The core agent assesses the query, determines which tools to invoke, and sequences the retrieval process. 
  • The agent’s prompt enforces a tool usage policy—prioritizing the local document retriever over fallback methods (e.g., web search via DuckDuckGo). 

4. Local Inference Engine: 

  • Qwen2.5-3B-Instruct with IPEX: This local LLM, optimized using Intel Extension for PyTorch (IPEX), generates the final answer based on the retrieved context. 

5. Output Delivery: 

  • The pipeline concludes by synthesizing the retrieved data and generating a fluent, grounded answer to the user query. 
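
Putting these components together, the end-to-end flow, using the functions and classes defined in the code breakdown that follows, looks like this:

# High-level flow, using the names defined in the code breakdown below
llm = load_local_llm()                                    # Step 2: Qwen2.5-3B-Instruct on the Intel XPU
vector_db, bm25, documents = db_setup("attention.pdf")    # Steps 1 & 3: ingest, chunk, index
retriever = HybridRetrieverTool(faiss_db=vector_db, bm25=bm25, documents=documents)  # Step 4
agent = CodeAgent(                                        # Step 5: agent with tools + local model
    tools=[retriever, DuckDuckGoSearchTool()],
    model=TransformersModel(model_id="Qwen/Qwen2.5-3B-Instruct", device_map="auto"),
    max_steps=4,
)
print(agent.run("What is the formula for Scaled Dot-Product Attention?"))  # Step 7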

 

Code Breakdown

 

Step 1: Import Necessary Libraries 

The code starts by importing all the essential Python libraries required for building a hybrid-retrieval Agentic RAG system that runs efficiently on Intel GPUs. 

# Standard library imports 
import os 
from pathlib import Path 

# Third-party imports 
import torch 
import intel_extension_for_pytorch as ipex 
import langchain 
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, pipeline 

# Langchain imports 
from langchain.schema import Document 
from langchain.vectorstores import FAISS 
from langchain.text_splitter import RecursiveCharacterTextSplitter 
from langchain_huggingface import HuggingFacePipeline
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader

from rank_bm25 import BM25Okapi
from smolagents import CodeAgent, TransformersModel, Tool, DuckDuckGoSearchTool

import warnings
warnings.filterwarnings("ignore")

 

Step 2: Load a Local LLM with Intel GPU Acceleration 

This function loads the tokenizer and model from HuggingFace’s transformers library using the Qwen/Qwen2.5-3B-Instruct model checkpoint. It performs the following: 

def load_local_llm(): 
    try: 
        model_id = "Qwen/Qwen2.5-3B-Instruct" 
        tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) 
        model = AutoModelForCausalLM.from_pretrained( 
            model_id, 
            trust_remote_code=True, 
            torch_dtype=torch.float16 
        ).to("xpu")
        
        pipe = pipeline( 
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device="xpu",
            max_new_tokens=500,
            temperature=0.1,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

        return HuggingFacePipeline(pipeline=pipe)

    except ValueError as ve:
        print(f"[ValueError] Invalid configuration or parameters: {ve}") 
    except OSError as oe:
        print(f"[OSError] Model or tokenizer files could not be loaded: {oe}")
    except RuntimeError as re:
        print(f"[RuntimeError] Runtime issue during model setup or device transfer: {re}")
    except Exception as e:
        print(f"[Exception] Unexpected error occurred: {e}")

    return None

# Initialize the model
llm = load_local_llm() 
  • Tokenizer Loading: Loads a fast tokenizer suitable for transformer-based generation tasks. 
  • Model Initialization: Loads a causal language model (AutoModelForCausalLM) and pushes it to the xpu device, enabling execution on Intel Max Series GPUs via intel_extension_for_pytorch. 
  • Pipeline Creation: Wraps the model into a text-generation pipeline with the following configuration: 
    • 'max_new_tokens': Controls the maximum output length 
    • 'temperature': A low value (0.1) keeps responses near-deterministic and low-variance 
    • 'repetition_penalty': Discourages loops and token repetition 
  • LangChain Compatibility: The pipeline is returned as a HuggingFacePipeline, making it compatible with LangChain agents and tools. 
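
As an optional sanity check (not part of the original listing), the returned HuggingFacePipeline can be invoked directly to confirm generation works on the XPU before wiring up the agent; the prompt below is illustrative:

# Optional sanity check: exercise the wrapped pipeline directly
if llm is not None:
    print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))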

 

Step 3: Prepare the Document Database with Hybrid Indexing  

This function sets up the knowledge base by loading a PDF file, splitting it into chunks, and indexing it for both semantic and lexical retrieval. It returns components essential for hybrid RAG: a FAISS vector store, a BM25 keyword ranker, and the processed document chunks. 

PDF Loading: 

loader = PyPDFLoader(pdf_path)
docs = loader.load()
if not docs:
    raise ValueError("No documents loaded from the PDF.")
  • Loads the input PDF using LangChain's PyPDFLoader, which extracts raw text from each page. 
  • Validates that the document contains content; raises an error if empty. 

 

Document Chunking: 

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,
    strip_whitespace=True,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""]
)
document_chunks = text_splitter.split_documents(docs)
  • Uses 'RecursiveCharacterTextSplitter' to break long text into overlapping chunks. 
  • Parameters: 
    • chunk_size=512: Maximum characters per chunk. 
    • chunk_overlap=128: Overlap between chunks to preserve context. 
    • strip_whitespace=True: Cleans up leading/trailing spaces. 
    • add_start_index=True: Adds metadata for position tracking. 
  • Wraps each chunk into a LangChain Document object. 
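
The corresponding snippet from the complete listing, which wraps each non-empty chunk into a plain Document and produces the documents list consumed by the FAISS and BM25 steps below:

documents = [
    Document(page_content=chunk.page_content)
    for chunk in document_chunks if chunk.page_content
]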

 

Semantic Embedding & FAISS Indexing: 

embeddings = HuggingFaceEmbeddings( 
    model_name="Qwen/Qwen3-Embedding-0.6B", 
    model_kwargs={'device': 'xpu'} 
) 
vector_db = FAISS.from_documents(documents, embeddings)
  • Uses the 'Qwen/Qwen3-Embedding-0.6B' model to generate dense vector representations of the document chunks. 
  • Stores embeddings in a FAISS index for fast semantic similarity search. 
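
Once built, the vector store can be queried directly; the hybrid tool in Step 4 does exactly this with k=3 (the query string here is illustrative):

# Semantic lookup: returns the 3 chunks whose embeddings are closest to the query
top_chunks = vector_db.similarity_search("How is attention computed?", k=3)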

 
Lexical Scoring with BM25: 

tokenized_documents = [doc.page_content.split() for doc in documents]
bm25 = BM25Okapi(tokenized_documents)
  • Tokenizes document content into word-level terms. 
  • Builds a BM25 index that enables keyword-based ranking. 
  • Effective for matching formula names, exact phrases, or terminology. 
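
For comparison, a BM25 lookup scores every chunk against the raw query terms; the tool in Step 4 later keeps the five highest-scoring chunks (the query string here is illustrative):

# Lexical lookup: score all chunks against the query terms and keep the 5 best
scores = bm25.get_scores("Scaled Dot-Product Attention formula".split())
top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
keyword_chunks = [documents[i] for i in top_idx]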

 

Step 4: Defining the Custom Tools

  • Hybrid Retriever Tool: 

This class defines a custom tool for SmolAgents that combines semantic (FAISS) and lexical (BM25) retrieval to extract relevant chunks from a PDF document. It's designed to be the agent’s primary method for answering document-grounded queries. 

faiss_docs = self.faiss_db.similarity_search(query, k=3)
tokenized_query = query.split()
bm25_scores = self.bm25.get_scores(tokenized_query)
bm25_ranked_docs = sorted(
    [(score, idx) for idx, score in enumerate(bm25_scores)],
    reverse=True
)[:5]
bm25_docs = [self.documents[idx] for _, idx in bm25_ranked_docs]
combined_docs = faiss_docs + bm25_docs
unique_docs = {doc.page_content: doc for doc in combined_docs}.values()
return "\nRetrieved documents:\n" + "".join(
    [
        f"\n\n===== Document {str(i)} =====\n" + doc.page_content
        for i, doc in enumerate(unique_docs)
    ]
)

 

  • Search Tool: 

A 'DuckDuckGoSearchTool' is added to allow the agent to retrieve answers from the web, but only when local PDF-based retrieval doesn’t yield relevant results. 

search_tool = DuckDuckGoSearchTool()
search_tool.description = (
    "Use this tool only if the PDF document does not contain the answer. "
    "Avoid using this tool for document-related questions."
)

 

Step 5: Initializing the Agent with Tools and Local LLM

This step defines the SmolAgents-powered CodeAgent, equipping it with hybrid retrieval and web search capabilities, and connecting it to a locally accelerated LLM for response generation. 

agent = CodeAgent(
    tools=[hybrid_retriever_tool, search_tool],
    model=TransformersModel(model_id="Qwen/Qwen2.5-3B-Instruct", device_map="auto"),
    max_steps=4,
    verbosity_level=3,
)

Tools: 

  • 'hybrid_retriever_tool': Uses FAISS + BM25 to fetch chunks from a local PDF. 
  • 'search_tool': A DuckDuckGo-based fallback for web queries if the document lacks an answer. 

Model: 

  • Uses a 'Qwen2.5-3B-Instruct' LLM wrapped in TransformersModel. 
  • Runs locally and efficiently on Intel hardware via device_map="auto" (i.e., uses xpu with IPEX if available). 

Reasoning Controls: 

  • 'max_steps=4': Limits the number of tool-call iterations to prevent infinite loops. 
  • 'verbosity_level=3': Enables detailed reasoning logs for debugging and traceability. 

 

Step 6: Customizing the System Prompt for Tool-Aware Reasoning

This enhances the agent’s reasoning by injecting tool-specific instructions directly into its system prompt. It ensures that the agent understands what tools are available, how to use them, and when to prefer each. 

 

Tool Descriptions & Prompt Injection: 

tool_descriptions = "\n".join([
    f"- {tool.name}: {tool.description}\n    Takes inputs: {tool.inputs}\n   Returns an output of type: {tool.output_type}"
    for tool in agent.tools.values()
])
  • Dynamically generates a detailed list of all tools (name, description, input/output schema). 
  • Ensures future tools can be added without manually updating the prompt. 

 

Prompt Customization: 

agent.prompt_templates["system_prompt"] += f"""
You are an intelligent assistant that uses the following tools to answer
queries:
{tool_descriptions}

Here is how you should proceed:

1. First, use the 'HybridRetriever_Tool' to find relevant chunks from the PDF document.
2. Read all the retrieved document text carefully.
3. ONLY if this tool does not return anything relevant or is clearly unrelated to the question, use the 'web_search' tool (DuckDuckGoSearchTool).
4. Avoid using web search unless necessary. Your priority is to rely on the retriever output.

Always start with HybridRetriever_Tool. Use web search only as a fallback.
"""
  • Appends procedural logic to the agent's base system prompt. 
  • Emphasizes tool order, discouraging unnecessary web search. 
  • Makes tool usage explainable and auditable for trust and debugging. 

 

Step 7: Running the Agent

Query-1: 

question = "What is the formula for Scaled Dot-Product Attention?"
agent_output = agent.run(question)
print("\nFinal Answer:")
print(agent_output)
  • Query-1: What is the formula for Scaled Dot-Product Attention? 
  • Execution Logs: 

[Screenshots: agent execution logs for Query-1]

  • Output: 

[Screenshot: agent output for Query-1]

[Output Text]: The formula for Scaled Dot-Product Attention is:  

QK^T / (√d_k * √d_v) 

where Q is the query vector, K is the key vector, and V is the value vector. The result is passed through a softmax function to normalize the weights. 

 
Query-2:  

question = "Tell me about Intel Liftoff Days?"
agent_output = agent.run(question)
print("\nFinal Answer:")
print(agent_output)
  • Query-2: Tell me about Intel Liftoff Days? 
  • Execution Logs: 

[Screenshots: agent execution logs for Query-2]

  • Output: 

[Screenshot: agent output for Query-2]

[Output Text]: Intel Liftoff Days is a program that supports and accelerates AI startups. It includes workshops, mentorship, and opportunities for startups to collaborate and refine their AI solutions. 

 

Prerequisites: 

Ensure you have the necessary libraries installed. 

pip install torch intel-extension-for-pytorch langchain langchain-core langchain-community langchain-huggingface transformers faiss-cpu rank_bm25 pypdf smolagents duckduckgo-search

Note: Ensure you install intel-extension-for-pytorch specifically for your hardware/environment if not using Tiber Cloud. 

 

Complete Runnable Code: 

# Standard library imports
import os
from pathlib import Path

# Third-party imports
import torch
import intel_extension_for_pytorch as ipex
import langchain
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, pipeline

# Langchain imports
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFacePipeline
from langchain_huggingface.embeddings import HuggingFaceEmbeddings 
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader

from rank_bm25 import BM25Okapi
from smolagents import CodeAgent, TransformersModel, Tool, DuckDuckGoSearchTool

import warnings
warnings.filterwarnings("ignore")

# Check and set the device
device = torch.device("xpu" if torch.xpu.is_available() else "cpu")

if device.type == "xpu":
    # Empty the XPU cache
    torch.xpu.empty_cache()
    print(f"Using device: {torch.xpu.get_device_name()}")
else:
    print("Using CPU")

def load_local_llm():
    try:
        model_id = "Qwen/Qwen2.5-3B-Instruct"
        tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            trust_remote_code=True,
            torch_dtype=torch.float16
        ).to("xpu")

        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device="xpu",
            max_new_tokens=500,
            temperature=0.1,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

        return HuggingFacePipeline(pipeline=pipe)

    except ValueError as ve:
        print(f"[ValueError] Invalid configuration or parameters: {ve}")
    except OSError as oe:
        print(f"[OSError] Model or tokenizer files could not be loaded:
{oe}")
    except RuntimeError as re:
        print(f"[RuntimeError] Runtime issue during model setup or device transfer: {re}")
    except Exception as e:
        print(f"[Exception] Unexpected error occurred: {e}")
 
    return None

# Initialize the model
llm = load_local_llm() 

def db_setup(pdf_path):
    try:
        # Load the PDF
        loader = PyPDFLoader(pdf_path)
        docs = loader.load()
        if not docs:
            raise ValueError("No documents loaded from the PDF.")

        # Split documents into chunks

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,
            chunk_overlap=128,
            strip_whitespace=True,
            add_start_index=True,
            separators=["\n\n", "\n", ".", " ", ""]
        )
        document_chunks = text_splitter.split_documents(docs)
        documents = [
            Document(page_content=chunk.page_content)
            for chunk in document_chunks if chunk.page_content
        ]
        print(f"✅ Loaded {len(documents)} chunks from PDF.")

        # Embed the documents for FAISS retrieval
        embeddings = HuggingFaceEmbeddings(
            model_name="Qwen/Qwen3-Embedding-0.6B",
            model_kwargs={'device': 'xpu'}
        )
        vector_db = FAISS.from_documents(documents, embeddings)

        # Prepare BM25 documents
        tokenized_documents = [doc.page_content.split() for doc in documents]

        bm25 = BM25Okapi(tokenized_documents)

        return vector_db, bm25, documents

    except FileNotFoundError as fe:
        print(f"[FileNotFoundError] PDF file not found: {fe}")
    except ValueError as ve:
        print(f"[ValueError] {ve}")
    except OSError as oe:
        print(f"[OSError] Error reading PDF or related file system issue: {oe}")
    except Exception as e:
        print(f"[Exception] Unexpected error during DB setup: {e}")

    return None, None, None

class HybridRetrieverTool(Tool):
    name = "HybridRetriever_Tool"
    description = (
        "Use this tool to answer questions based **only on the PDF document**. "
        "It retrieves the most relevant chunks using FAISS and BM25. "
        "**Always try this tool first before using anything else.**"
    )
    inputs = {
        "query": {
            "type": "string",
            "description": "Formulate the query to closely match the semantics of your target documents. Use an affirmative form instead of a question.", 
        }
    }
    output_type = "string"

    def __init__(self, faiss_db, bm25, documents, **kwargs):
        super().__init__(**kwargs)
        self.faiss_db = faiss_db
        self.bm25 = bm25
        self.documents = documents

    def forward(self, query: str) -> str:
        """Execute the hybrid retrieval based on the provided query."""
        assert isinstance(query, str), "The query must be a string."
 
        faiss_docs = self.faiss_db.similarity_search(query, k=3)
        tokenized_query = query.split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_ranked_docs = sorted(
            [(score, idx) for idx, score in enumerate(bm25_scores)],
            reverse=True
        )[:5]

        bm25_docs = [self.documents[idx] for _, idx in bm25_ranked_docs]
        combined_docs = faiss_docs + bm25_docs

        unique_docs = {doc.page_content: doc for doc in combined_docs}.values()

        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(unique_docs)
            ]
        )

vector_db, bm25, documents = db_setup("attention.pdf")
hybrid_retriever_tool = HybridRetrieverTool(faiss_db=vector_db, bm25=bm25, documents=documents)

search_tool = DuckDuckGoSearchTool()
search_tool.description = (
    "Use this tool only if the PDF document does not contain the answer. "
    "Avoid using this tool for document-related questions."
)

# Initialize the agent with both retriever and search tools
agent = CodeAgent(
    tools=[hybrid_retriever_tool, search_tool],
    model=TransformersModel(model_id="Qwen/Qwen2.5-3B-Instruct", device_map="auto"),
    max_steps=4,
    verbosity_level=3,
)
tool_descriptions = "\n".join([
    f"- {tool.name}: {tool.description}\n    Takes inputs: {tool.inputs}\n    Returns an output of type: {tool.output_type}"
    for tool in agent.tools.values()
])

# Append tool descriptions and custom prompt to system_prompt
agent.prompt_templates["system_prompt"] += f"""

You are an intelligent assistant that uses the following tools to answer queries:
{tool_descriptions}

Here is how you should proceed: 
1. First, use the 'HybridRetriever_Tool' to find relevant chunks from the PDF document.
2. Read all the retrieved document text carefully.
3. ONLY if this tool does not return anything relevant or is clearly unrelated to the question, use the 'web_search' tool (DuckDuckGoSearchTool).
4. Avoid using web search unless necessary. Your priority is to rely on the retriever output.

Always start with HybridRetriever_Tool. Use web search only as a fallback.
"""

# Optional
#print(agent.prompt_templates["system_prompt"]) 

# Define the question/query
question = "What is the formula for Scaled Dot-Product Attention?"

# Run the agent with the question
agent_output = agent.run(question)
print("\nFinal Answer:")
print(agent_output)

 

Note on Agent Output Behaviour

 

While the agent is designed to reason effectively over retrieved content and generate accurate answers using local LLMs, it's important to consider the following factors: 

  • Probabilistic Reasoning: 

The underlying language model generates responses based on learned patterns and probabilities. While often accurate, it may occasionally produce hallucinated or imprecise answers—especially if the retrieved content is ambiguous or incomplete. 

  • Dependency on Retrieved Context: 

The agent's quality of response is closely tied to the relevance of the retrieved PDF chunks. If the retrieval tools surface weak or noisy context, the generated answer may reflect those limitations. 

  • Model Training Constraints: 

The Qwen2.5-3B-Instruct model, while powerful and multilingual, is trained on static datasets. It may not reflect the latest knowledge or domain-specific updates unless explicitly included in the PDF or retrieved via web search. 

  • Version & Prompt Variability: 

Even with the same inputs, outputs may differ between runs or across different Qwen versions due to changes in tokenizer behavior, fine-tuning strategies, or generation parameters. 

 

Why this matters

  • Efficiency: Smaller models, accelerated locally, deliver real-time performance with lower compute overhead. 
  • Explainability: Agents transparently select tools and follow a clear reasoning path. 
  • Modularity: Tools like retrieval and web search are decoupled and reusable. 
  • Trustworthiness: Local document retrieval is prioritized, with fallback to web only when needed. 

 

Final Thoughts

 

Agentic RAG flips the paradigm: rather than one big model doing everything, we coordinate small, focused components—retrievers, generators, search tools—under an intelligent protocol. This makes AI more scalable, auditable, and efficient for enterprise and developer use alike. 

Try it yourself on Intel® Tiber™ AI Cloud: You can run this code and explore the performance of the Intel Data Center GPU Max 1100 directly. Intel Tiber™ AI Cloud offers: 

  • Free JupyterLab Environment: Get hands-on access to a Max 1100 GPU for training and experimentation by creating an account at cloud.intel.com and launching a GPU-accelerated notebook from the "Training" section. 

 

References & Resources 

  1. Intel® Extension for PyTorch (IPEX): Accelerate PyTorch models on Intel hardware with optimized operators and memory layout: https://github.com/intel/intel-extension-for-pytorch 
  2. Qwen2.5-3B-Instruct (Hugging Face): A compact, instruction-tuned multilingual model ideal for local LLM tasks: https://huggingface.co/Qwen/Qwen2.5-3B-Instruct 
  3. SmolAgents Framework: Lightweight, open-source agent architecture for tool-augmented language models: https://huggingface.co/blog/smolagents 
  4. Intel® Tiber™ AI Cloud: Intel’s development and deployment platform for building performant AI workloads on Intel-optimized infrastructure:  https://www.intel.com/content/www/us/en/solutions/tiber.html 
  5. Intel® Data Center GPU Max Series: Intel’s most powerful and compact general-purpose discrete GPU, featuring over 100 billion transistors and up to 128 Xe Cores—the essential building blocks of Intel GPU compute: https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html 
  6. 6th Gen Intel® Xeon® Scalable Processors: High-performance server CPUs designed to power modern AI, cloud, and enterprise workloads: https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html 
  7. Intel® Liftoff for AI Startups Program: https://developer.intel.com/liftoff 

About the Author

I'm a proud team member of the Intel® Liftoff for Startups, an innovative, free virtual program dedicated to accelerating the growth of early-stage AI startups.