
Monitoring and Debugging RAG Systems in Production

IzabellaRaulin
Employee

Retrieval-Augmented Generation (RAG) is a powerful method for building AI systems that provide accurate, context-aware responses by leveraging a specific knowledge base. A RAG pipeline includes components like document ingestion, chunking, retrieval, ranking, prompt construction, and answer generation using a large language model (LLM). Each part influences the system’s overall performance, so proper functioning across all components is essential.
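Conceptually, the flow can be sketched in a few lines of Python. The snippet below is a deliberately simplified, self-contained illustration of those stages, with bag-of-words "embeddings" standing in for a real embedding model and vector database; it is not the Intel® AI for Enterprise RAG implementation.

```python
from collections import Counter
import math
import re

def embed(text: str) -> Counter:
    # Stand-in "embedding": bag-of-words term counts (a real pipeline uses a
    # neural embedding model and a vector database).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Retrieval: score every chunk against the query and keep the top k.
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

chunks = [
    "Intel AMX provides instructions for matrix multiplication on Xeon CPUs.",
    "Chunking splits ingested documents into fragments stored in a vector database.",
]
context = "\n".join(retrieve("What is AMX?", chunks, k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is AMX?"
# In a full pipeline, `prompt` is sent to the LLM serving layer for generation.
print(prompt)
```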

To ensure the RAG system effectively supports users, it's important to monitor both response times and response relevance. This is where debugging and observability become mission-critical. This article highlights how Intel® AI for Enterprise RAG enables such oversight through tools for tracing data flow, performance telemetry, and log analysis.

I'm happy to announce that Intel® AI for Enterprise RAG version 1.5.0 has just been released.

 

RAG Response Quality in Production

Deployment can be accompanied by accuracy evaluation, for example, to help select the models used in the pipeline. However, such evaluations on a mock dataset with synthetic questions provide only a preliminary indication. To truly determine whether your RAG system is well-functioning and effectively assisting end-users, it must be assessed in the context of its actual use case, using real production documents and user queries.

Intel® AI for Enterprise RAG features a powerful debug mode, accessible from the Admin Panel by adding ?debug=true to the URL. This mode provides full visibility into how the system processes documents and retrieves knowledge, which is helpful for understanding the RAG system’s current “thought path.” It also enables safe experimentation with pipeline parameters without affecting the production environment, showing only how results might differ.

Below are examples of how to use debug mode and how it can be helpful in different aspects of working with the RAG pipeline.

 

Controlling Text Extraction and Chunking Quality

Let’s demonstrate the capabilities of Intel® AI for Enterprise RAG with a simple example: a RAG assistant designed to answer questions about Intel technologies using a set of uploaded documents. As shown in Figure 1, the Admin Panel under the Data Ingestion tab displays all uploaded documents. All documents used in this example are listed in the Resources section at the end, so anyone can replicate the experiment.

Figure_01_ERAG_Admin_Panel_View.png

 Figure 1. View of the Intel® AI for Enterprise RAG Admin Panel → Data Ingestion in debug mode.

 

Clicking the "Extract Text" button reveals how the uploaded document is extracted, normalized, and chunked. The “Use Parameters” option lets you safely experiment with chunk sizes, overlaps, and other settings without impacting production. Changes here only affect the debug view; they don’t alter how chunks are stored in the production environment.

 

Figure_02_ERAG_Extracted_Text_From_PDF_File.png

 

Figure 2. Reviewing extracted text from a PDF document.
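To make the effect of these parameters concrete, the toy chunker below shows how chunk_size and overlap change the fragments produced from a piece of text. The parameter names mirror the ideas exposed in the debug view, not the exact Enterprise RAG ingestion settings.

```python
def chunk(text: str, chunk_size: int = 120, overlap: int = 30) -> list[str]:
    # Fixed-size sliding window: each chunk repeats the last `overlap`
    # characters of the previous one, so sentences are less likely to be cut
    # in a way that loses context.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = (
    "Intel AMX accelerates matrix multiplication on 4th Gen Xeon processors. "
    "It introduces tile registers and tile matrix multiply instructions. "
) * 5

for size, ov in [(120, 0), (120, 30), (400, 80)]:
    pieces = chunk(sample, size, ov)
    print(f"chunk_size={size:>3} overlap={ov:>2} -> {len(pieces)} chunks")
```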

 

Inspecting Extracted Text from Images

One of the more interesting challenges in RAG systems is extracting knowledge from image files such as graphs and diagrams. Inspecting what information has been captured from visual content and incorporated into the RAG pipeline ensures the data is accurate, complete, and ready to support reliable answers.

 

Figure_03_ERAG_Extracted_Text_From_Image_File.png

 

Figure 3. Reviewing extracted text from an image file.

Previewing ingested content is essential for controlling and monitoring how input data is processed, and for gaining insight into how the system interprets and handles different types of content.

 

Inspecting the RAG System’s Reasoning Path

In a typical interaction with Intel® AI for Enterprise RAG Chat Q&A, a user asks a question and receives an answer accompanied by references to the source documents. This approach works well for correct answers, as users can verify information against the original sources. However, challenges emerge with unanswered questions or incorrect responses, where it may be unclear why certain documents were selected, potentially leading to wrong or irrelevant results. Addressing these cases requires deeper inspection of the document retrieval and ranking process.

 

Figure_04_ERAG_Chat_Interface.png

Figure 4. View of the chat interface, illustrating the structure of a conversation. In this example, the response is accurate and supported by sources from the local knowledge base.

 

Monitoring RAG Pipeline Components

The Control Plane tab in Intel® AI for Enterprise RAG provides a clear view of the pipeline components that make up the Chat Q&A system. For a detailed overview of each component and its configuration options, see the Admin Panel documentation.

The reasoning path begins with the retriever processing the input query to identify relevant documents from the knowledge base. Inspecting the payloads and data flow for specific queries helps troubleshoot unanswered or unsatisfactory responses. To access this functionality, click the Retriever component to open its corresponding tab, and then select the Debug option in the top-right corner.

 

Figure_05_ERAG_Control_Plane.png

Figure 5. View of the Intel® AI for Enterprise RAG Admin Panel → Control Plane in debug mode. This view shows all pipeline components along with their statuses - green for operational and red for blocked. In debug mode, selecting the Retriever opens a right-side menu that provides access to the Debug interface, enabling interaction with the retriever and inspection of its data payload.

 

Inspecting Retriever

Opening Retriever → "Debug" launches a window to test how queries are processed in the Enterprise RAG pipeline. Let’s enter an example query, ‘What is AMX?’, to observe how the retriever selects relevant documents.

 

Figure_06_ERAG_Retriever_Debug_View.png

 Figure 6. Retriever Debug allows entering a query to follow the retrieval path and inspect which documents are selected.

When you submit a query, the system converts it into a structured payload containing pipeline settings such as search_type, k, and other parameters. The retriever’s output is shown in Figure 7, where fragments identified as relevant to the query are listed under retrieved_docs.

Figure_07_ERAG_Retriever_Debug_Output.png

Figure 7. Retriever Debug output displaying fragments selected as relevant for the provided query under retrieved_docs. Note: With the “Enable Reranker” checkbox off, only the retriever’s behavior is inspected, and all reranker-specific options (top_n, rerank_score_threshold) are ignored.
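As a rough illustration, the snippet below sends such a payload to a retriever endpoint and prints the returned fragments. The field names search_type, k, and retrieved_docs are the ones discussed above; the endpoint URL, the text field, and the rest of the schema are assumptions made for the example rather than the documented Enterprise RAG API.

```python
import json
import urllib.request

payload = {
    "text": "What is AMX?",       # the query typed into the Debug window
    "search_type": "similarity",  # retrieval strategy
    "k": 4,                       # number of fragments to return
}

req = urllib.request.Request(
    "http://localhost:8080/v1/retriever",   # hypothetical in-cluster endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

for doc in result.get("retrieved_docs", []):
    print(doc.get("text", "")[:100])
```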

The Retriever Debug environment provides a safe sandbox for experimentation, allowing testing of different retrieval strategies and modification of k values without affecting the production system. Additionally, inspection can be extended to include the reranking stage.

 

Inspecting Reranking

Simply toggle “Enable Reranker” to include the reranking process of retrieved documents in the inspection. This activates parameters such as top_n and rerank_score_threshold, allowing review of documents after the reranking stage. The response includes documents under reranked_docs along with their corresponding reranker_score, which ranges from 0 to 1. The reranker identifies the top n documents with the highest reranker_score, prioritizing content deemed most relevant and likely to contribute to an accurate answer.

Figure_08_ERAG_Retriever_Debug_Output_With_Rerank_Enabled.png

Figure 8. Retriever Debug Output with Reranker Enabled. Post-ranking documents shown under reranked_docs with corresponding reranker scores.
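The filtering logic behind these two parameters can be sketched as follows; the scores below are invented purely to show how top_n and rerank_score_threshold interact.

```python
# Example reranker output with made-up scores.
reranked_docs = [
    {"text": "Intel AMX (Advanced Matrix Extensions) overview ...", "reranker_score": 0.93},
    {"text": "AVX-512 instruction set details ...",                 "reranker_score": 0.41},
    {"text": "Unrelated HR policy fragment ...",                    "reranker_score": 0.07},
]

top_n = 2
rerank_score_threshold = 0.20

# Keep documents scoring above the threshold, then take the top_n best ones.
selected = [d for d in sorted(reranked_docs, key=lambda d: d["reranker_score"], reverse=True)
            if d["reranker_score"] >= rerank_score_threshold][:top_n]

for d in selected:
    print(f"{d['reranker_score']:.2f}  {d['text']}")
```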

 

Inspecting Retriever Behavior under Role-Based Access Control

Intel® AI for Enterprise RAG provides Role-Based Access Control (RBAC), so system responses can vary depending on the user profile, which may have access only to specific documents. It is possible to simulate a user’s interaction with the RAG pipeline by restricting the search_by parameter to only the bucket or list of buckets that the user has access to. For example, testing with an hr-department bucket ensures the pipeline behaves as if accessed by an HR user, retrieving only authorized content.

Figure_09_ERAG_Retriever_Debug_Output_With_Search_By.png

Figure 9. Retriever Debug View with retrieval restricted to the hr-department bucket.
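In payload terms, this simulation amounts to adding a bucket restriction to the same debug request shown earlier; the hr-department bucket name comes from the example above, while the exact field layout remains an assumption.

```python
payload = {
    "text": "How many vacation days do employees get?",
    "search_type": "similarity",
    "k": 4,
    "search_by": ["hr-department"],  # restrict retrieval to buckets this role can access
}
# Submitted as in the earlier retriever snippet, this should return only
# fragments originating from documents stored in the hr-department bucket,
# mirroring what an HR user would see through RBAC.
```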

 

Key Insights from Tracing the RAG System’s Reasoning Path

Using this approach, it is possible to identify several critical aspects of the RAG pipeline, summarized below:

Content gaps
  • Description: The retriever or reranker fails to find relevant information in the vector database.
  • Potential issue: Important information may be missing, leading to incomplete or incorrect answers.
  • Recommended actions: Review and enrich the vector database; adjust retriever parameters (k, search_type) to improve coverage.

Conflicting information
  • Description: Retrieved documents contain contradictory content.
  • Potential issue: Could cause inconsistent or misleading responses in the LLM output.
  • Recommended actions: Identify and resolve conflicts in source documents; delete deprecated documents; review user profiles (RBAC access to sources).

Chunk quality issues
  • Description: How the text is chunked and which chunks are prepared for inclusion in the prompt.
  • Potential issue: Chunks that are too small may split relevant information across fragments so it is missed; chunks that are too large reduce retrieval precision and make it harder for the LLM to focus on the important content.
  • Recommended actions: Adjust chunk size and overlap parameters; verify chunking on sample documents; refine the chunking strategy.

Reranker evaluation
  • Description: How the reranker scores retrieved documents and determines their relevance.
  • Potential issue: Low-quality scoring may prioritize irrelevant documents or ignore high-value content.
  • Recommended actions: Tune reranker hyperparameters (e.g., top_n, score_threshold) and monitor its outputs; consider a more effective model for the reranking process.

 

Telemetry and Observability

One of the core strengths of Intel® AI for Enterprise RAG is its integrated monitoring and observability stack. While the debug features explained in previous sections enable deep inspection of individual queries and pipeline components, telemetry provides a broader, continuous view of how the entire RAG system behaves in production.

Intel® AI for Enterprise RAG Telemetry is built on industry-standard open-source components: Prometheus, Loki, Tempo, and OpenTelemetry Collectors provide a unified observability stack for metrics, logs, and traces. Collected metrics are visualized in Grafana dashboards, enabling administrators to continuously monitor system performance and health.

The dashboards are tailored to display metrics and insights specific to the Intel® AI for Enterprise RAG solution and its associated services. Below, only selected dashboards are presented. For full details, please visit the Intel® AI for Enterprise RAG Telemetry Documentation.

One of the dashboards continuously tracks Horizontal Pod Autoscaler (HPA) activity, highlighting how the Intel® AI for Enterprise RAG system responds to workload fluctuations and makes scaling decisions. To learn more about the HPA mechanism, please refer to the article "Deploying Scalable Enterprise RAG on Kubernetes".

Figure_10_ERAG_Telemetry_HPA_Dashboard.png

Figure 10. Grafana Dashboard for Horizontal Pod Autoscaler, showing real-time scaling activity, replica counts, and metrics used for automatic resource adjustment. Note: Values shown on the dashboard are illustrative for this deployment and are not indicative of any performance benchmarks.

Additionally, there are dashboards displaying metrics for specific components in the RAG pipeline. For example, one such dashboard focuses on vLLM, the key component responsible for generating responses to user queries. It enables administrators to monitor how vLLM handles incoming requests, providing insight into token throughput, time per output token, time to first token, and other performance metrics. These metrics reveal how the selected Large Language Model (LLM) performs depending on the number of deployed replicas, the underlying infrastructure, and configuration settings. An extended time to first token, or spikes during high-traffic periods, may indicate the need for deployment adjustments to ensure users receive timely responses and a smooth chat experience.

Figure_11_ERAG_Telemetry_VLLM_Dashboard.png

Figure 11. Grafana Dashboard for vLLM, showing key performance indicators for this component in the RAG pipeline. Note: Values shown on the dashboard are illustrative for this deployment and are not indicative of any performance benchmarks.
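Because the stack exposes these metrics through Prometheus, they can also be queried outside Grafana, for example to feed alerts or reports. The sketch below computes an average time to first token over the last five minutes; the Prometheus address and the exact metric name depend on your deployment and vLLM version, so treat both as assumptions to adapt.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"   # hypothetical in-cluster Prometheus address

# Average time to first token over the last 5 minutes, derived from the
# histogram's _sum and _count series (metric name may differ per vLLM version).
query = (
    "rate(vllm:time_to_first_token_seconds_sum[5m])"
    " / rate(vllm:time_to_first_token_seconds_count[5m])"
)

url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    pod = series["metric"].get("pod", "unknown")
    print(f"{pod}: avg TTFT = {float(series['value'][1]):.3f}s")
```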

 

Centralized Logging

Logs are essential for monitoring and troubleshooting RAG pipelines, as they provide fine-grained visibility into the behavior of each service and its interactions. By combining centralized log storage with Grafana’s exploration tools, Intel® AI for Enterprise RAG ensures that operators have a clear, comprehensive view of system behavior, improving reliability and maintainability.

Figure_12_ERAG_Explore_Logs.png

Figure 12. Centralized Logging in Intel® AI for Enterprise RAG. Using Grafana’s Explore → Logs tab, operators can search and filter logs across all sources, making troubleshooting faster and more effective.
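The same logs that Grafana’s Explore view reads can also be pulled programmatically from Loki, which is convenient for scripted troubleshooting. In the sketch below, the Loki address and the namespace label value are deployment-specific assumptions; the query_range API and LogQL filter syntax are standard Loki features.

```python
import json
import time
import urllib.parse
import urllib.request

LOKI = "http://localhost:3100"              # hypothetical Loki endpoint
logql = '{namespace="chatqa"} |= "error"'   # label value is an assumption

params = urllib.parse.urlencode({
    "query": logql,
    "start": int((time.time() - 3600) * 1e9),  # last hour, in nanoseconds
    "end": int(time.time() * 1e9),
    "limit": 100,
})
with urllib.request.urlopen(f"{LOKI}/loki/api/v1/query_range?{params}") as resp:
    data = json.load(resp)

# Each stream groups log lines sharing the same label set.
for stream in data["data"]["result"]:
    for _timestamp, line in stream["values"]:
        print(line)
```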

 

Final Thoughts

Building a RAG pipeline is only half the story. Equally important is the ability to monitor the entire pipeline. Every stage, from the retriever through the reranker and up to the generative model, can significantly influence the final output. Without proper observability, diagnosing errors, understanding performance bottlenecks, and ensuring reliability become extremely difficult. Adjusting any single element can produce cascading effects on the others, making it essential to monitor each stage both independently and as a whole. A well-known truth applies here: “You can’t manage what you don’t measure.”

Intel® AI for Enterprise RAG addresses this with a dual-layer observability approach:

  • Debug mode explains why a single query behaved a certain way.
  • Telemetry reveals how the system performs over time.

This combination empowers administrators to monitor system health, tune hyperparameters, identify content gaps, and proactively address performance or reliability issues.

Acknowledgments

Special thanks to @MichalProstko and Anna Alberska for their contributions.

Resources

Documents uploaded to RAG as files:

Documents uploaded to RAG as links:

 

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data.  You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.