
RAG Chatbot Shows the Way to Simplify AI Solution Deployment


Introduction

As artificial intelligence (AI) rapidly evolves, enterprises seek robust solutions that streamline AI application integration and deployment. Red Hat and Intel work, independently and together, to meet this need with curated, validated, integrated hardware and software components. These components simplify the selection and integration process, ultimately reducing the time to market for new AI-based products. A validated pattern, which allows businesses to automatically deploy a full application stack through a GitOps-based framework, is critical to enabling this simplification.

Now, an intriguing Open Platform for Enterprise AI (OPEA) project—a chatbot called ChatQnA—is demonstrating the value of the Intel and Red Hat approach. The Retrieval Augmented Generation (RAG) ChatQnA demo, deployed on Red Hat® OpenShift® AI and accelerated by Intel® Gaudi® 2 AI accelerators, exemplifies advanced technology tailored for efficient, seamless AI deployment.

Background

Chatbots have transformed the business landscape, revolutionizing customer interactions by automating responses and streamlining routine inquiries. They enhance service efficiency and offer a 24/7 communication channel, thereby improving customer satisfaction and allowing for the collection of valuable insights to refine products and services.

With advances in AI technology, chatbots are becoming increasingly adept at handling complex interactions. They can now offer personalized experiences that boost engagement and loyalty. The implementation of RAG architecture is critical in enabling this chatbot evolution, providing additional context that allows models to respond accurately to company-specific or client-related queries. Intel Gaudi 2 accelerators deliver crucial AI acceleration to manage these demanding workloads effectively.

The new OPEA ChatQnA demo offers just one example of how the Intel and Red Hat collaboration empowers businesses to develop scalable, powerful, and easily deployable (within a GitOps architecture) AI applications. These applications can be key to gaining a competitive advantage. The discussion that follows highlights the elements that make up the new solution.

Building Blocks

Large Language Models (LLMs) and RAG enhance the quality of chatbot communications.

LLMs are advanced AI models trained on massive text datasets to process and generate human-like text. By predicting the next words in a sequence based on the preceding context, LLMs produce coherent and contextually relevant text. Incorporating LLMs allows chatbots to grasp more complex inquiries, deliver more accurate responses, and sustain fluid conversation, addressing the limitations of traditional chatbots, such as robotic responses and limited comprehension of user intent.

RAG is an innovative AI approach that combines the functionalities of a retrieval system and a generative model. It improves response generation by first fetching relevant documents or data, then guiding the LLM to produce more precise and contextually relevant outputs. This method is especially beneficial in chatbot and content generation scenarios, where accuracy and contextual relevance are key.

Combined with the technologies described below, these models form the basis of the new ChatQnA demo.

Technologies

Intel® Gaudi® 2 AI accelerator drives improved deep learning price performance and operational efficiency for training and running state-of-the-art models, from the largest language and multi-modal models to more basic computer vision and NLP models. Designed for efficient scalability—whether in the cloud or your data center—Intel Gaudi 2 accelerators bring the AI industry the choice it needs—now more than ever.(1)

Red Hat OpenShift is a trusted, comprehensive, and consistent container platform designed for developing, modernizing, and deploying applications at scale, including today’s AI-enabled apps. Deliver better experiences faster with a complete set of services for bringing apps to market on your choice of infrastructure.(2)

Red Hat OpenShift AI is a flexible, scalable AI and machine learning (ML) platform that enables enterprises to create and deliver AI-enabled applications at scale across hybrid cloud environments.(3)

The Open Platform for Enterprise AI (OPEA) is an open framework aimed at accelerating the adoption of generative AI in businesses. It offers a set of composable building blocks for creating advanced AI systems, including LLMs and prompt engines, along with architectural blueprints for efficient end-to-end workflows. OPEA's four-step assessment ensures AI solutions are performant, feature-rich, trustworthy, and ready for enterprise use. Designed for efficiency, the platform supports existing infrastructure and is open to avoid vendor lock-in. It is also ubiquitous, scalable, and trusted, running across various environments and emphasizing security and transparency to meet enterprise standards.

ChatQnA shows how combining these technologies with LLM and RAG models on OPEA brings straightforward AI deployment within reach.

ChatQnA Demo – Chatbot Data Flow Explained

The ChatQnA demo clearly illustrates the RAG model's functionality (see the following discussion). In this case, the Intel-designed validated pattern facilitates the straightforward deployment of the ChatQnA demo hosted on OpenShift and OpenShift AI platforms. Intel Gaudi 2 AI accelerators speed performance and enable cost-efficient scalability.

When a document (for example, a PDF) is uploaded to the system as a data source for the chatbot, it must be preprocessed to be used later. This involves extracting relevant data, such as raw text or image descriptions, from the document using techniques tailored to the document format and type. The document is then divided into smaller chunks, each processed with the embedding model to transform the text into a vector representation. These vectors and the corresponding text are stored in a vector database.
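To make the ingestion step concrete, here is a minimal Python sketch of that flow using LangChain with Redis as the vector database. The file name, Redis URL, embedding model, and chunking parameters are illustrative assumptions, not the demo's exact configuration:

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Redis
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract raw text from the uploaded PDF (the loader depends on document type)
docs = PyPDFLoader("example-10k.pdf").load()

# Divide the document into smaller, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

# Transform each chunk into a vector with the embedding model,
# then store the vectors and the corresponding text in Redis
embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
store = Redis.from_documents(
    chunks,
    embedder,
    redis_url="redis://redis-vector-db:6379",  # hypothetical service name
    index_name="rag-redis",
)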

When a user query arrives, it undergoes the same embedding process as the previously described documents. By comparing this query vector to the document vectors in the database, the most similar results are identified and used to construct an extended prompt for the LLM. This enriched prompt provides the model additional context, enabling it to generate more accurate and relevant responses.
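Continuing the sketch above, the query path could look like the following; the prompt template and the TGI endpoint are again illustrative assumptions rather than the demo's exact wiring:

import requests

# Embed the user query and find the most similar document chunks
query = "What were the company's revenues last year?"
matches = store.similarity_search(query, k=4)
context = "\n".join(doc.page_content for doc in matches)

# Construct the extended prompt that grounds the model in retrieved context
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)

# Send the enriched prompt to the TGI endpoint serving the LLM
resp = requests.post(
    "http://tgi-service:8080/generate",  # hypothetical in-cluster endpoint
    json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
)
print(resp.json()["generated_text"])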

[Image 1: RAG chatbot data flow]

This demonstration comprises several components. The user interacts with the UI to submit their request. The backend processes this request, facilitating communication with the database and the model. The model is served using OpenShift AI Serving (KServe). Developers can fine-tune or retrain the model using OpenShift AI tools.
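For reference, the upstream OPEA ChatQnA examples expose this backend as an HTTP megaservice; a request along these lines (host, port, and payload shape are assumptions based on the OPEA GenAIExamples repository) returns an answer grounded in the RAG context:

curl http://${HOST_IP}:8888/v1/chatqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "What is the revenue of Nike in 2023?"}'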

[Image 2: ChatQnA demo components]

Validated Patterns

Validated patterns are frameworks that enable the automatic deployment of the full application stack using GitOps tools. The state of the applications in a validated pattern can be inspected through two ArgoCD instance UIs. These UIs let the user monitor the status of every application, modify selected components, or simply synchronize them with the latest changes in the Git repository. Intel built this community pattern on top of a Red Hat validated pattern.
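One quick way to locate those ArgoCD UIs is to list the GitOps routes on the cluster (route names and namespaces vary by pattern configuration):

oc get routes --all-namespaces | grep -i gitops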

This community pattern sets up all the components that enable Intel Gaudi accelerators and deploys the OPEA RAG chat as a sample solution accelerated by those accelerators. The validated pattern shows how the user can deploy a custom serving runtime (TGI-Gaudi, in this case) on top of Red Hat OpenShift AI to work with the OPEA ChatQnA example. The deployed application lets the user easily load the RAG dataset (stored in a Redis database) through a graphical user interface (GUI), ask the chat questions, and receive answers based on the provided context, a common chatbot usage scenario. By default, the RAG database is initialized with data based on part of an EDGAR Form 10-K filing by Nike. The provided context can be modified through the user interface's “upload” function.
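Deployment itself follows the standard Validated Patterns workflow. A typical flow looks like the sketch below; the repository URL is an assumption based on the pattern's documentation, and prerequisites such as logging in to the cluster still apply:

# Clone the pattern sources and deploy the full stack via GitOps
git clone https://github.com/validatedpatterns/gaudi-rag-chat-qna.git
cd gaudi-rag-chat-qna
./pattern.sh make install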

Because Red Hat OpenShift AI is a GUI-based platform, some steps are not automated in the pattern, so a reference guide is provided instead. The guide outlines key instructions for uploading an LLM model to an S3-like service and customizing the included Serving Runtime to run the desired LLM model on Intel Gaudi cards.
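As a minimal illustration of the upload step (the endpoint, bucket, and model directory below are placeholders), a model can be copied to an S3-compatible service such as MinIO with the AWS CLI:

# Copy the model files to an S3-compatible bucket that OpenShift AI can reference
aws --endpoint-url https://minio.example.com s3 cp \
    ./my-llm-model/ s3://models/my-llm-model/ --recursive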

[Image 3: Validated pattern deployment overview]

All applications that comprise the validated pattern are divided into four groups. The first group is the Intel Gaudi software stack, which is strictly dedicated to enabling Intel Gaudi accelerators on nodes that have them installed. The second group is the OPEA ChatQnA stack, which aims to deploy a working RAG ChatQnA from OPEA. The third group consists only of Vault, which is responsible for storing secure secrets. The fourth group is the MLOps ecosystem stack, which encompasses tools for managing machine learning operations.

ChatQnA is built on a microservices architecture. The validated pattern makes it easy to deploy, reconfigure, and monitor the status of all the included components, which directly helps achieve several of the OPEA Generative AI Components goals: modularity, ease of scaling, and abstraction.

The introduction and “Getting Started” guide for this pattern can be found in the general Validated Patterns documentation.

The pattern’s sources are stored in the GitHub repository.

Intel Gaudi Card Best Practices

The OPEA ChatQnA on Red Hat OpenShift showcases two components that run on Intel Gaudi 2 AI accelerators: the TGI-Gaudi serving runtime, which runs the LLM, and TEI-Gaudi, which handles embedding-related inference on the Intel Gaudi cards.

According to the Habana Gaudi documentation, the Intel Gaudi cards' external network interfaces may require configuration depending on the setup used.

On stand-alone machines with Intel Gaudi 2 cards, the external NICs should be disabled. Skipping this step may result in “nicPort” errors when running workloads that use multiple cards at once.
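The Intel Gaudi software stack includes a helper script for this. A typical invocation, per the Intel Gaudi documentation, looks like the following; the script's location on the node varies by driver release:

# Bring down the external (scale-out) ports on all Intel Gaudi cards
./manage_network_ifs.sh --down

# Confirm the resulting interface status
./manage_network_ifs.sh --status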

Here are additional tips for using Intel Gaudi cards with Red Hat OpenShift:

Low-Level Information About Intel Gaudi Cards

The low-level information about Intel Gaudi cards can be inspected with the “hl-smi” tool. On Red Hat OpenShift, this can be accomplished by accessing a machine with Intel Gaudi accelerators and running a privileged container configured with the Intel Gaudi software. An example of a container image used for this purpose is vault.habana.ai/gaudi-docker/1.15.0/rhel9.2/habanalabs/pytorch-installer-2.2.0:latest. The command should look like this:

oc debug node/NODE_NAME --image=vault.habana.ai/gaudi-docker/1.15.0/rhel9.2/habanalabs/pytorch-installer-2.2.0:latest

Inside the debug container, the user can simply run the command “hl-smi”. If there is something wrong with the Intel Gaudi drivers, an error message similar to the following appears:

habanalabs driver is not loaded or no AIPs available, aborting...

Intel Gaudi 2 Accelerator Instance Availability

The Habana AI Operator exposes Intel Gaudi cards as Kubernetes resources. After installing the operator, the “oc” tool can be used to inspect how many cards are available and how many are assigned to existing workloads. To do this, run:

oc describe node <Intel_Gaudi_node_name>

The resources are named “habana.ai/gaudi”.
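The total card count can also be read directly with a jsonpath query (a minimal sketch):

oc get node <Intel_Gaudi_node_name> -o jsonpath='{.status.capacity.habana\.ai/gaudi}'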

[Image 4: “oc describe node” output, Capacity section]

The “Capacity” section shows how many cards, in total, are detected on the node.

[Image 5: “oc describe node” output, Allocated resources section]

“Allocated resources” shows how many cards are used by existing workloads, following the standard Kubernetes “Requests/Limits” nomenclature.
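To consume a card, a workload requests the resource in its pod spec just as it would CPU or memory. Here is a minimal smoke-test sketch that reuses the container image mentioned earlier; the pod and container names are illustrative:

oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: workload
    image: vault.habana.ai/gaudi-docker/1.15.0/rhel9.2/habanalabs/pytorch-installer-2.2.0:latest
    command: ["hl-smi"]
    resources:
      limits:
        habana.ai/gaudi: 1
EOF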

The discussion above outlines the steps to create your version of a reliable, easy-to-deploy RAG chatbot solution.

Summary

The OPEA RAG ChatQnA, hosted on Red Hat OpenShift or OpenShift AI, streamlined with Intel’s new community pattern, and accelerated by Intel Gaudi 2 cards, demonstrates how to efficiently set up an application stack for a reliable RAG chatbot solution. The application’s modular design allows developers to augment it seamlessly with additional components, such as LLM guardrails, to meet a business’s specific needs. Together, the components offer enterprises an opportunity to gain a competitive edge as AI solutions reshape business processes.

To learn more about the OPEA RAG ChatQnA demo and its components, check out these resources:

https://validatedpatterns.io/patterns/gaudi-rag-chat-qna/

https://opea.dev/

https://github.com/opea-project/GenAIExamples/

 

Special thanks to our contributing authors:

Filip Skirtun – Intel Cloud Systems and Solutions Engineer
Igor Marzynski – Intel Cloud Systems and Solutions Engineer
Mateusz Kalinowski – Intel Systems and Solutions Engineering Manager
Lokendra Uppuluri – Intel Cloud Systems Architect
Piotr Grabuszynski – Intel Cloud Systems Architect

 

(1) https://habana.ai/products/gaudi2/

(2) https://www.redhat.com/en/technologies/cloud-computing/openshift

(3) https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai

 

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

About the Author
Brien Porter brings over 20 years of Information Technology infrastructure experience and leadership. Brien is currently a Solutions Lead for Intel’s Cloud and Enterprise Solutions team. Previously, he consulted for IBM Global Services and was responsible for infrastructure as the Director of Production Engineering at First Republic Bank. Brien also ran IT as Director of Information Technology for SS8 Networks, Inc. and Plumtree Software, Inc. Brien holds a bachelor's degree from California Polytechnic State University, San Luis Obispo.