Boosting LLM Chat Performance: Intel® Data Center Flex GPUs & SW Optimizations for RAG (Part 2 of 2)

In Part 1 of this blog series, we introduced Retrieval Augmented Generation (RAG) and its components.

In this part, we look at the LLM platforms used for the RAG deployment and at the Intel Data Center GPU Flex Series used to run Twixor's chat function.

RAG Solution Components:

Haystack Platform for LLMs

Haystack is an open-source Python framework developed by Deepset for building custom applications with large language models (LLMs). It is designed to facilitate the development of state-of-the-art natural language processing (NLP) systems by providing a comprehensive set of tools and components. Below is an overview of the Haystack framework, focusing on its core components, Retrieval Augmented Generation (RAG), and the InMemoryDocumentStore.


The InMemoryDocumentStore is a lightweight document store that stores documents in memory. It is suitable for experimentation and small-scale applications but is not recommended for production workloads due to its limitations, such as the inability to persist data and the need to scan all documents for each query.

Sentence Transformers

The Sentence Transformers library is a Natural Language Processing (NLP) library that specializes in generating high-quality, semantically rich embeddings for sentences, paragraphs, and images. These embeddings can be used in a wide range of NLP applications, and the library's ease of use, extensive set of pre-trained models, and support for fine-tuning make it a valuable tool for researchers and developers alike.

RAG Implementation:

Twixor sought to improve the accuracy of its chat responses by leveraging Retrieval Augmented Generation (RAG). The use case was to index and vectorize the customer's knowledge base so that it could be queried through RAG.

The Haystack framework was used to deploy RAG. Customer knowledge base documents in PDF format were vectorized using the Sentence Transformers library.

The Sentence Transformers library[i] is a popular Python library used for generating dense vector representations (embeddings) of text data like sentences, paragraphs, and documents. Here's how you can use it to vectorize text data:

Loading a Pre-trained Model

First, you need to load a pre-trained Sentence Transformer model. There are many models available, pre-trained on different datasets for various tasks. You can load a model like this:


Figure 1: Importing Sentence Transformer

Replace `'model_name'` with the name of the pre-trained model you want to use. Some popular models are `'all-MiniLM-L6-v2'`, `'paraphrase-MiniLM-L6-v2'`, and `'multi-qa-MiniLM-L6-cos-v1'`.

Generating Embeddings

Once you have loaded the model, you can generate embeddings (vectors) for your text data by passing a list of strings to the `model.encode()` method:


Figure 2: Generate embeddings from the text data

The `embeddings` variable will contain an array with the dense vector representations of your sentences. The number of dimensions depends on the specific model used.
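The loading and encoding steps described above can be sketched together as follows. This is a minimal illustration, assuming the `sentence-transformers` package is installed, and not necessarily the code used in the Twixor deployment. The helper name `embed_sentences` and its optional `model` argument are hypothetical conveniences; the default model is one of the names listed earlier.

```python
def embed_sentences(sentences, model=None, model_name="all-MiniLM-L6-v2"):
    """Return one dense vector per input sentence.

    `model` may be any object exposing `encode(list_of_str)`; when it is
    omitted, a pre-trained Sentence Transformers model is loaded by name.
    """
    if model is None:
        # Imported here so the helper can also be exercised with a stand-in model.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer(model_name)  # downloads weights on first use
    return model.encode(sentences)
```

Calling `embed_sentences(["How do I reset my password?", "Where can I find my invoice?"])` would return one vector per sentence; for `all-MiniLM-L6-v2`, each vector has 384 dimensions.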

Using Embeddings

These embeddings can then be used for various natural language processing tasks like:

  • Semantic Search: Calculate cosine similarity between the query embedding and document embeddings to find the most relevant documents.
  • Clustering: Group similar sentences/documents by clustering their embeddings.
  • Deduplication: Remove near-duplicate texts by comparing their embeddings.
  • Classification: Train a classifier on the embeddings for text classification tasks.

For example, to find the similarity between two sentences:


Figure 3: Finding Similarity between two sentences

This will print the cosine similarity score between the two sentence embeddings.
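As a reference for what that score means, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below computes it in plain Python on toy vectors; the function name is illustrative, and in practice the library's own `util.cos_sim` helper can be applied to the embeddings directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction: score near 1
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: score of 0
```

Scores close to 1 indicate that the two sentences point in nearly the same direction in embedding space, which is the signal semantic search relies on.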

The Sentence Transformers library makes it easy to leverage state-of-the-art models for generating high-quality text embeddings, which can then be used for various NLP applications. Custom templates were then used for generating responses. The RAG deployment retrieves the top 3 most relevant documents and uses them to answer queries.
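The retrieve-then-prompt step described above can be sketched as follows. This is a simplified illustration in plain Python, not the Haystack pipeline used in the deployment: the function names, the toy vectors, and the prompt wording are all assumptions made for the example.

```python
import heapq
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_documents(query_vec, docs, k=3):
    """Return the texts of the k documents most similar to the query.

    `docs` is a list of (text, embedding) pairs.
    """
    ranked = heapq.nlargest(k, docs, key=lambda d: _cosine(query_vec, d[1]))
    return [text for text, _ in ranked]


def build_prompt(question, contexts):
    """Fill a simple, hypothetical prompt template with the retrieved passages."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\nAnswer:"
    )


# Toy 2-dimensional embeddings standing in for real sentence embeddings.
docs = [
    ("Refunds are processed within 5 days.", [0.9, 0.1]),
    ("Our office is closed on Sundays.", [0.1, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3]),
    ("Shipping is free above a minimum order value.", [0.5, 0.5]),
]
query_vec = [1.0, 0.0]  # pretend embedding of "How do refunds work?"
contexts = top_k_documents(query_vec, docs, k=3)
print(build_prompt("How do refunds work?", contexts))
```

Only the three highest-scoring passages reach the prompt, so the least relevant document (the office hours entry in this toy data) is left out of the context the LLM sees.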

Using the InMemoryDocumentStore:

We used the `InMemoryDocumentStore`[ii] in Haystack, which is a lightweight and straightforward document store designed for quick experimentation and development. It does not require any external services or dependencies, making it ideal for testing and small-scale applications.

Key Features

  1. Initialization:

    The `InMemoryDocumentStore` can be easily initialized without any external setup. For example:


Figure 4: Initialization of InMemoryDocumentStore

2. Document Handling:

   Documents are expected in dictionary form and can be written to the store using the `write_documents()` method. For example:


Figure 5: Document Handling with InMemoryDocumentStore

3. Embedding Support:

   The `InMemoryDocumentStore` supports both sparse (e.g., BM25, TF-IDF) and dense (e.g., neural network-based) retrievers. For dense retrievers, embeddings can be updated using the `update_embeddings()` method:


Figure 6: Embedding support with InMemoryDocumentStore

4. Limitations:

   The `InMemoryDocumentStore` is not designed for large-scale or production use due to its in-memory nature, which limits the amount of data it can handle efficiently. It is best suited for development and testing purposes.

5. Saving and Loading:

   While the `InMemoryDocumentStore` does not natively support saving and loading, users have employed workarounds such as pickling the document store. However, this approach has limitations, especially with embeddings:


Figure 7: Workaround for saving and loading

6. Example Usage

Below we show a complete example of initializing an `InMemoryDocumentStore`, writing documents, and updating embeddings that we leveraged in our RAG deployment:


Figure 8: Complete example showing the use of InMemoryDocumentStore with RAG
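The flow of initializing the store, writing documents, updating embeddings, and retrieving the top 3 matches can be sketched as below. It assumes Haystack 1.x (the `farm-haystack` package), where `write_documents()` accepts dictionaries and `update_embeddings()` takes a retriever. The helper names, the metadata fields, and the choice of embedding model are illustrative assumptions rather than the exact code used for Twixor.

```python
def build_documents(passages, source):
    """Turn plain-text passages into the dictionary form the store expects."""
    return [{"content": text, "meta": {"source": source}} for text in passages]


def build_retriever(docs, model_name="sentence-transformers/all-MiniLM-L6-v2",
                    embedding_dim=384):
    """Create an in-memory store, index the documents, and return a retriever."""
    # Haystack 1.x imports; kept inside the function so the other helpers
    # can be used without Haystack installed.
    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes import EmbeddingRetriever

    # embedding_dim must match the model's output size (384 for all-MiniLM-L6-v2).
    store = InMemoryDocumentStore(embedding_dim=embedding_dim, similarity="cosine")
    store.write_documents(docs)

    retriever = EmbeddingRetriever(document_store=store, embedding_model=model_name)
    store.update_embeddings(retriever)  # compute and store one embedding per document
    return retriever


def retrieve_top_3(retriever, query):
    """Return the text of the three most relevant documents for a query."""
    return [doc.content for doc in retriever.retrieve(query=query, top_k=3)]
```

A typical call sequence would be `docs = build_documents(passages, "faq.pdf")`, then `retriever = build_retriever(docs)`, then `retrieve_top_3(retriever, "How do refunds work?")`, where `faq.pdf` is a hypothetical file name.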

In this RAG deployment for Twixor we used `InMemoryDocumentStore` for initial development and testing phases. For more extensive and production-level applications, other document stores like Elasticsearch, FAISS, or Milvus are recommended due to their scalability and advanced features.
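The pickling workaround mentioned under Saving and Loading can be sketched with Python's standard `pickle` module. This is a generic illustration with hypothetical helper names (`save_store`, `load_store`); it works for any picklable object, and the limitations noted earlier, particularly around embeddings, still apply.

```python
import pickle


def save_store(store, path):
    """Serialize a document store (or any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(store, f)


def load_store(path):
    """Restore a previously pickled object."""
    # Only unpickle files you created yourself: unpickling can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)
```

A pickled object is generally only safe to reload with the same library version that wrote it, which is one more reason to move to a persistent document store for production use.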

RAG Results:

We did a qualitative analysis of chat responses before and after RAG was implemented. We found that the Retrieval-Augmented Generation (RAG) implementation on the NeuralChat model significantly enhanced the response quality of Large Language Models (LLMs) through the integration of external, up-to-date, and contextually relevant information into the model's generative process.

Here are key improvements produced by the RAG deployment for Twixor:

  1. The responses were current and showed specific information:
    • LLMs are typically trained on static datasets, which means their knowledge is limited to the information available up to the training cut-off date. The RAG implementation for Twixor addressed this limitation by retrieving the most relevant and current data from external sources and ensuring that the responses were up-to-date and contextually accurate.
  2. Reduction of Hallucinations:
    • Hallucinations occur when LLMs generate plausible-sounding but incorrect or nonsensical information. After deploying RAG, we noticed a reduction in hallucinations and an increase in the factual accuracy of the generated content, which was now grounded in reliable sources.
  3. Enhanced Contextual Relevance:
    • After RAG was deployed, the application retrieved documents and data that were specifically relevant to the user's query. This ensured that the generated responses were not only accurate but also highly relevant to the specific context of the query.
  4. Improved Reliability and Trust:
    • With RAG, the responses included sources and citations for the information used to generate them, which built user trust. This let us verify the accuracy of the responses, which is important for applications requiring high reliability.
  5. Cost-Effectiveness and Efficiency:
    • Implementing RAG by leveraging existing data eliminated the need for extensive retraining of the LLM, making it a cost-effective solution.
  6. Flexibility and Adaptability:
    • We were able to adapt RAG to various domains by integrating domain-specific knowledge bases. This flexibility allows the NeuralChat-derived LLM to perform well in specialized tasks without the need for resource-intensive, domain-specific fine-tuning.
  7. Scalability:
    • Studies have shown that the performance of LLMs improves as more data is made available for retrieval. RAG scales effectively with large datasets, enhancing the quality of responses even with massive amounts of data.

We found that RAG enhances the response quality of NeuralChat LLMs by providing access to current, relevant, and accurate information, reducing hallucinations, and improving contextual relevance and reliability. RAG is a powerful tool for deploying LLMs in dynamic and knowledge-intensive environments.

Neural Chat with Intel® Data Center GPU Flex Series 140:

In the second phase of this project, we used the Intel Data Center GPU Flex Series 140 for inference of the NeuralChat LLM for use by Twixor. The goal was to compare the latency of a GPU that can be deployed at the edge with that of an Intel® Xeon® CPU with Intel AMX for a customer service chat application.

Intel Data Center GPU Flex Series 140

The Intel Data Center GPU Flex Series 140 is designed to accelerate AI visual inference workloads in the data center. Here are some key points about its AI inference capabilities:

AI Inference Performance

  • The Intel® Data Center Flex 140 GPU contains two DG2-128 GPUs, each with 1024 cores, providing a total of 2048 cores for parallel processing of AI inference workloads.
  • It supports popular AI frameworks and libraries like TensorFlow, PyTorch, and Intel's OpenVINO toolkit with minimal code changes required to run inference on the GPU.

Hardware Acceleration

  • The Intel® Data Center Flex 140 GPUs have dedicated hardware acceleration for AI workloads like matrix multiplication and convolution operations.
  • They include 8 ray tracing cores per GPU to accelerate ray tracing for visual inference tasks.
  • The GPUs support key AI precisions such as INT8 and BF16 for efficient AI compute.



Figure 9: Intel Data Center GPU Flex Series 140

Open Software Stack

  • The Flex GPUs leverage Intel's oneAPI programming model, providing an open and standards-based software stack.
  • This allows developers to build cross-architecture AI solutions portable across CPUs, GPUs, and other accelerators.
  • Intel provides optimized libraries like oneDNN, OpenVINO, and Intel AI Analytics Toolkit for efficient AI deployment.

Chat Solution NeuralChat LLM with Intel® Data Center GPU Flex Series 140:

In a previous case study and blog series, we showed how Intel worked with its customer Twixor to choose and optimize an LLM with Intel AMX for their chat implementation. This solution extends that earlier case study: Twixor required the chat application to be available at the edge, where the CPUs might not have Intel AMX capabilities. Using the Intel Data Center GPU Flex Series and Intel software optimizations, the solution can potentially be deployed at the edge. Twixor started the journey by looking at pre-existing open-source LLMs from the Hugging Face[iii] AI community as a starting point for chat applications. The Intel-optimized NeuralChat model used in the original work was used again as the LLM, this time together with RAG and the Intel Data Center GPU Flex Series 140.

NeuralChat-7B Testing on Intel GPUs

The Intel Flex Series GPU 140 was installed in a PCIe slot of a generic x86 server representing the edge use case. The slot was a standard one, and the extra power drawn by the GPU card was within the limits of what an edge server supports. VMware virtualization was used to create a virtual machine with 4 vCPUs and 16 GB of RAM, and the two GPUs on the Intel Flex Series GPU 140 card were passed through to the virtual machine.

To adapt the solution to run on Intel GPUs, we leveraged the IPEX-LLM software library. IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex, and Max) with very low latency. This component replaced the IPEX and ITREX libraries used in the CPU testing. We successfully tested it on the Intel Flex Series GPU 140 with IPEX-LLM by using deepspeed-AutoTP.
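For orientation, loading a model through IPEX-LLM typically looks like the sketch below. It assumes the `ipex-llm` package with Intel GPU (XPU) support and the Hugging Face `transformers` package are installed; `model_path` is a placeholder for a local directory or Hugging Face model ID, and the helper name is hypothetical. It shows single-device loading with 4-bit weights only and does not cover the deepspeed-AutoTP setup used in the testing.

```python
def load_int4_model_on_xpu(model_path):
    """Load a causal LM with 4-bit weights and move it to an Intel GPU."""
    # Imported inside the function because these packages need an Intel GPU
    # software stack that is not available on every machine.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_4bit=True,  # quantize the weights to INT4 while loading
    )
    model = model.to("xpu")  # "xpu" is the PyTorch device name for Intel GPUs
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer
```

Input tensors produced by the tokenizer also need to be moved to the `xpu` device before calling `model.generate(...)`.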

NeuralChat-7B with INT4 quantization was then run on the GPUs with the same scripts and evaluated for LLM latency over 90 tokens. By leveraging the Intel optimizations for PyTorch and Transformers for these GPUs, we achieved results comparable to those of Xeon with Intel AMX, as shown in the table below. Note: Intel Flex Series GPU 140 inference is on par with a 24-core Sapphire Rapids CPU.

Neuralchat-7B INT4

Model: Neuralchat-7B
Data type: INT4

| Server Config | Inference latency (in secs) |
|---|---|
| 48 core CPU, 240 GB RAM, Intel SPR | |
| 24 core CPU, 240 GB RAM, Intel SPR | |
| 12 core CPU, 240 GB RAM, Intel SPR | |
| Intel Flex Series GPU 140+ | |


Table 1: Latency comparison across configurations for first 90 tokens with NeuralChat

The transcript below shows some of the metrics captured during a GPU LLM Inference run. The latency measurement was 3.209 seconds for the first 90 tokens, which is well within Twixor’s requirement of less than 6 seconds.


Figure 10: Metrics seen during a NeuralChat run with Intel Data Center GPU Flex Series 140
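A latency check of this kind can be sketched with a wall-clock timer around the generation call. The sketch below is an illustration only: `measure_latency` and `fake_generate` are hypothetical names, the stand-in generator replaces the real model call, and the 90-token length and 6-second budget are the figures quoted above.

```python
import time


def measure_latency(generate, prompt, max_new_tokens=90, budget_s=6.0):
    """Time one generation call and compare it with a latency budget."""
    start = time.perf_counter()
    output = generate(prompt, max_new_tokens)
    # With a real GPU model, make sure all device work has finished
    # before reading the clock, or the measurement will be too short.
    elapsed = time.perf_counter() - start
    return output, elapsed, elapsed < budget_s


def fake_generate(prompt, max_new_tokens):
    # Stand-in for a real model call so the sketch runs anywhere.
    return " ".join(["token"] * max_new_tokens)


_, seconds, within_budget = measure_latency(fake_generate, "Hello")
print(f"{seconds:.3f} s for 90 tokens, within budget: {within_budget}")
```

Swapping `fake_generate` for a function that wraps the real model gives a simple pass or fail check against the response-time requirement.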


The Retrieval Augmented Generation (RAG) technique has proven to be a powerful approach for enhancing the capabilities of large language models in natural language processing applications. By integrating external knowledge sources into the text generation process, RAG addresses limitations such as outdated information, hallucinations, and lack of domain-specific knowledge. The implementation of RAG for Twixor, leveraging the Haystack framework and Intel's optimizations, has significantly improved the accuracy, relevance, and reliability of the NeuralChat LLM-based chat application. Furthermore, the deployment of the NeuralChat model on Intel® Data Center Flex GPU 140 has demonstrated comparable inference latency to CPU-based solutions, making it a viable option for edge deployments. Overall, the combination of RAG and Intel's hardware and software optimizations has unlocked new possibilities for deploying LLMs in dynamic and knowledge-intensive environments, paving the way for more accurate, contextually relevant, and efficient natural language processing applications.


[i] Sentence Transformer Library for vectorizing documents

[ii] In Memory Document Store in HayStack


Performance varies by use, configuration and other factors. Learn more at Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

The analysis in this document was done by VLSS and commissioned by Intel. 


We are profoundly grateful for the Intel AI Customer Engineering Team led by Anish Kumar and the sustained contributions of his team members Vasudha Kumari and Vishnu Madhu for their guidance and engagement with Twixor. We would also like to thank AAUM Analytics for working with the Intel team on behalf of Twixor on this solution.

About the Author
Mohan Potheri is a Cloud Solutions Architect with more than 20 years in IT infrastructure, with in-depth experience in cloud architecture. He currently focuses on educating customers and partners on Intel capabilities and optimizations available on Amazon AWS. He is actively engaged with the Intel and AWS Partner communities to develop compelling solutions with Intel and AWS. He is a VMware vExpert (VCDX#98) with extensive knowledge of on-premises and hybrid cloud. He also has extensive experience with business-critical applications such as SAP, Oracle, SQL and Java across UNIX, Linux and Windows environments. Mohan Potheri is an expert on AI/ML and HPC and has been a speaker at multiple conferences such as VMWorld, GTC, ISC and other Partner events.