
Transforming Customer Service: How an Intel Customer Built a Smarter Chatbot (Part 2 of 2)


Twixor, an Intel customer, is a company that focuses on improving customer experience (CX) through automation in the business sector. They offer a platform that combines two key technologies: conversational AI and intelligent process automation. This allows businesses to:

  • Automate customer journeys through chat conversations
  • Provide features like actionable instant messaging for customer service
  • Utilize AI and live agents together to improve support efficiency

Their target customer industries include banking, insurance, healthcare, logistics, and commerce. Products that comprise their enterprise-grade CX Automation Platform include:

  • Twixor Actionable Instant Messaging (AIM) platform for conversational AI-powered customer engagement
  • Twixor CX Automation Platform for comprehensive customer experience management
  • Twixor Developer Platform for building custom conversational solutions.

Let’s take a closer look at two of the primary functional components that make up the Twixor CX Automation platform:

  • Twixor Actionable Instant Messaging (AIM)[i]: This is a low-code, no-code CX automation platform that uses conversational AI and intelligent process automation to allow businesses to automate tasks and fulfill customer requests through messaging channels. It offers features like intent recognition, multilingual support, and rich interactive content.
  • Twixor CX Automation Platform[ii]: This platform uses AI, NLP, and analytics to help businesses build relationships with their customers. It offers features like personalized support and campaign management to drive customer advocacy. It also offers capabilities like omnichannel chat solutions, intelligent process automation, and a low-code/no-code developer platform.

Twixor was working to embed AI-based chatbots into all aspects of its customer experience platform, resulting in a need for a robust AI-based chat engine. Due to the promise shown by LLMs for chat, Twixor looked to leverage open-source LLMs to enhance its chat features.

 

LLM challenges Twixor needed to address for chat:

Running LLMs can be expensive, especially for applications with high user traffic. Scaling LLM-based chat applications is another challenge. As user demand increases, the infrastructure and resources needed to run the LLM become difficult to manage. Maintaining both performance and cost-effectiveness becomes a balancing act.

Twixor sought to address this by using techniques to create smaller, more efficient models and by implementing caching strategies. Twixor started the journey by looking at pre-existing open-source LLMs from the Hugging Face[iii] AI community as a starting point for chat applications. Effectively using LLMs in their chat applications required careful consideration of cost management while ensuring scalability. Because chat is an interactive application, response time and latency were critical characteristics that needed to be addressed.
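As an illustration of the caching idea (not Twixor's actual implementation), a minimal sketch might key a response cache on a normalized prompt hash so that repeated questions skip the LLM call entirely; the `generate_reply` callable below is hypothetical.

```python
import hashlib

# In-memory cache of prompt-hash -> generated reply; a production system might use Redis instead.
_response_cache: dict[str, str] = {}

def _prompt_key(prompt: str) -> str:
    """Normalize and hash the prompt so equivalent questions share a cache entry."""
    return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()

def cached_chat(prompt: str, generate_reply) -> str:
    """Return a cached reply when available; otherwise call the (expensive) LLM once and cache it."""
    key = _prompt_key(prompt)
    if key not in _response_cache:
        _response_cache[key] = generate_reply(prompt)  # only hit the LLM on a cache miss
    return _response_cache[key]
```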

Twixor CX applications run in the data center and at the edge, so they needed a hardware platform that is ubiquitous and cost-effective for their customers. Twixor anticipated that they might need to use GPUs to meet the latency requirements of the application, but realized that would become cost prohibitive while also being incompatible with some of their edge use cases.

 

Twixor’s journey into LLMs:

Twixor’s journey into LLMs and Generative AI began with the goal of creating an efficient Question Answering system using open-source LLMs.

Key technologies used


Table 1: Key Technologies Used

Hugging Face

Hugging Face has significantly contributed to the open-source Large Language Model (LLM) ecosystem, offering a comprehensive suite of models, tools, and frameworks for text generation, model evaluation, and customization. The ecosystem is designed to democratize access to advanced AI capabilities and to encourage community collaboration and innovation.

Haystack LLM framework

They opted for the Haystack framework as their foundation. The Haystack framework is an open-source Python framework designed for building custom applications powered by LLMs. It provides a comprehensive set of tools and components for developing state-of-the-art Natural Language Processing (NLP) systems that leverage LLMs and Transformer models. Haystack is developed by Deepset[iv] and is aimed at making it easier for developers to experiment with the latest models in NLP, offering flexibility and ease of use throughout the development process.
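To give a flavor of what a Haystack-based question-answering pipeline looks like, here is a minimal sketch assuming the Haystack 2.x component API; the prompt template and model choice are illustrative and not Twixor's production pipeline.

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import HuggingFaceLocalGenerator

# Illustrative prompt template; a production pipeline would add retrieval, memory, and guardrails.
template = """Answer the customer's question concisely.
Question: {{ question }}
Answer:"""

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-base",              # any Hugging Face model suited to Q&A works here
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 90},
)

pipeline = Pipeline()
pipeline.add_component("prompt", PromptBuilder(template=template))
pipeline.add_component("llm", generator)
pipeline.connect("prompt", "llm")             # feed the rendered prompt into the generator

result = pipeline.run({"prompt": {"question": "What is the status of my order?"}})
print(result["llm"]["replies"][0])
```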

Intel® Extension for Transformers (ITREX)

The Intel Extension for Transformers[v], commonly referred to as ITREX, is a toolkit designed to optimize and accelerate Transformer-based models on Intel platforms. It is particularly effective on the 4th Gen Intel Xeon Scalable processors, formerly code-named Sapphire Rapids. ITREX is built on top of the Intel Neural Compressor (INC) ecosystem and integrates with Hugging Face's Transformers and Optimum to provide a seamless user experience for model compression and optimization. Key features of ITREX include support for various Transformer-based models such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, and Flan-T5. It also offers end-to-end workflows for tasks like text classification and sentiment analysis.

The toolkit supports INT4 inference on Intel® GPUs, including the Intel® Data Center GPU Max Series (formerly code-named Ponte Vecchio or PVC) and Intel® Arc™ A-Series GPUs. It also provides a customizable chatbot framework supported on Intel® Gaudi® 2 accelerators as well as Intel® CPUs and GPUs, allowing users to create their own chatbot by leveraging a set of plugins. Intel is committed to the open-source ecosystem, especially in AI, and has been working with Hugging Face to develop ITREX.

In summary, ITREX is an innovative toolkit from Intel that provides a range of features and optimizations to enhance the performance of Transformer-based models on Intel hardware, with a focus on accessibility and democratization of AI technologies.
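As a sketch of how ITREX is typically used for low-precision inference (argument names may vary across ITREX releases, so treat this as an assumption rather than the exact code from this engagement), the toolkit exposes a drop-in `AutoModelForCausalLM` that applies INT4 weight-only quantization at load time:

```python
from transformers import AutoTokenizer
# ITREX ships a drop-in replacement for transformers' AutoModelForCausalLM
# that applies weight-only quantization when the model is loaded.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit triggers INT4 weight-only quantization for CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("How do I reset my password?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=90)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```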

Intel® Extension for PyTorch (IPEX)

The Intel Extension for PyTorch is a package that extends the capabilities of PyTorch with performance optimizations specifically tailored for Intel hardware. These optimizations leverage Intel's hardware features such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs.

The extension optimizes performance of PyTorch on Intel CPUs and GPUs, making it easy to achieve GPU acceleration on Intel discrete GPUs through the PyTorch XPU device. It includes optimizations for both CPUs and GPUs, with features such as easy-to-use Python APIs, channels last memory format for convolutional neural networks, Auto Mixed Precision (AMP) with BFloat16 and Float16 data types, and graph optimizations. For CPUs, the extension automatically dispatches operators to the most optimized underlying kernels based on the detected Instruction Set Architecture (ISA).

The extension supports LLMs, which are increasingly popular in Generative AI (GenAI) applications. Starting from version 2.1.0, specific optimizations for certain LLMs have been introduced. INT8 Quantization is another feature provided by the extension, offering built-in quantization recipes to deliver good statistical accuracy for popular deep learning models, particularly in natural language processing and recommendation systems. IPEX is open-source and has been released on GitHub[vi], where users can find the source code and instructions on how to get started.
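A minimal sketch of applying IPEX to a Hugging Face causal LM for BFloat16 CPU inference is shown below; it assumes the standard `ipex.optimize()` entry point and a generic prompt, not the exact configuration used in the tests described later.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator fusions and AMX/AVX-512-aware kernels for BFloat16 inference on CPU.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    inputs = tokenizer("Where is my order?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=90)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```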

 

Testing

LLMs Evaluated

Twixor, in its quest to find the best LLM to meet its customer service chat requirements, evaluated many open-source LLMs. During the evaluation process, Twixor reached out to Intel for help. They were initially inclined to use NVIDIA GPUs because they believed that CPUs could not perform the LLM chat function within their latency requirements. The Intel AI customer engineering team was engaged to help Twixor. The quality of the text generation for these models was initially evaluated and compared, and Intel recommended the Intel-optimized Neuralchat-7B[x] model.

Twixor turned to the Neuralchat-7B model, a Mistral-based model fine-tuned by Intel. Neural-Chat-v3-1 is a 7B parameter LLM fine-tuned on the Intel Gaudi 2 processor from mistralai/Mistral-7B-v0.1[xi] using the open-source dataset Open-Orca/SlimOrca. The model was aligned using the Direct Preference Optimization (DPO) method with Intel/orca_dpo_pairs. For more information, refer to the Medium article, The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2[xii].

 


Figure 1: NeuralChat-7b-v3 Ranked First on the 7B-sized LLM Leaderboard (November 13th, 2023)

Initial testing of the models was done on a cloud-based VM instance with 8 generic CPU cores and 32 GB of RAM. Twixor approached Intel about the need for a more performant platform for this chat function. The Intel team explained the benefits of leveraging 4th Gen Intel Xeon Scalable processors with Intel AMX for LLM use cases and convinced Twixor to evaluate this platform.

After Intel was engaged, a 48-core CPU machine based on 4th Gen Intel Xeon Scalable processors was made available for testing. The requirements of the chat application were to achieve latencies below 6 seconds for the first 90 tokens.
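The latency target can be checked with a simple timing harness like the hypothetical sketch below, which averages the wall-clock time for a generation call capped at 90 new tokens; `generate_fn` stands in for whichever model wrapper is being benchmarked.

```python
import time

def first_n_token_latency(generate_fn, prompt: str, n_tokens: int = 90, runs: int = 5) -> float:
    """Average wall-clock seconds for generate_fn to produce its first n_tokens tokens."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt, max_new_tokens=n_tokens)  # e.g. a thin wrapper around model.generate
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example usage: latency = first_n_token_latency(my_chat_model, "Where is my order?")
```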

The testing was performed on an on-premises VMware vSphere virtual machine running on a 4th Gen Intel Xeon Scalable processor-based host. The hypervisor was VMware ESXi 8.0.1 (build 21495797) running on a QuantaGrid D54Q-2U with 2 x Intel Xeon Platinum 8480 processors (56 cores each, hyperthreading enabled) and a total of 512 GB of RAM.

| Serial | Model | Data Type | Model Size | Server Config | First 90 Token Latency (seconds) |
|---|---|---|---|---|---|
| 1 | Fastchat-T5 | Bfloat16 | 3B | 8-core CPU, 32 GB RAM | 40-50 |
| 2 | Fastchat-T5 | Bfloat16 | 3B | 48-core CPU, 240 GB RAM | 3-5 |
| 3 | Zephyr-7B-Beta | Bfloat16 | 7B | 8-core CPU, 32 GB RAM | 100-120 |
| 4 | Zephyr-7B-Beta | Bfloat16 | 7B | 48-core CPU, 240 GB RAM | 35-50 |

Table 2: Results of initial model profiles

The results from the initial testing are shown in Table 2. The latency for the 48-core server was within the SLA set forth by Twixor for FastChat-T5 with Bfloat16. Based on these results, it was decided that further testing would proceed only on the 4th Gen Intel Xeon Scalable processor virtual machine configurations.

Intel convinced Twixor to evaluate the Intel-tuned Neuralchat-7B model on various configurations of the 4th Gen Intel Xeon Scalable processor with Intel AMX enabled.

 

Results

In their optimization efforts, Twixor employed lower-precision data types such as INT4 and INT8 to significantly reduce inference latency, achieving a 4x reduction in latency. There was no impact on chat accuracy with lower precision, so INT4 was chosen as the data type for the final benchmarking. Table 3 shows the improvement in performance with optimizations for the INT4 data type.

| Serial | Model | Data Type | Model Size | Server Config | First 90 Token Latency (seconds) |
|---|---|---|---|---|---|
| 1 | Neuralchat-7B | Bfloat16 | 7B | 48-core CPU, 240 GB RAM | 15-20 |
| 2 | Neuralchat-7B | INT8 | 7B | 48-core CPU, 240 GB RAM | 4-5 |
| 3 | Neuralchat-7B | INT4 | 7B | 48-core CPU, 240 GB RAM | 2-3 |

Table 3: Test results for Neuralchat-7B for different precisions

Machine sizes used for Testing:

Having shown INT4 to be performant and accurate with Neuralchat-7B, the VM sizing was varied to see whether configurations smaller than 48 vCPUs could still meet the latency requirement. The following virtual machines were used for the testing:

 

  • 4th Gen Intel® Xeon® Scalable processors: 48 cores, 240 GB RAM
  • 4th Gen Intel® Xeon® Scalable processors: 24 cores, 128 GB RAM
  • 4th Gen Intel® Xeon® Scalable processors: 12 cores, 64 GB RAM

The inference process for these chat applications was CPU bound, so the memory capacity of these machines beyond a particular amount had no impact on the results. The main comparison performed was the chat latency for the first 90 tokens for each of the configurations.
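Because inference is CPU-bound, matching the thread count to the VM's vCPU allocation matters more than adding memory. The settings below are a common, illustrative way to do this for PyTorch-based inference on Intel CPUs, not the exact tuning used in these tests.

```python
import os

# Match thread counts to the VM's vCPU allocation (e.g. 48, 24, or 12 cores).
num_cores = 24
os.environ.setdefault("OMP_NUM_THREADS", str(num_cores))
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")  # common Intel OpenMP pinning

import torch  # import after setting the OpenMP variables so they take effect

torch.set_num_threads(num_cores)  # intra-op parallelism for CPU inference
print(f"PyTorch will use {torch.get_num_threads()} threads")
```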

NeuralChat-7B testing on 4th Gen Intel Xeon Scalable processors:

Configuration 1: Run on 4th Gen Intel Xeon Scalable processor: 48 cores, 240 GB RAM


 

Figure 2: Latency measurement for 48-core virtual machine.

The 48-core configuration was used to test Inference Latency for the first 90 tokens, resulting in a latency of 2.738 secs averaged across multiple runs, which was well within the acceptable range.

Configuration 2: Run on 4th Gen Intel Xeon Scalable processor: 24 cores, 128 GB RAM


Figure 3: Latency measurement for 24-core virtual machine.

The 24-core configuration was used to test Inference Latency for the first 90 tokens, resulting in a latency of 3.528 secs averaged across multiple runs, which was also well within the acceptable range.

Configuration 3: Run on 4th Gen Intel Xeon Scalable processor: 12 cores, 64 GB RAM


Figure 4: Latency measurement for 12-core virtual machine.

The 12-core configuration was used to test Inference Latency for the first 90 tokens, resulting in a latency of 5.210 secs averaged across multiple runs, which was at the high end of acceptable.

Results Summary

| Serial | Server Config for Neuralchat-7B Inference Testing with INT4 Precision | First 90 Token Latency (seconds) |
|---|---|---|
| 1 | 48-core CPU, 240 GB RAM, 4th Gen Intel Xeon Scalable processor | 2.7 |
| 2 | 24-core CPU, 128 GB RAM, 4th Gen Intel Xeon Scalable processor | 3.5 |
| 3 | 12-core CPU, 64 GB RAM, 4th Gen Intel Xeon Scalable processor | 5.2 |

Table 4: Latency for the first 90 tokens at INT4 precision for different machine sizes

The results show that two of the configurations tested provide an acceptable latency of around 3 seconds for the first 90 tokens of the chat.

 

Conclusion

The collaboration between Twixor and Intel has led to the successful deployment of LLMs for chat applications that not only meet—but exceed—Twixor’s performance requirements. The use of 4th Gen Intel Xeon Scalable processors with Intel AMX has proven to be a cost-effective and powerful alternative to GPUs, offering low-latency and high-accuracy chatbot interactions.

The value proposition of this solution lies in its ability to deliver enterprise-grade customer service automation with reduced infrastructure costs and improved scalability. By leveraging Intel's optimized hardware and software, Twixor has been able to enhance its customer service offerings, providing businesses with a competitive edge in the market.

This study serves as a testament to the potential of utilizing CPUs in handling demanding AI workloads and paves the way for more businesses to adopt AI technologies without the prohibitive costs associated with GPUs. It showcases Intel's commitment to democratizing AI technologies and providing accessible, compelling solutions for a wide range of LLM applications.

 

Disclosure

Tests were performed Feb-April 2024 in a VMware-based lab environment in Oregon hosted by Intel. The 48-core VM was sized so that it would fit inside a NUMA boundary. Tests were performed by a team from Twixor with troubleshooting and guidance from the Intel team. The physical profile of the server was 2 X 48 core processors with a total of 512 GB of memory.

 

References

[i] Twixor Actionable Instant Messaging: A platform offering seamless business-customer interactions through Conversational AI  

[ii] Twixor CX Automation Platform: An enterprise-grade platform enabling CX optimization with rich cards, hybrid chat, and omnichannel customer journeys

[iii] Hugging Face: A French-American company developing computational tools for machine learning applications, notably for natural language processing, offering a large repository of open-source large language models

[iv] Haystack by Deepset: An open-source Python framework for building production-ready LLM applications, offering comprehensive tooling for NLP project life cycles  

[v] Intel Extension for Transformers on GitHub: A toolkit designed to accelerate Transformer-based models on Intel platforms, offering state-of-the-art compression techniques for LLMs

[vi] Intel Extension for PyTorch on GitHub: Source code and getting-started instructions for the extension that optimizes PyTorch performance on Intel hardware

[vii] Hugging Face Flan-T5 Model Documentation: Documentation for the Flan-T5 model, a part of Hugging Face's transformers library for natural language processing tasks

[viii] Hugging Face Fastchat-T5-3B Model: A model on Hugging Face designed for fast and efficient chat applications

[ix] Hugging Face Zephyr-7B Beta Model: A beta version of the Zephyr-7B model available on Hugging Face for advanced NLP tasks

[x] Hugging Face Neural Chat 7B Model: A 7B-sized LLM named NeuralChat-v3-1 for fine-tuning on Gaudi2, showcased on Hugging Face

[xi] Hugging Face Mistral-7B Model: The Mistral-7B v0.1 model on Hugging Face, designed for a variety of NLP applications

[xii] Medium Article on Supervised Finetuning on Habana Gaudi2: Discusses the practice of supervised finetuning and direct preference optimization on Habana Gaudi2

 

Acknowledgements

We are profoundly grateful for the Intel AI Customer Engineering Team led by Anish Kumar and the sustained contributions of his team members Vasudha Kumari and Vishnu Madhu for their guidance and engagement with Twixor. We would also like to thank AAUM Analytics for working with the Intel team on behalf of Twixor on this solution.

About the Author
Mohan Potheri is a Cloud Solutions Architect with more than 20 years in IT infrastructure and in-depth experience in cloud architecture. He currently focuses on educating customers and partners on Intel capabilities and optimizations available on Amazon AWS. He is actively engaged with the Intel and AWS partner communities to develop compelling solutions with Intel and AWS. He is a VMware vExpert (VCDX #98) with extensive knowledge of on-premises and hybrid cloud. He also has extensive experience with business-critical applications such as SAP, Oracle, SQL, and Java across UNIX, Linux, and Windows environments. Mohan is an expert in AI/ML and HPC and has been a speaker at multiple conferences, including VMworld, GTC, ISC, and other partner events.