
Develop and Optimize Your Own Talking Chatbot in 5 Minutes

Jack_Erickson
Employee

Intel Innovation 2023, Intel’s primary annual conference for developers, architects, and business leaders, offered an array of keynotes, announcements, sessions, demos, and a hackathon. While several sessions supported the mission of bringing AI everywhere, one in particular showed developers how to build, customize, and optimize their own talking chatbot on a range of hardware in the Intel Developer Cloud.

The hands-on lab, titled “Demystifying Generative AI: Develop and Optimize Your Own Talking Chatbot,” combined instruction from Intel’s Ke Ding and Qun Gao with hands-on exercises so developers could apply what they learned immediately.

This was all powered by Neural Chat, a software package designed specifically for building a chatbot with a few lines of code, built on Hugging Face* Transformers and Intel AI software. Getting started requires only three lines of code, which build a text-based chatbot that uses a Llama 2 model and runs inference on an Intel® Xeon® platform:

# Build a chatbot with the default configuration (a Llama 2 model on an Intel Xeon platform)
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()

# Run a text query against the chatbot
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")

Developers then added modules for voice input and output with just a few more lines of code. The input module is an automatic speech recognition (ASR) model, while the output module synthesizes audio from the chatbot’s text response.
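The lab used Neural Chat’s plugin API for this. The sketch below follows the pattern in the project’s documentation, though plugin arguments and the sample file paths here are assumptions that may vary by release:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins

# Enable automatic speech recognition for spoken input
plugins.asr.enable = True

# Enable text-to-speech so the response is synthesized as audio
plugins.tts.enable = True
plugins.tts.args["output_audio_path"] = "./response.wav"  # placeholder path

config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)

# Pass a recorded question; the spoken answer is written to the output path
response = chatbot.predict(query="./sample_question.wav")  # placeholder path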

[Figure: Neural Chat talking chatbot architecture]

The session then explored different methods for customizing a large language model (LLM) for a specific use case. These options range from low to high in terms of complexity, cost, and quality:

  • Prompt engineering. This requires no additional training or modification of the LLM, but it takes some effort at inference time to tune the prompt so the LLM provides the desired information.
  • Retrieval-augmented generation (RAG) enables the model to draw on a domain-specific knowledge base to form its responses. This allows the LLM to produce up-to-date answers as soon as new knowledge is added to the database.
  • Fine-tuning uses the domain-specific knowledge base to update the LLM with a small amount of additional training. With the growth in the size of LLMs, parameter-efficient fine-tuning (PEFT), which fine-tunes only a small number of parameters, has become popular because it reduces the compute required. Fine-tuning generally yields lower inference latency than RAG, since the information is built into the model.
  • Training from scratch. This is generally the territory of LLM developers, given the expertise and compute costs required.

Both RAG and fine-tuning are available and made easy with Neural Chat. The lab covered how to use PEFT to tune both the LLM and the TTS model on Intel® Xeon® CPUs and Intel® Gaudi®2 AI accelerators.
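For illustration, here is a sketch of enabling RAG through Neural Chat’s retrieval plugin, following the pattern in the project’s documentation; the document folder and the query are placeholders:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins

# Point the retrieval plugin at a local folder of domain documents
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./docs/"  # placeholder knowledge base

config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)

# Responses can now draw on the indexed documents
response = chatbot.predict("Summarize our product documentation.")

Fine-tuning follows a similar pattern. The configuration class below comes from the Neural Chat documentation, but its name and defaults may differ by version; with no arguments it is intended to apply a PEFT recipe out of the box:

from intel_extension_for_transformers.neural_chat import TextGenerationFinetuningConfig, finetune_model

# The default configuration applies parameter-efficient fine-tuning
finetune_cfg = TextGenerationFinetuningConfig()
finetune_model(finetune_cfg)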

Chatbots need to be able to run on a variety of devices, from the cloud to the edge. Although a wide variety of model optimization techniques is available, Neural Chat simplifies the process by providing single-line API calls that apply transformer-specific optimizations using Intel® Neural Compressor.
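For example, here is a sketch of that single-line optimization path, requesting mixed-precision (BF16) inference; the configuration class follows the project’s documentation and may vary by version:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import MixedPrecisionConfig

# Build the chatbot with mixed-precision optimization applied
config = PipelineConfig(optimization_config=MixedPrecisionConfig())
chatbot = build_chatbot(config)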

The lab concluded by showing developers how to deploy their chatbot for production as a server that can be accessed with client queries.
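The repository documents a neuralchat_server command that launches the chatbot behind a REST API. As a sketch, a client could then query it as shown below; the host, port, endpoint path, and model name are assumptions to adapt to your server configuration:

# Launch the server first (shell):
#   neuralchat_server start --config_file ./server/config/neuralchat.yaml
import requests

# Assumed host, port, and OpenAI-compatible endpoint
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Intel/neural-chat-7b-v3-1",  # assumed model name
        "messages": [
            {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
        ],
    },
)
print(resp.json())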

This video provides a brief overview of how to build and customize your chatbot using Neural Chat:

You can get started by downloading the slides from the lab or by going to the Neural Chat GitHub* repository and starting with one of the example notebooks.

We also encourage you to check out Intel’s AI/ML framework optimizations and end-to-end portfolio of tools and incorporate them into your AI workflow. You can also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.

About the Author
Technical marketing manager for Intel AI/ML products and solutions. Prior to Intel, I spent 7.5 years at MathWorks in technical marketing for the HDL product line, and 20 years at Cadence Design Systems in various technical and marketing roles for synthesis, simulation, and other verification technologies.