
NeuralChat: A Customizable Chatbot Framework

Ramya_Ravi

Creating Your Own Chatbot in Just a Few Minutes


This article was originally published on medium.com

Posted on behalf of:

Liang Lv, Xuhui Ren, Xinyu Ye, Kaokao Lv, Qun Gao, Feng Tian, and Haihao Shen, Intel Corporation

NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers, provides an easy-to-use API to quickly build a chatbot on multiple architectures (e.g., Intel® Xeon® Scalable processors and Intel® Gaudi® accelerators). NeuralChat is built on top of large language models (LLMs) and supports fine-tuning, optimization, and inference. It also offers a rich set of plugins that let users make their chatbots smarter with knowledge retrieval, more interactive through speech, faster through query caching, and more secure with guardrails.

[Figure: NeuralChat Components]

Getting Started

NeuralChat is available as one component of Intel Extension for Transformers. Just run the following command to install it:

pip install intel-extension-for-transformers

NeuralChat provides an easy-to-use Python API to create a chatbot:

# Create a chatbot locally
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
config = PipelineConfig()
chatbot = build_chatbot(config)
# Use the chatbot to make a prediction
response = chatbot.predict("Tell me about Intel® Xeon® Scalable Processors.")
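
By default, build_chatbot loads a default NeuralChat model. To pin the pipeline to a specific model, pass it in the configuration. A minimal sketch, assuming the model_name_or_path argument shown in the project's examples and using the published Intel/neural-chat-7b-v3-1 model as an illustration:

# A minimal sketch: pin the pipeline to a specific model.
# The model_name_or_path argument follows the project's examples.
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
print(chatbot.predict("What are Intel Xeon Scalable Processors?"))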

Deploy NeuralChat as a service:

neuralchat_server start --config_file ./server/config/neuralchat.yaml

NeuralChat provides a default chatbot configuration in neuralchat.yaml. You can customize the chatbot's behavior by modifying fields in this configuration file to specify which LLM model and plugins to use:

[Figure: default chatbot configuration fields in neuralchat.yaml]
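
As an illustration, a trimmed-down neuralchat.yaml might look like the sketch below. The field names follow the repository's sample configuration, but the schema can vary between releases, so treat this as a sketch rather than the authoritative format:

# Illustrative sketch of neuralchat.yaml; field names follow the
# repository's sample config and may differ across versions.
host: 0.0.0.0
port: 80
model_name_or_path: "Intel/neural-chat-7b-v3-1"
device: "cpu"             # or "hpu" for Intel Gaudi accelerators
tasks_list: ['textchat']  # tasks the service exposes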

Once NeuralChat is started as a service, users can send requests and get responses via curl:

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Tell me about Intel Xeon Scalable Processors."}' http://127.0.0.1:80/v1/chat/completions
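
The same request can be sent from Python. This sketch assumes only the endpoint shown in the curl command above and the third-party requests package:

# Hedged sketch: POST the same JSON payload as the curl example above.
import requests

response = requests.post(
    "http://127.0.0.1:80/v1/chat/completions",
    json={"prompt": "Tell me about Intel Xeon Scalable Processors."},
)
print(response.json())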

Plugins to Augment the Chatbot

NeuralChat provides plugins that offer a rich set of LLM utilities and features to augment the chatbot’s capability. These plugins are applied in the chatbot pipeline for inference:

  • Knowledge retrieval consists of document indexing for efficient retrieval of relevant information, including dense indexing based on LangChain and sparse indexing based on fastRAG, plus document rankers to prioritize the most relevant responses.

  • Query caching enables a fast path to return responses without LLM inference, which improves chat response time.

  • Prompt optimization supports automatic prompt engineering to improve user prompts.

  • Memory controller enables efficient memory utilization.

  • Safety checker enables sensitive-content checks on the chatbot’s inputs and outputs.

You can enable, disable, or customize a plugin as follows:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins
# Enable the retrieval plugin and point it at a local document folder
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./assets/docs/"
conf = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(conf)
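
Other plugins can be toggled in the same way. In the sketch below, the cache and safety_checker attribute names are assumptions inferred from the plugin list above; verify them against your installed version:

# Hedged sketch: enable query caching and the safety checker as well.
# The plugin attribute names below are assumptions inferred from the
# plugin list above; check the installed package for the exact names.
plugins.cache.enable = True           # query caching for faster responses
plugins.safety_checker.enable = True  # sensitive-content checks on input/output
conf = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(conf)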

Fine-Tuning a Chatbot

NeuralChat supports fine-tuning a pre-trained LLM for text generation, summarization, and code generation tasks, and even for Text-To-Speech (TTS) models:

from intel_extension_for_transformers.neural_chat import finetune_model, TextGenerationFinetuningConfig
finetune_cfg = TextGenerationFinetuningConfig() # support other finetuning configs
finetuned_model = finetune_model(finetune_cfg)

This way, users can fine-tune the models with proprietary datasets for customization.
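
As a rough sketch of what that customization can look like, the fine-tuning config can be populated with Hugging Face-style argument objects. The import path for ModelArguments and DataArguments and their field names are assumptions based on the project's examples, not a guaranteed API:

# Hedged sketch: fine-tune on a custom instruction dataset.
# ModelArguments/DataArguments and their import path are assumptions;
# verify against the installed version of the package.
from transformers import TrainingArguments
from intel_extension_for_transformers.neural_chat import finetune_model, TextGenerationFinetuningConfig
from intel_extension_for_transformers.neural_chat.config import ModelArguments, DataArguments

finetune_cfg = TextGenerationFinetuningConfig(
    model_args=ModelArguments(model_name_or_path="meta-llama/Llama-2-7b-hf"),
    data_args=DataArguments(train_file="my_dataset.json"),  # proprietary data
    training_args=TrainingArguments(output_dir="./finetuned_model"),
)
finetuned_model = finetune_model(finetune_cfg)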

Optimizing a Chatbot

NeuralChat provides several model optimization technologies, such as advanced mixed precision (AMP) and weight-only quantization, to allow users to optimize chatbot inference:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, AMPConfig
pipeline_cfg = PipelineConfig(optimization_config=AMPConfig())
chatbot = build_chatbot(pipeline_cfg)

This way, the pre-trained LLM will be optimized on the fly to boost inference speed.
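
Weight-only quantization is configured through the same optimization_config field. The WeightOnlyQuantConfig name, its import path, and the dtype fields below are assumptions based on the project's examples; check the installed version for the exact API:

# Hedged sketch: use weight-only quantization instead of AMP.
# The config name, import path, and dtype fields are assumptions.
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

pipeline_cfg = PipelineConfig(
    optimization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
)
chatbot = build_chatbot(pipeline_cfg)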

Concluding Remarks

NeuralChat is now available for you to quickly create your own chatbot on multiple architectures. With plugins, you can also make your chatbot smarter with knowledge retrieval, more interactive through speech, and faster through query caching. We encourage you to try creating your own chatbot and to submit pull requests, issues, or questions on GitHub.

We encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

About the Author
Product Marketing Engineer bringing cutting-edge AI/ML solutions and tools from Intel to developers.