
Deploy your own LLM Chatbot and Accelerate Generative AI Inferencing with Intel® AMX

LucasMelo

Hey fellow developers! We are back and ready to “turn the volume up” by getting hands-on with the Intel® Optimized Cloud Modules, showcasing how to perform GenAI inferencing on our 4th Gen Intel® Xeon® Scalable processors.

Accelerate Your Generative AI Inferencing

Did you know our latest 4th Gen Intel® Xeon® Scalable processor has a built-in AI accelerator? That’s right: an AI accelerator built right into the CPU that can perform high-throughput generative AI inferencing and training without needing a dedicated GPU. This allows you to use CPUs for traditional workloads as well as AI, keeping your overall TCO low.

Intel® Advanced Matrix Extensions (Intel® AMX) is a new built-in accelerator that improves the performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing (NLP), image generation, recommendation systems, and image recognition. It specializes in the bfloat16 and int8 data types.
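
If you want to confirm that a VM actually exposes these instructions, the AMX feature flags are visible from Linux. Here is a minimal check, assuming a Linux guest with /proc/cpuinfo available:

# List the AMX-related CPU flags; on a 4th Gen Xeon you should see amx_tile, amx_bf16, and amx_int8
grep -o 'amx[a-z0-9_]*' /proc/cpuinfo | sort -u

The amx_bf16 and amx_int8 flags correspond directly to the two data types the accelerator specializes in.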

And, if you were not aware, the 4th Gen Intel Xeon processor is generally available on GCP (C3, H3 instances) and AWS (m7i, m7i-flex, c7i, and r7iz instances) today.

Instead of just talking about it, let’s deploy your own FastChat GenAI LLM chatbot on the 4th Gen Intel Xeon processor. Let’s go!


Intel® Optimized Cloud Modules and Intel® Optimized Cloud Recipes

Before we get into the code, here are a couple of updates. At Intel, we are working hard to make it easy for Developers and DevOps teams to consume our technologies. One step towards that end was the development of Intel’s Optimized Cloud Modules. Today, I want to introduce you to the modules’ companion, our Intel® Optimized Cloud Recipes, or OCRs.

What are Intel Optimized Cloud Recipes?

The Intel Optimized Cloud Recipes (OCRs) integrate with our cloud modules and focus on optimizing the operating system and software, using Red Hat Ansible and Microsoft PowerShell.


This is How We Do It

Enough reading; let’s shift our focus to using our GCP Virtual Machine module integrated with the FastChat OCR. Using the module and OCR, you will deploy your own generative AI LLM chatbot solution on the 4th Gen Intel Xeon processor. We’ll then showcase the power of the Intel AMX built-in accelerator by inferencing without a dedicated GPU.

Prerequisite: You need cloud account access and permissions to provision VMs on GCP or AWS.

Deployment: GCP Steps

Follow the steps below; for detailed instructions, see the module README.md (example below).

Usage

  1. Log on to the GCP portal
  2. Enter the GCP Cloud Shell (click the terminal button at the top right of the portal page)
  3. Run the following commands in order:

git clone https://github.com/intel/terraform-intel-gcp-vm.git
cd terraform-intel-gcp-vm/examples/gcp-linux-fastchat-simple
terraform init
terraform apply

# Enter your GCP Project ID and "yes" to confirm
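
One housekeeping note: the VM accrues charges while it runs, so when you are done experimenting you can tear everything down from the same directory with standard Terraform:

# When finished, remove the VM and its associated resources to stop incurring charges
terraform destroy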

Running the Demo

  1. Wait approximately 10 minutes for the recipe to download and install FastChat and the LLM model before continuing.
  2. SSH into the newly created GCP VM
  3. Run:  source /usr/local/bin/run_demo.sh
  4. On your local computer, open a browser and navigate to http://<VM_PUBLIC_IP>:7860.
    Get the VM’s public IP from the “Compute Engine” section of the GCP console.
  5. Or use the https://xxxxxxx.gradio.live URL that is generated during the demo startup (see on-screen logs)

After starting the demo (Step 3) and browsing to the application (Step 4), you should be able to “chat” and see Intel AMX in action for yourself.
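
If you would like evidence that Intel AMX kernels are actually executing, one option is oneDNN’s verbose logging. Assuming FastChat’s CPU inference path runs through PyTorch’s oneDNN backend, exporting the variable before starting the demo makes each primitive report the instruction set it dispatched to:

# Optional: enable oneDNN verbose logging before launching the demo (assumes a oneDNN-backed runtime)
export ONEDNN_VERBOSE=1
source /usr/local/bin/run_demo.sh
# AMX-dispatched primitives report an ISA string such as avx512_core_amx in the verbose output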


Deployment: Intel Developer Cloud (when not using GCP/AWS)

You can also use the Intel Developer Cloud to provision a virtual machine backed by the 4th Gen Intel Xeon Scalable processor.

Follow the Intel Developer Cloud instructions to provision the virtual machine. Once it is provisioned:

  1. SSH into the virtual machine following the Intel Developer Cloud instructions.
  2. Follow the Intel® Optimized Cloud Recipe instructions for AI to execute the automated recipe and start the LLM chatbot (a sketch of what this involves follows this list)
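
To give a feel for what the automated recipe does, here is an illustrative sketch of fetching and running it by hand. The repository URL is the real intel/optimized-cloud-recipes project, but the recipe directory and playbook name below are assumptions; check the repo’s README for the actual paths:

# Illustrative sketch only: recipe directory and playbook name are assumptions
git clone https://github.com/intel/optimized-cloud-recipes.git
cd optimized-cloud-recipes/recipes   # browse the available recipes here
sudo ansible-playbook ./recipe.yml   # hypothetical playbook name; see the recipe README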

GenAI Inferencing: 4th Gen Intel Xeon Scalable Processors with Intel AMX

Thank you for following along; hopefully you got some hands-on experience with generative AI inferencing! The 4th Gen Intel Xeon Scalable processors with Intel AMX can help you accelerate your AI workloads and build the next generation of AI applications. By leveraging our modules and recipes, you can easily enable generative AI inferencing and start reaping its benefits. Developers, researchers, and data scientists alike can take generative AI to the next level.

See you all next time!

Looking for more resources? Here are some helpful links:

Intel Developer Cloud

4th Gen Intel® Xeon® Scalable processors

Intel® Advanced Matrix Extensions (Intel® AMX)

GCP VM Module with FastChat OCR Integration

Optimized Cloud Recipe

FastChat