Artificial Intelligence (AI)
Discuss current events in AI and technological innovations with Intel® employees
857 Discussions

Unleash Fast and Optimized AI Inference with Intel® AI for Enterprise Inference

Alex_H_Sin
Moderator
1 1 2,183

Figure 1: Which is easier? Integrating and building AI or using an API to get a response?Figure 1: Which is easier? Integrating and building AI or using an API to get a response?

 

Enterprises adopting AI at scale consistently struggle with operational complexity, performance bottlenecks, and the lack of standardized architectural frameworks. Deploying large language model (LLM) inference services typically demands extensive manual configuration, heterogeneous hardware support, and orchestration expertise – barriers that slow time-to-value and inflate costs. Intel’s ecosystem surveys highlight that organizations and developers cite scalability, performance limitations, and talent shortages as the primary obstacles for generative AI deployments. As the market trends towards AI inference and agentic AI, infrastructure setup complexity and costs become increasingly important considerations for practical and feasible AI applications that generate return on investment.

 

The Solution: Intel® AI for Enterprise Inference

Intel® AI for Enterprise Inference provides an open source, automated, native LLM serving stack to deploy high performance inference services across Intel hardware on the cloud and on-premises environments. This solution aims to make AI models run best on any Intel server system, including Intel® Xeon® Scalable CPUs, with support for GPUs in the future, ensuring that enterprises can match model workloads with optimal compute economics and performance. The platform automates OpenAI compatible LLM endpoint deployment and abstracts away infrastructure complexity using Kubernetes based orchestration. The packaged solution is designed to integrate seamlessly with existing applications, allowing developers to go from testing to production without having to rewrite infrastructure or retrain workflows.

 

Figure 2: Intel® AI for Enterprise Inference Solution StackFigure 2: Intel® AI for Enterprise Inference Solution Stack

 

Architecture and Technical Components

From the hardware to the API endpoints, Enterprise Inference streamlines software deployment of AI models with just a single click. Infrastructure, orchestration, inferencing engines, and generative AI gateway layers are deployed to provide enterprises a functional and performant OpenAI compatible API endpoint to perform model inference at scale for applications and services.

 

  • NRI-balloon policy (Applicable only to Xeon-based deployments): Ensures model serving gets dedicated CPU cores, leading to optimal inference performance.

 

  • Kubernetes: A powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications, ensuring high availability and efficient resource utilization.

 

  • vLLM and SGLang: vLLM is an inferencing engine to serve models with high throughput and efficient memory management, including parallelism for distributed inference. SGLang is a high-performance serving framework designed to deliver low-latency and high-throughput from single node to distributed clusters, as well as being a post-training backbone for native reinforcement learning integrations.

 

  • GenAI Gateway: An integrated gateway leveraging LiteLLM and Langfuse to provide flexible interfaces for routing and managing generative AI models. It enables user and key management, user token telemetry, and analytics for LLM inference workflows.

 

  • Keycloak: An open-source identity and access management solution that provides robust authentication and authorization capabilities, ensuring secure access to AI services and resources within the cluster.

 

  • APISIX: A cloud-native API gateway, handling API traffic and providing advanced features caching, and authentication, enabling efficient and secure access to deployed AI models.

 

  • Observability: An open-source monitoring solution designed to operate natively within Kubernetes clusters, providing comprehensive visibility into the performance, health, and resource utilization of deployed applications and cluster components through metrics, visualization, and alerting capabilities.

 

  • Ingress NGINX Controller: A high-performance reverse proxy and load balancer for traffic, responsible for routing incoming requests to the appropriate services within the Kubernetes cluster, ensuring seamless access to deployed AI models.

 

  • Model Deployments: Automated deployment and management of AI LLM models with hardware-specific optimal model configurations within the Kubernetes inference cluster, enabling scalable and reliable AI inference capabilities.

 

Intel® AI for Enterprise Inference simplifies infrastructure setup with all the components above. Instead of spending days or weeks figuring out what software libraries and packages are needed, this software solution packages everything that is needed to achieve secure and performant AI inference at scale. This enables enterprises to quickly deploy the necessary models to power the actual AI applications for business use cases. Developers can focus on the application development rather than the costs and time needed for infrastructure setup.

 

How to Deploy and Access Models

Enterprise Inference is designed to be a packaged one-click software deployment solution. The project is open-sourced on GitHub with instructions to deploy on a single node or cluster.

The prerequisites consists of the following steps:

  1. Create SSH keys as a non-root user with sudo privileges.
  2. Satisfy network requirements based on the control plane and workload nodes, and storage requirements based on the size of the models planning to deploy.
  3. Perform DNS and SSL/TLS setup with ONE of the following environments:
    1. (For production environments) Set up a DNS for the server and install an SSL/TLS certificate from a trusted Certificate Authority onto the system.
    2. (For development environments) Map a URL to localhost and create a self-signed SSL certificate.
  4. Generate a Hugging Face token and acquire permissions to desired models.

 

There are only two files that need to be configured before deploying the solution: hosts.yaml and inference-config.cfg.

The hosts.yaml file sets up all the control plane and worker nodes in the cluster or single node system. Example:

all:
  hosts:
    master:
      ansible_host: "{{ private_ip_control_plane_node }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
    worker1:
      ansible_host: "{{ private_ip_workload_node_1 }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
    worker2:
      ansible_host: "{{ private_ip_workload_node_2 }}"
      ansible_user: "username_of_user_running_automation"
      ansible_ssh_private_key_file: "/home/ubuntu/.ssh/id_rsa"
  children:
    kube_control_plane:
      hosts:
        master:
    kube_node:
      hosts:
        worker1:
        worker2:
    etcd:
      hosts:
        master:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

 

The inference-config.cfg file consists of variables and components the user can customize to include in the deployment, ranging from the base URL to access the models, hardware to deploy on, and whether to use the APISIX/Keycloak or GenAI gateway as the interface between the models and AI applications. For example, this will configure the base URL to api.example.com, use Keycloak/APISIX for identity and access management, and deploy model #1 (Llama 3.1 8B Instruct).

cluster_url=api.example.com
cert_file=~/certs/cert.pem
key_file=~/certs/key.pem
keycloak_client_id=my-client-id
keycloak_admin_user=your-keycloak-admin-user
keycloak_admin_password=changeme
hugging_face_token=your_hugging_face_token
hugging_face_token_falcon3=your_hugging_face_token
models=1
cpu_or_gpu=gaudi3
vault_pass_code=place-holder-123
deploy_kubernetes_fresh=on
deploy_ingress_controller=on
deploy_keycloak_apisix=on
deploy_genai_gateway=off
deploy_observability=off
deploy_llm_models=on
deploy_ceph=off
deploy_istio=off
uninstall_ceph=off

 

After the setup, a single script inference-stack-deploy.sh is executed to deploy Kubernetes, the ingress NGINX controller, APISIX/Keycloak or GenAI Gateway, observability components, and model deployments with vLLM. The models can be selected using hardware specific optimal configurations or by inputting the Hugging Face model card ID. As of the time this article is written, only Hugging Face models are supported. When fully deployed, the services running may look something like this:

 

Figure 3: Kubernetes pods deployed with Enterprise Inference using the GenAI gateway with models deployed with vLLM on Intel® Gaudi®.Figure 3: Kubernetes pods deployed with Enterprise Inference using the GenAI gateway with models deployed with vLLM on Intel® Gaudi®.

 

To access the models, whether deployed using APISIX/Keycloak or GenAI gateway, use the base URL set in the config file and an API key or token. This guide explains how to generate and access the keys with examples of curl commands to test inference using the deployed models. Note the instructions are different between APISIX/Keycloak and the GenAI gateway. Here is an example of a response from a query to a Llama-3.1-8B-Instruct model:

 

Figure 4: Sample cURL command to use the model endpoint to make a query and receive a response.Figure 4: Sample cURL command to use the model endpoint to make a query and receive a response.

 

More than one model can be deployed at the same time. This gives flexibility and versatility for a single node server or cluster to power different AI applications with these models at scale. The administrators of this system can update these models by adding or removing them as needed.

 

Partners

Intel has been working with several partners to make Enterprise Inference easier to deploy and use on-premises and on the cloud. These include IBM Cloud and Dell.

IBM Cloud offers deployable architectures, which are preconfigured, automated cloud solutions that combine multiple cloud resources into a single reusable unit. These are blueprints that deploy complex systems with networking, security, and applications quickly and consistently. Enterprise Inference has its own deployable architecture which can be deployed here. It provisions virtual machines on IBM Cloud using Terraform to deploy a Kubernetes cluster on single-node and multi-node configurations. There is a full web-based user interface to customize the deployment from the software drivers and firmware to the inference software stack.

Dell has provided bare-metal automated deployment scripts to install Ubuntu 22.04, boot it using Dell iDRAC Redfish Virtual Media, and deploy Enterprise Inference on a single-node system, such as the PowerEdge XE7740, on premises. The XE7740 comes with two Xeon 6 processors and 8 PCIe Gen5 slots that can be used with eight Gaudi 3 cards to meet enterprise inference and model fine-tuning demands. This solution specifically for Dell servers enables users to provision a fresh XE7740 to get it up and running with AI models in minutes with a few commands. Users can create custom Ubuntu ISO images, mount it, and boot using Terraform. Upon reboot, Enterprise Inference can be deployed with just a single command, automating the prerequisites and setup of config files normally done manually. Both pre-validated hardware-specific optimized models and any HuggingFace model1 can be deployed.

 

What’s New: Latest Features on Release v1.5.0

The release notes for Enterprise Inference v1.5.0 can be found here. Some of the most notable new features include:

  • Additional OS Support: Extended support for Ubuntu 24.04. Previously only Ubuntu 22.04 was supported.
  • Agentic AI workflow plugin: Flowise is integrated, making it easier to construct production-ready agent workflows using drag-and-drop without coding. Pre-built agent templates available. See the quick start guide.
  • MCP tools and server: A Helm chart template is provided for integrating MCP endpoints to support a vast range of tools essential for building agentic AI systems.
  • Balloon Policy: Fixed imbalanced CPU allocations among NUMA nodes, leading to better performance running services especially inference with vLLM.

 

Common Use Cases

With these performant and secure model deployments, the next step would be to use the model endpoints in AI applications, powering everything from chatbots to summarization tools to retrieval augmented generation (RAG) to agentic AI workflows.

Dell and Intel have provided sample solutions developed specifically to work with Enterprise Inference. These sample solutions serve as blueprints powered by Enterprise Inference and can be further customized to adapt to business and application use cases. Here is a summary of the currently available sample solutions:

Sample Solution

Description

CodeTranslation

A full-stack code translation application that converts code between programming languages using AI. The system integrates a FastAPI backend, alongside a modern React + Vite + Tailwind CSS frontend for an intuitive translation experience.

DocSummarization

A full-stack document summarization application that processes text and document files to generate concise summaries with enterprise inference integration.

MultiAgentQnA

A sophisticated multi-agent Q&A application featuring intelligent task delegation to specialized agents with enterprise inference integration.

PDFtoPodcast

AI-powered application that transforms PDF documents into engaging podcast-style audio conversations using enterprise inference endpoints for script generation and OpenAI TTS for audio synthesis.

RAGChatbot

A full-stack Retrieval-Augmented Generation (RAG) application that enables intelligent, document-based question answering. The system integrates a FastAPI backend powered by LangChain, FAISS, and AI models, alongside a modern React + Vite + Tailwind CSS frontend for an intuitive chat experience.

Table 1: Sample Solutions built specifically for model endpoints deployed with Enterprise Inference

 

These are only basic examples to quickly get started. With the model endpoints that can be accessed via OpenAI APIs, they can be consumed by virtually any AI application needing AI models to do inference. Whether it is generating text or running multimodal models, Enterprise Inference sets up the hardware and necessary software infrastructure. The models need only be deployed on a single or handful of servers, rather than every single node along with the application. This frees up hardware resources that can solely be used for AI applications.

 

Key Takeaways

Intel® AI for Enterprise Inference reduces AI operational complexity by automating model deployment and resource management.

Optimized for Intel hardware, enterprises gain top tier performance and cost efficiency, especially for large language model inference.

Cloud agnostic and OpenAI compatible, enabling flexible deployment and easy integration with existing workflows in the cloud or on-premises.

Ideal for scalable enterprise grade GenAI use cases, including RAG, copilots, and high throughput inference pipelines.

 

Demo Video

For a summary of the core components of Enterprise Inference and a demo, watch this video.

 

References

To get started with Intel® AI for Enterprise Inference, refer to the resources below.

 

 

 

1. Models selected by specifying the Hugging Face model card not on the pre-validated model list are not guaranteed to be functional and/or optimized in performance.

 

 

 

 

 

 

 

 

 

 

 

 

 

1 Comment
MEIRE
Novice

materia Importante