The rapid advancement of AI has led to widespread deployment of transformer-based models across cloud infrastructure. While effective, this approach introduces challenges related to cost, latency, and data privacy, particularly in real-time applications such as transcription and summarization. Cloud-based inference also creates dependencies on network connectivity and centralized resources, which may not align with user expectations for responsiveness or control. As demand grows for AI that operates closer to the user, local inference on consumer devices is emerging as a scalable alternative.
Intel AI PCs with integrated NPUs are designed to meet this demand. Through a recent collaboration, Fluid Inference, the team behind Slipbox — a privacy-focused AI meeting assistant — successfully deployed transformer models including Whisper v3 Turbo, Qwen3, and Phi-4-mini directly on Intel® Core™ Ultra processors.
These models now run entirely on-device, delivering real-time functionality without relying on cloud services. The same NPU-optimized models have also been integrated into a native AI application that a Fortune 100 company is developing for its next generation of hardware.
Optimized Transformer Models Go Local with Intel® AI PCs
The transcription model Whisper v3 Turbo and the language models Qwen3 and Phi-4-mini are transformer models typically associated with cloud-based workloads and GPU-heavy infrastructure. But they're now running entirely on consumer laptops — powered by Intel® NPUs. Whisper v3 Turbo supports real-time transcription and voice dictation, while Qwen3 (an LLM) and Phi-4-mini (an SLM) handle language understanding tasks such as summarization, reasoning, and question answering.
Slipbox is one of the first applications to ship these models on Intel® AI PCs. Fluid Inference collaborated with Intel to adapt these state-of-the-art models for local use, enabling real-time transcription, speaker diarization, and intelligent summarization directly on-device.
Enabling AI-Native Applications with On-Device Inference
Modern AI applications increasingly rely on large transformer models, but until recently these models were largely out of reach for on-device environments. Developers frequently assumed that applications involving transcription, reasoning, or summarization required either cloud infrastructure or discrete GPUs, creating challenges for use cases that demand privacy, responsiveness, or low power consumption. These issues are especially apparent during live meetings.
Intel® Core™ Ultra processors introduced powerful integrated NPUs capable of accelerating AI workloads locally. However, many of the latest open-source models, including Whisper v3 Turbo and Phi-4-mini, had not yet been optimized for this hardware.
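For developers who want to confirm that a given machine exposes this hardware, the OpenVINO runtime reports the integrated accelerator as a device named "NPU". The following is a minimal sketch using the public openvino Python package; it is illustrative only and not code from the Slipbox or Fluid Inference projects.

```python
# Minimal sketch: verify that OpenVINO can see the integrated NPU on an
# Intel Core Ultra machine. Requires: pip install openvino
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)   # e.g. ['CPU', 'GPU', 'NPU']

if "NPU" in core.available_devices:
    # Human-readable device name reported by the NPU plugin
    print("NPU:", core.get_property("NPU", "FULL_DEVICE_NAME"))
```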
The Fluid Inference team encountered this ecosystem gap firsthand while building Slipbox's Windows version, which needed to deliver high performance without relying on cloud services or consuming excess power through CPU/GPU inference.
Optimizing Transformer Models for Intel® NPU
To address this ecosystem gap, Fluid Inference partnered with Intel to optimize these transformer models for on-device execution. Using the OpenVINO™ toolkit, the models were adapted to run efficiently on Intel NPUs. Benchmarks showed up to 40% latency reduction compared to CPU baselines and accuracy comparable to GPU-based inference, including real-time audio processing with no degradation in transcription quality. These improvements made it possible for Slipbox to operate truly locally, delivering privacy-first AI without compromising responsiveness or battery life.
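The exact adaptation pipeline isn't published in this article, but the general OpenVINO flow for targeting the NPU looks roughly like the sketch below. The IR file name is a placeholder, and the example assumes a model exported with static input shapes.

```python
# Generic OpenVINO flow sketch: load an already-converted IR model and compile
# it for the integrated NPU instead of the CPU or GPU.
# "model.xml" is a placeholder for any OpenVINO IR file (with its .bin weights).
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")
compiled = core.compile_model(model, device_name="NPU")

# Run one inference with dummy data shaped like the model's first input
# (assumes static shapes and a float32 input; adjust for the real model).
shape = list(compiled.input(0).shape)
dummy = np.zeros(shape, dtype=np.float32)
result = compiled([dummy])[compiled.output(0)]
print("Output shape:", result.shape)
```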
The same NPU-optimized models were deployed in a second real-world application: an AI-native application developed by a Fortune 100 company for its next generation of devices. This enterprise-grade application required strict privacy, high throughput, and a seamless end-user experience—all delivered using on-device inference powered by Intel hardware.
Intel AI PCs: A Platform for Local, Scalable AI
These successful transformer model deployments show that Intel AI PCs are capable of running complex AI workloads traditionally associated with cloud compute. Transformer models like Whisper and Phi-4-mini can now run natively on laptops and desktops, opening up new possibilities for developers and enterprises seeking to bring AI directly to the edge.
The engineering effort behind these deployments was led by Fluid Inference, an applied AI lab focused on advanced model optimization for edge devices. Their work converting and tuning these models for Intel NPUs made both their Slipbox product and the Fortune 100 AI deployment possible in a matter of weeks.
The Work: Optimizing AI for Local Use
To enable these on-device AI deployments, Fluid Inference collaborated with Intel in May and June 2025 to bridge the gap between state-of-the-art transformer models and NPU-capable hardware. The joint effort focused on five stages:
- Model Selection: Whisper v3 Turbo was chosen for real-time speech transcription, while Qwen3 and Phi-4-mini were selected for summarization, question answering, and reasoning tasks;
- Model Adaptation: using OpenVINO, the models were converted and optimized for low-latency, power-efficient inference on Intel NPUs (a conversion sketch follows this list);
- Performance Validation: benchmarking confirmed up to 40% latency reduction, lower power usage, and real-time processing with no degradation in accuracy;
- Deployment: the optimized models were integrated into two production-grade applications: Slipbox for Windows (now in private beta) and a native AI application developed by a Fortune 100 company, which is being productionized for a 2026 deployment;
- Open Source Release: the NPU-optimized models were made publicly available on Hugging Face, with additional developer tooling in progress, including a native .NET library to support GenAI workloads in .NET apps.
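Fluid Inference's actual conversion recipe isn't reproduced here. As a rough illustration of the Model Adaptation step, one common route is optimum-intel, which exports a Hugging Face checkpoint to OpenVINO IR with optional weight compression. The model ID below is a public checkpoint used purely as an example, and the output folder and 4-bit setting are assumptions rather than the settings used for the Slipbox models.

```python
# Sketch of a model adaptation step with optimum-intel.
# Requires: pip install "optimum[openvino]"
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"   # public checkpoint, used as an example
quant = OVWeightQuantizationConfig(bits=4)   # 4-bit weight compression for a small footprint

model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,                  # convert from PyTorch to OpenVINO IR on the fly
    quantization_config=quant,
)
model.save_pretrained("phi-4-mini-ov")        # writes openvino_model.xml / .bin

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("phi-4-mini-ov")
```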
The Results: Power Savings and Accuracy
The collaboration delivered tangible outcomes across multiple deployment environments. Whisper v3 Turbo achieved real-time transcription with no degradation in accuracy compared to GPU-based inference, while cutting latency by roughly 40% (from 0.31 s to 0.19 s per segment). Speaker diarization was enabled by optimizing the PyAnnote and WeSpeaker models for Intel® NPUs. The language models Qwen3 and Phi-4-mini demonstrated strong on-device performance, reaching approximately 70–75% of GPT-4 quality on summarization and factual QA tasks, with a memory footprint small enough to fit within typical consumer hardware constraints.
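The latency figures above come from Fluid Inference's benchmarking of the full audio pipeline. Purely as an illustration, a CPU-versus-NPU comparison of that general shape can be scripted against any exported IR model; the file name, input handling, and iteration count below are placeholders.

```python
# Sketch of a CPU-vs-NPU latency comparison of the kind reported above.
# "whisper_encoder.xml" is a placeholder IR file; a real benchmark would feed
# actual audio features and use many more iterations.
import time
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("whisper_encoder.xml")

def mean_latency(device: str, runs: int = 20) -> float:
    compiled = core.compile_model(model, device_name=device)
    dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)
    compiled([dummy])                                   # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        compiled([dummy])
    return (time.perf_counter() - start) / runs

for device in ("CPU", "NPU"):
    print(f"{device}: {mean_latency(device):.3f} s per inference")
```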
These results enabled Fluid Inference to ship Slipbox's Windows beta and the Fortune 100 company to complete a successful proof of concept for their AI-native application. Both applications run fully offline, marking a significant advancement in deploying AI locally on Intel AI PCs.
What's Available Now
Slipbox is currently in private beta on Windows, offering real-time transcription, speaker diarization, and summarization directly on Intel® AI PCs—all without requiring internet access. Meanwhile, the Fortune 100 company's AI-native application has completed its proof-of-concept stage and is entering production rollout, showcasing scalable, on-device processing. Developers and researchers can access the underlying NPU-optimized models via Hugging Face, where Fluid Inference maintains a public repository of accelerator-optimized models. Open-source tooling, including a native .NET library (in active development), is also available to support generative AI deployment on Windows AI PCs.
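For developers starting from published checkpoints, a minimal Python sketch of that path might look like the following. The repository ID is a placeholder (substitute a real model from Fluid Inference's Hugging Face organization), and the openvino-genai and huggingface_hub packages are assumed to be installed.

```python
# Sketch: download a pre-converted OpenVINO model from Hugging Face and run it
# on the NPU with OpenVINO GenAI. The repository ID below is a placeholder.
# Requires: pip install openvino-genai huggingface_hub
from huggingface_hub import snapshot_download
import openvino_genai as ov_genai

model_dir = snapshot_download("<org>/<npu-optimized-llm>")   # placeholder repo ID

pipe = ov_genai.LLMPipeline(model_dir, "NPU")                # target the integrated NPU
print(pipe.generate("Summarize the key points of this meeting:", max_new_tokens=128))
```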
About the Team
Fluid Inference (the same team behind Slipbox) enables rapid deployment of advanced AI models on edge devices, working closely with hardware providers to optimize AI for real-world conditions. The team works at the forefront of applied AI and brings deep AI and infrastructure experience from industry leaders including Databricks, LinkedIn, Amazon, and Microsoft. Their platform played a key role in adapting and optimizing Whisper v3 Turbo, Qwen3, and Phi-4-mini for Intel NPUs in a matter of weeks.
Final Notes: AI PCs as Real AI Platforms
This collaboration between Intel and Fluid Inference proves that Intel AI PCs are more than just productivity machines — they're genuine AI compute platforms.
- The same models typically reserved for cloud and GPU deployments can now run efficiently on Intel AI laptops
- NPUs deliver strong performance per watt without compromising model quality
- Local AI is no longer a compromise; it's a competitive advantage
- Production-grade AI workloads can run offline, privately, and in real time
In summary, the results are impressive:
- Slipbox for Windows is now live in beta for Intel AI PCs
- A Fortune 100 company's AI application is moving toward production after a successful proof of concept
- Open-source models and tools are available to any developer building local AI experiences
- A .NET GenAI library is on the way, making it even easier to deploy models using OpenVINO on Windows
Fluid Inference's work with Intel has unlocked a new tier of performance and privacy for customers looking to deploy transformer models on local devices. Developers and companies interested in building on-device, NPU-powered AI products can now follow the same path to bring AI-native applications to production.
Intel Resources for Developers