Very poor NPU tests, what am I doing wrong?

Pablo_BR · ‎06-05-2025

My test

#Generate the IR model

import openvino as ov
from pathlib import Path
from nncf import compress_weights

MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")
MODEL_DIR.mkdir(parents=True, existence_ok=True)

weights = models.EfficientNet_B0_Weights.DEFAULT
model = models.efficientnet_b0(weights=weights)
model.eval()

batch_size=32
ov_model = ov.convert_model(model, input=[[batch_size, 3, 224, 224]])

ov.save_model(ov_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml")

quantized_model = compress_weights(ov_model)
ov.save_model(quantized_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml")

#Implementation Testing

...
batch_size=32
num_workers=0
val_dataset = datasets.ImageNet(root=IMAGENET_VAL_DIR, split='val', transform=val_transforms)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=True)
...
compiled_model = core.compile_model(model=MODEL_PATH, device_name=device)
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)
...
for images, labels in val_loader:
    inputs = images.numpy()
    results = compiled_model(inputs={input_layer: inputs})
    output = results[output_layer]

    top1, top5 = accuracy(output, labels)
    top1_total += top1
    top5_total += top5
    total += labels.size(0)

    # Show progress bar
    elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
    num_imput = num_imput + len(images)
    print(f"\rmodel: {type_model} - device : {device} - Valid [{(num_imput):>7,}/{size_test:>7,}] - Elapsed: {elapsed} - ".replace(",", "."), end="")

batch_size:32, num_workers:0
model: quantized - device : GPU - Valid [ 49.984/ 50.000] - Elapsed: 00:03:42 - accuracy - Top-1: 77.01%, Top-5: 93.24%
model: quantized - device : CPU - Valid [ 49.984/ 50.000] - Elapsed: 00:06:50 - accuracy - Top-1: 77.01%, Top-5: 93.25%
model: quantized - device : NPU - Valid [ 49.984/ 50.000] - Elapsed: 00:12:21 - accuracy - Top-1: 77.03%, Top-5: 93.23%

model: static - device : GPU - Valid [ 49.984/ 50.000] - Elapsed: 00:03:46 - accuracy - Top-1: 77.65%, Top-5: 93.58%
model: static - device : CPU - Valid [ 49.984/ 50.000] - Elapsed: 00:06:46 - accuracy - Top-1: 77.68%, Top-5: 93.58%
model: static - device : NPU - Valid [ 49.984/ 50.000] - Elapsed: 00:12:25 - accuracy - Top-1: 77.68%, Top-5: 93.58%
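
To make the gap easier to compare, the elapsed times above can be converted to images per second. This is a small arithmetic sketch using only the figures printed in the results; the helper name to_seconds is introduced here for illustration.

```python
# Images processed per run: 1,562 full batches of 32 (drop_last=True).
IMAGES = 49_984

# Elapsed wall-clock times copied from the results above (HH:MM:SS).
elapsed = {
    ("quantized", "GPU"): "00:03:42",
    ("quantized", "CPU"): "00:06:50",
    ("quantized", "NPU"): "00:12:21",
    ("static", "GPU"): "00:03:46",
    ("static", "CPU"): "00:06:46",
    ("static", "NPU"): "00:12:25",
}


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# End-to-end throughput in images per second for each (model, device) pair.
throughput = {key: IMAGES / to_seconds(value) for key, value in elapsed.items()}

for (model, device), fps in throughput.items():
    print(f"{model:>9} on {device}: {fps:6.1f} images/s")
```

These are end-to-end figures: the timer in the loop also covers image decoding and preprocessing in the DataLoader (num_workers=0), so they probably understate the raw inference rate of each device.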

Wan_Intel · ‎06-08-2025

Hi Pablo_BR,

Thank you for reaching out to us.

I ran your code snippet and hit two errors. First, MODEL_DIR.mkdir(parents=True, existence_ok=True) raised TypeError: Path.mkdir() got an unexpected keyword argument 'existence_ok'. After I removed the 'existence_ok' argument, weights = models.EfficientNet_B0_Weights.DEFAULT raised NameError: name 'models' is not defined.
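
For reference, the keyword that Path.mkdir() accepts is exist_ok, and the name models has to be imported from torchvision before it is used. A minimal sketch of the corrected setup lines follows; the temporary directory is an assumption made here so that the example leaves no files behind.

```python
import tempfile
from pathlib import Path

# The snippet also needs this import for `models.efficientnet_b0` to resolve:
#   import torchvision.models as models

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "Models"
    # `exist_ok` (not `existence_ok`) is the keyword Path.mkdir() accepts.
    model_dir.mkdir(parents=True, exist_ok=True)
    # A second call is a no-op instead of raising FileExistsError.
    model_dir.mkdir(parents=True, exist_ok=True)
    created = model_dir.is_dir()

print(created)
```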

Could you please provide the information below so that we can further investigate the issue?

Python version
Hardware specifications
Deep learning models
Code snippet or Python script to replicate the issue

If you have any additional information that may be helpful, please share it here as well. We will continue to troubleshoot the issue once we receive the information.

Regards,

Wan

Pablo_BR · ‎06-09-2025

Hello, thank you for your interest in my question.

I'll try to answer your questions.

Hardware specifications

GEEKOM GT1 Mega
BIOS version 0.50
Date 2024-08-12

Operating system

Microsoft Windows 11 Pro (64-bit)
Build version 24H2 (10.0.26100)

Description        Intel64 Family 6 Model 170 Stepping 4
Architecture      x64
Number of cores       16
Number of threads            22
Processor base frequency            2300 MHz
Current voltage    1.6
L2 cache               18432 KB
L3 cache               24576 KB
Processor ID      0xA06A4

Graphics

Driver details

Vendor Intel Corporation
Version 32.0.101.6874
Date 2025-05-25

Video processor Intel® Arc™ Graphics Family
Device ID PCI\VEN_8086&DEV_7D55&SUBSYS_22128086&REV_08\3&11583659&0&10

Vendor Intel® Corporation
Name IntcUSB.sys

Driver details

Vendor Realtek Semiconductor Corp.
Name RTKVHD64.sys

Device details

Physical memory - Total    32 GB
Physical memory - Available        23.48 GB
Virtual memory - Total 33.84 GB
Virtual memory - Available      25.07 GB

Deep learning models

The pre-trained model and its weights come from torchvision.models (models.efficientnet_b0).

Create the OpenVINO IR model, files:

efficientnet_b0_32_static.xml
efficientnet_b0_32_static.bin

Create the OpenVINO quantized model, files:

efficientnet_b0_32_quantized.xml
efficientnet_b0_32_quantized.bin

With these two models, I validate against the torchvision.datasets.ImageNet dataset and measure the difference between running on the CPU, GPU, and NPU.

Python version

(OpenVINO_env) C:\IA\OpenVINO_env\Code>pip show torch torchvision openvino
Name: torch
Version: 2.7.0+xpu
---
Name: torchvision
Version: 0.22.0+xpu
---
Name: openvino
Version: 2025.1.0
Summary: OpenVINO(TM) Runtime

Code snippet or Python script to replicate the issue

import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import openvino as ov
import numpy as np
from pathlib import Path
from openvino import Core
import time

# Transforms
IMAGENET_VAL_DIR = Path(r"C:\IA\DataSets\ImageNet")

print('Transforms')
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Dataset and DataLoader
print('Dataset and DataLoader')
batch_size = 32
num_workers = 0
val_dataset = datasets.ImageNet(root=IMAGENET_VAL_DIR, split='val', transform=val_transforms)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=True)

# Accuracy function
print(f'Accuracy function: batch_size:{batch_size}, num_workers:{num_workers}')

def accuracy(output, target, topk=(1, 5)):
    # Return the number of correct top-k predictions in the batch for each k.
    maxk = max(topk)
    _, pred = torch.tensor(output).topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [correct[:k].reshape(-1).float().sum(0).item() for k in topk]

# Configuration
models = ['static', 'quantized']
devices = ['CPU', 'GPU', 'NPU']
MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")
core = Core()

for type_model in models:
    for device in devices:
        # Load the IR model
        if type_model == 'static':
            MODEL_PATH = MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml"
        elif type_model == 'quantized':
            MODEL_PATH = MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml"

        compiled_model = core.compile_model(model=MODEL_PATH, device_name=device)
        input_layer = compiled_model.input(0)
        output_layer = compiled_model.output(0)

        # Validation
        top1_total = 0
        top5_total = 0
        total = 0

        # Initial values for the progress bar
        size_test = len(val_loader.dataset)
        start_time = time.time()
        num_imput = 0

        for images, labels in val_loader:
            inputs = images.numpy()
            results = compiled_model(inputs={input_layer: inputs})
            output = results[output_layer]

            top1, top5 = accuracy(output, labels)
            top1_total += top1
            top5_total += top5
            total += labels.size(0)

            # Show progress bar
            elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
            num_imput = num_imput + len(images)
            print(f"\rmodel: {type_model} - device : {device} - Valid [{(num_imput):>7,}/{size_test:>7,}] - Elapsed: {elapsed} - ".replace(",", "."), end="")

        print(f"Final accuracy - Top-1: {top1_total / total:.2%}, Top-5: {top5_total / total:.2%}")

print("Process finished")

If you have any further questions, please don't hesitate to contact me.

Thank you.

Wan_Intel · ‎06-10-2025

Hi Pablo_BR,

Thank you for sharing the information with us.

We will further investigate the issue and provide an update here as soon as possible.

Regards,

Wan

Wan_Intel · ‎06-11-2025

Hi Pablo_BR,

Thank you for your patience.

I have downloaded the efficientnet_b0 model and converted it into Intermediate Representation with the following code:

import torchvision
import torch
import openvino as ov

model = torchvision.models.efficientnet_b0(weights='DEFAULT')
ov_model = ov.convert_model(model)
ov.save_model(ov_model, 'model.xml')

However, I encountered an error when running inference with the NPU plugin. The code worked fine with the CPU and GPU plugins.

Could you please share the following model files with us so that we can investigate further? For example, you may share them via Google Drive.

efficientnet_b0_32_static.xml
efficientnet_b0_32_static.bin
efficientnet_b0_32_quantized.xml
efficientnet_b0_32_quantized.bin

Regards,

Wan

Pablo_BR · ‎06-12-2025

Hello, happy to help.

I use this code to create the models.
Please run it to generate the model files.

import torch
import torchvision.models as models
import openvino as ov
from pathlib import Path
from nncf import compress_weights

# Configuration
MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Get the default weights using the model's weights enum
weights = models.EfficientNet_B0_Weights.DEFAULT

# Load the pre-trained PyTorch model
model = models.efficientnet_b0(weights=weights)
model.eval()

# Convert to OpenVINO IR (Intermediate Representation) format with a static input shape.
# This format, which consists of an XML file for the network topology and a BIN file for the weights and biases,
# is highly optimized for efficient inference on Intel hardware.
batch_size = 32
ov_model = ov.convert_model(model, input=[[batch_size, 3, 224, 224]])

# Save the IR model
ov.save_model(ov_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml")
print(f"IR model saved in: {MODEL_DIR}")

# Quantization is a very effective technique for accelerating inference on low-precision hardware like NPUs.
# If your model allows it, quantizing the model to INT8 can result in significant performance gains with minimal loss of accuracy.
# Note: compress_weights() compresses the weights only; activations stay in floating point.
quantized_model = compress_weights(ov_model)

# Save the quantized model
ov.save_model(quantized_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml")
print(f"Quantized model saved in: {MODEL_DIR}")
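
The script above calls compress_weights(), which is NNCF's weight-only compression. If the goal is INT8 weights and activations, NNCF also provides nncf.quantize(), which needs a calibration dataset. The following is only a sketch under assumptions: the helper names transform_fn and quantize_full_int8 are hypothetical, the calibration data is assumed to come from the same val_loader used in the test script, and the subset_size value is illustrative. Whether this changes the NPU timings reported in this thread is not something the thread establishes.

```python
def transform_fn(data_item):
    """Turn one (images, labels) DataLoader item into the model input."""
    images, _labels = data_item
    return images.numpy()


def quantize_full_int8(ov_model, val_loader, subset_size=300):
    """Sketch: quantize weights and activations with nncf.quantize()."""
    import nncf  # lazy import so transform_fn stays usable without NNCF installed

    # nncf.Dataset wraps an iterable plus a function that extracts model inputs.
    calibration_dataset = nncf.Dataset(val_loader, transform_fn)
    return nncf.quantize(ov_model, calibration_dataset, subset_size=subset_size)
```

The result could then be saved with ov.save_model() in the same way as the other two models.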

If you have any further questions, please don't hesitate to contact me.

Thank you very much.

Wan_Intel · ‎06-14-2025

Hi Pablo_BR,

Thank you for sharing the code to generate the model files with us.

We will further investigate the issue, and we will provide an update here as soon as possible.

Regards,

Wan

Wan_Intel · ‎06-16-2025

Hi Pablo_BR,

Thank you for your patience.

I have downloaded the efficientnet_b0 model and converted it into static and quantized Intermediate Representation. I ran both models on the CPU, GPU, and NPU plugins with your Python code, and I observed a similar issue:

CPU plugin:

cpu result.png

GPU plugin:

NPU plugin:

npu result.png

I will escalate the case to the relevant team, and we will provide an update here as soon as possible.

Regards,

Wan

Wan_Intel · ‎06-28-2025

Hi Pablo_BR,

Thank you for your patience. We have received feedback from the relevant team.

Batching is in an experimental state for the Intel® NPU plugin and is not recommended, especially in performance tests. Please reduce the batch size to 1 on the Intel® NPU plugin and try different performance_hint modes (tput or latency) and different numbers of async infer requests in tput mode.

On another note, you may try setting the batch size to 2, but there is no point in running a batch larger than 2 because of hardware resource limitations and the way the batching algorithm uses them.

Every topology may have a different performance ratio in different configurations, mostly because of the Intel® NPU architecture and the optimizations applied.
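
One way to try this advice is sketched below. It is not Intel's reference code: the helper names iter_single_images and run_npu_batch1 are hypothetical, and the model file is assumed to be an IR already converted with a batch size of 1 (for example, by rerunning the conversion script with batch_size=1). PERFORMANCE_HINT and AsyncInferQueue are existing OpenVINO Python APIs.

```python
def iter_single_images(batch):
    """Yield batch-1 slices (shape [1, C, H, W]) from a [N, C, H, W] array."""
    for i in range(len(batch)):
        yield batch[i:i + 1]


def run_npu_batch1(model_path, batches, hint="THROUGHPUT", jobs=0):
    """Run batch-1 inference on the NPU with a performance hint.

    hint: "THROUGHPUT" or "LATENCY". jobs: number of parallel infer
    requests (0 lets OpenVINO choose). Returns outputs in input order.
    """
    import openvino as ov  # lazy import: only needed when actually running

    core = ov.Core()
    compiled = core.compile_model(model_path, "NPU", {"PERFORMANCE_HINT": hint})

    results = {}

    def on_done(request, index):
        # Copy the data: the request's output buffer is reused by later jobs.
        results[index] = request.get_output_tensor(0).data.copy()

    queue = ov.AsyncInferQueue(compiled, jobs)
    queue.set_callback(on_done)

    index = 0
    for batch in batches:
        for image in iter_single_images(batch):
            queue.start_async({0: image}, userdata=index)
            index += 1
    queue.wait_all()
    return [results[i] for i in range(index)]
```

With the test script from this thread, batches could be (images.numpy() for images, _ in val_loader). Each returned output covers a single image, so the outputs would need to be concatenated before being passed to the accuracy function. Trying different values of hint and jobs corresponds to the different modes and request counts mentioned above.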

Regards,

Wan

Wan_Intel · ‎07-05-2025

Hi Pablo_BR,

Thank you for your question.

If you need additional information from Intel, please submit a new question as this thread will no longer be monitored.

Regards,

Wan