Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Very poor NPU tests, what am I doing wrong?

Pablo_BR
Novice
6,210 Views

My test

#Generate the IR model

import openvino as ov
from pathlib import Path
from nncf import compress_weights

MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")
MODEL_DIR.mkdir(parents=True, existence_ok=True)

weights = models.EfficientNet_B0_Weights.DEFAULT
model = models.efficientnet_b0(weights=weights)
model.eval()

batch_size=32
ov_model = ov.convert_model(model, input=[[batch_size, 3, 224, 224]])

ov.save_model(ov_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml")

quantized_model = compress_weights(ov_model)
ov.save_model(quantized_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml")

#Implementation Testing

...
batch_size=32
num_workers=0
val_dataset = datasets.ImageNet(root=IMAGENET_VAL_DIR, split='val', transform=val_transforms)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=True)
...
compiled_model = core.compile_model(model=MODEL_PATH, device_name=device)
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)
...
for images, labels in val_loader:
    inputs = images.numpy()
    results = compiled_model(inputs={input_layer: inputs})
    output = results[output_layer]

    top1, top5 = accuracy(output, labels)
    top1_total += top1
    top5_total += top5
    total += labels.size(0)

    # Show progress bar
    elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
    num_imput = num_imput + len(images)
    print(f"\rmodel: {type_model} - device : {device} - Valid [{(num_imput):>7,}/{size_test:>7,}] - Elapsed: {elapsed} - ".replace(",", "."), end="")


batch_size:32, num_workers:0
model: quantized - device : GPU - Valid [ 49.984/ 50.000] - Elapsed: 00:03:42 - accuracy - Top-1: 77.01%, Top-5: 93.24%
model: quantized - device : CPU - Valid [ 49.984/ 50.000] - Elapsed: 00:06:50 - accuracy  - Top-1: 77.01%, Top-5: 93.25%
model: quantized - device : NPU - Valid [ 49.984/ 50.000] - Elapsed: 00:12:21 - accuracy  - Top-1: 77.03%, Top-5: 93.23%

model: static - device : GPU - Valid [ 49.984/ 50.000] - Elapsed: 00:03:46 - accuracy  - Top-1: 77.65%, Top-5: 93.58%
model: static - device : CPU - Valid [ 49.984/ 50.000] - Elapsed: 00:06:46 - accuracy  - Top-1: 77.68%, Top-5: 93.58%
model: static - device : NPU - Valid [ 49.984/ 50.000] - Elapsed: 00:12:25 - accuracy  - Top-1: 77.68%, Top-5: 93.58%
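
For easier comparison, the elapsed times above can be converted to throughput. This is only a minimal sketch: it assumes each run processed the 49,984 images shown in the progress counter, and the helper names are made up for illustration. Note that the elapsed time covers the whole loop, so these figures include data loading and preprocessing, not just inference.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def images_per_second(num_images, elapsed_hms):
    """Average end-to-end throughput over the whole validation loop."""
    return num_images / to_seconds(elapsed_hms)


# Elapsed times copied from the quantized-model runs listed above.
ELAPSED = {"GPU": "00:03:42", "CPU": "00:06:50", "NPU": "00:12:21"}
NUM_IMAGES = 49_984  # 1,562 full batches of 32 (drop_last=True)

for device, hms in ELAPSED.items():
    print(f"{device}: {images_per_second(NUM_IMAGES, hms):.1f} images/s")
```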

 

 

Wan_Intel
Moderator
6,072 Views

Hi Pablo_BR,

Thank you for reaching out to us.

 

I ran your code snippet, but I encountered TypeError: Path.mkdir() got an unexpected keyword argument 'existence_ok' when running MODEL_DIR.mkdir(parents=True, existence_ok=True). After I removed the 'existence_ok' argument, I encountered NameError: name 'models' is not defined when running weights = models.EfficientNet_B0_Weights.DEFAULT.

 

Could you please provide the information below so that we can further investigate the issue?

  • Python version
  • Hardware specifications
  • Deep learning models
  • Code snippet or Python script to replicate the issue


If you have any additional information that would be helpful to us, please share it here as well. We will continue to troubleshoot the issue once we receive the information.

 

 

Regards,

Wan


Pablo_BR
Novice
6,051 Views

Hello, thank you for your interest in my question.


I'll try to answer your questions.

  • Hardware specifications

GEEKOM GT1 Mega
BIOS version: 0.50
Date: 2024-08-12

Operating system

Microsoft Windows 11 Pro (64-bit)
Build version: 24H2 (10.0.26100)

Description: Intel64 Family 6 Model 170 Stepping 4
Architecture: x64
Number of cores: 16
Number of threads: 22
Processor base frequency: 2300 MHz
Current voltage: 1.6
Level 2 cache: 18432 KB
Level 3 cache: 24576 KB
Processor ID: 0xA06A4

Graphics

Driver details

Vendor: Intel Corporation
Version: 32.0.101.6874
Date: 2025-05-25

Video processor: Intel® Arc™ Graphics Family
Device ID: PCI\VEN_8086&DEV_7D55&SUBSYS_22128086&REV_08\3&11583659&0&10

Vendor: Intel® Corporation
Name: IntcUSB.sys

Driver details

Vendor: Realtek Semiconductor Corp.
Name: RTKVHD64.sys

Device details

Physical memory - Total: 32 GB
Physical memory - Available: 23.48 GB
Virtual memory - Total: 33.84 GB
Virtual memory - Available: 25.07 GB

  • Deep learning models

Obtain the pre-trained model with its weights from torchvision.models, models.efficientnet_b0

Create the OpenVINO IR model. Files:

efficientnet_b0_32_static.xml

efficientnet_b0_32_static.bin

Create the quantized OpenVINO model. Files:

efficientnet_b0_32_quantized.xml

efficientnet_b0_32_quantized.bin

With these two models, run validation on the torchvision.datasets.ImageNet dataset and measure the difference between running on the CPU, GPU, and NPU.

 

  • Python version

(OpenVINO_env) C:\IA\OpenVINO_env\Code>pip show torch torchvision openvino

Name: torch

Version: 2.7.0+xpu

---

Name: torchvision

Version: 0.22.0+xpu

---

Name: openvino

Version: 2025.1.0

Summary: OpenVINO(TM) Runtime

 

  • Code snippet or Python script to replicate the issue

import torch

import torchvision.transforms as transforms
import torchvision.datasets as datasets
import openvino as ov
import numpy as np
from pathlib import Path
from openvino import Core

import time

# Transforms
IMAGENET_VAL_DIR = Path(r"C:\IA\DataSets\ImageNet")
print('Transforms')
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Dataset and DataLoader
print('Dataset and DataLoader')
batch_size=32
num_workers=0
val_dataset = datasets.ImageNet(root=IMAGENET_VAL_DIR, split='val', transform=val_transforms)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, drop_last=True)  

# Accuracy function
print(f'Accuracy function: batch_size:{batch_size}, num_workers:{num_workers}')
def accuracy(output, target, topk=(1, 5)):
    maxk = max(topk)
    _, pred = torch.tensor(output).topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [correct[:k].reshape(-1).float().sum(0).item() for k in topk]

# Configuration

models = {'static','quantized'}
devices = {'CPU', 'GPU', 'NPU'}

MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")

core = Core()

for type_model in models:

    for device in devices:

        # Load the IR model
        if type_model == 'static':
            MODEL_PATH = MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml"
        elif type_model == 'quantized':
            MODEL_PATH = MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml"
       
        compiled_model = core.compile_model(model=MODEL_PATH, device_name=device)
        input_layer = compiled_model.input(0)
        output_layer = compiled_model.output(0)

        # Validation
        top1_total = 0
        top5_total = 0
        total = 0

        # Initial values for the progress bar
        size_test = len(val_loader.dataset)
        start_time = time.time()
        num_imput = 0  

        for images, labels in val_loader:
            inputs = images.numpy()
            results = compiled_model(inputs={input_layer: inputs})
            output = results[output_layer]

            top1, top5 = accuracy(output, labels)
            top1_total += top1
            top5_total += top5
            total += labels.size(0)

            # Show progress bar
            elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
            num_imput = num_imput + len(images)
            print(f"\rmodel: {type_model} - device : {device} - Valid [{(num_imput):>7,}/{size_test:>7,}] - Elapsed: {elapsed} - ".replace(",", "."), end="")

        print(f"Final accuracy - Top-1: {top1_total / total:.2%}, Top-5: {top5_total / total:.2%}")

print("Process finished")
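
One limitation of the script above is that the elapsed time covers the whole loop: image decoding and preprocessing in the DataLoader (running with num_workers=0) are counted together with inference. A minimal sketch of how the inference calls alone could be timed follows. InferenceTimer is a hypothetical helper written for this illustration, not part of OpenVINO.

```python
import time


class InferenceTimer:
    """Wrap a callable and accumulate the wall-clock time spent inside it."""

    def __init__(self, fn):
        self.fn = fn
        self.total_seconds = 0.0
        self.calls = 0

    def __call__(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return self.fn(*args, **kwargs)
        finally:
            # Count the call even if the wrapped callable raises.
            self.total_seconds += time.perf_counter() - start
            self.calls += 1


# Intended use inside the validation loop (names taken from the script above):
#   timed_model = InferenceTimer(compiled_model)
#   results = timed_model(inputs={input_layer: inputs})
#   ...
#   print(f"inference only: {timed_model.total_seconds:.1f} s "
#         f"over {timed_model.calls} calls")
```

Comparing timed_model.total_seconds with the total elapsed time would show how much of each run is spent outside the device.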

 

 

If you have any further questions, please don't hesitate to contact me.

Thank you.

Wan_Intel
Moderator
6,010 Views

Hi Pablo_BR,

Thank you for sharing the information with us.


We will further investigate the issue and provide an update here as soon as possible.



Regards,

Wan


Wan_Intel
Moderator
5,974 Views

Hi Pablo_BR,

Thank you for your patience.

 

I have downloaded the efficientnet_b0 model and converted it into Intermediate Representation with the following code:

import torchvision
import torch
import openvino as ov

model = torchvision.models.efficientnet_b0(weights='DEFAULT')
ov_model = ov.convert_model(model)
ov.save_model(ov_model, 'model.xml')

 

However, I encountered an error when running inference with the NPU plugin. The code snippet worked fine with the CPU and GPU plugins.

efficientnet_b0.png

 

Could you please share the following model files with us so that we can further investigate the issue? For example, you may share them with us via Google Drive.

  • efficientnet_b0_32_static.xml
  • efficientnet_b0_32_static.bin
  • efficientnet_b0_32_quantized.xml
  • efficientnet_b0_32_quantized.bin

 

 

Regards,

Wan

 

Pablo_BR
Novice
5,917 Views

Hello, happy to help.


I use this code to create the models.
Please run it to generate the model files.

 

import torch
import torchvision.models as models
import openvino as ov
from pathlib import Path
from nncf import compress_weights

# Configuration
MODEL_NAME = "efficientnet_b0"
MODEL_DIR = Path("Models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# get default weights using available weights Enum for model
weights = models.EfficientNet_B0_Weights.DEFAULT

# Load the pre-trained PyTorch model
model = models.efficientnet_b0(weights=weights)
model.eval()

# Convert to OpenVINO IR (Intermediate Representation Format) format with static input shape
# This format, which consists of an XML file for the network topology and a BIN file for the weights and biases,
# is highly optimized for efficient inference on Intel hardware
batch_size=32
ov_model = ov.convert_model(model, input=[[batch_size, 3, 224, 224]])

# Save model IR
ov.save_model(ov_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_static.xml")

print(f"IR model saved in: {MODEL_DIR}")

# Quantization is a very effective technique for accelerating inference on low-precision hardware like NPUs.
# If your model allows it, quantizing the model to INT8 can result in significant performance gains with minimal loss of accuracy.
quantized_model = compress_weights(ov_model)

# Save the quantized model
ov.save_model(quantized_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_quantized.xml")

print(f"Quantized model saved in: {MODEL_DIR}")
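
A note on the last step of the script above: to the best of this editor's knowledge, nncf.compress_weights performs weight-only compression (8-bit by default) and leaves activations in floating point, which would fit the nearly identical timings of the "static" and "quantized" models reported earlier. Full INT8 post-training quantization in NNCF goes through nncf.quantize with a calibration dataset. The following is a minimal sketch under stated assumptions: transform_fn and quantize_int8 are hypothetical helper names, and the calibration loader is assumed to yield (images, labels) pairs whose batch size matches the static input shape of the model.

```python
def transform_fn(data_item):
    """Map one (images, labels) DataLoader item to the model input array."""
    images, _labels = data_item
    return images.numpy()


def quantize_int8(ov_model, calibration_loader, subset_size=300):
    """Sketch of full INT8 post-training quantization with NNCF.

    Assumes ov_model is an openvino.Model and calibration_loader yields
    (images, labels) batches, e.g. the val_loader used for validation.
    """
    import nncf  # imported lazily; requires the nncf package

    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
    return nncf.quantize(ov_model, calibration_dataset, subset_size=subset_size)


# Intended use (names taken from the scripts in this thread):
#   int8_model = quantize_int8(ov_model, val_loader)
#   ov.save_model(int8_model, MODEL_DIR / f"{MODEL_NAME}_{batch_size}_int8.xml")
```

Whether this changes the NPU timings in this particular setup has not been verified here.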

 

If you have any further questions, please don't hesitate to contact me.

 

Thank you very much.

Wan_Intel
Moderator
5,793 Views

Hi Pablo_BR,

Thank you for sharing the code to generate the model files with us.


We will further investigate the issue, and we will provide an update here as soon as possible.



Regards,

Wan


Wan_Intel
Moderator
5,707 Views

Hi Pablo_BR,

Thank you for your patience.

 

I have downloaded the efficientnet_b0 model and converted it into static and quantized Intermediate Representation. I ran both models on the CPU, GPU, and NPU plugins with your Python code, and I encountered a similar issue:

 

CPU plugin:

cpu result.png

 

GPU plugin:

gpu result.png

 

NPU plugin:

npu result.png

 

I will escalate the case to the relevant team, and we will provide an update here as soon as possible.

 

 

Regards,

Wan

 

 

Wan_Intel
Moderator
4,625 Views

Hi Pablo_BR,

Thank you for your patience. We have received feedback from the relevant team.

 

Batching is in an experimental state for the Intel® NPU plugin and is not recommended, especially in performance tests. Please decrease the batch size to 1 on the Intel® NPU plugin and try different performance_hint modes (tput or latency) and different numbers of async infer requests in tput mode.


Alternatively, you may try setting the batch size to 2, but there is no point in running a batch size larger than 2 because of the HW resource limitations and the way the batching algorithm uses those resources.


Every topology might have a different performance ratio in different configurations, mostly because of Intel® NPU architecture and optimization applied.
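
The suggestion above (batch size 1, a performance hint, and several async infer requests) could be tried along the following lines. This is only a sketch, not a verified fix: it assumes the IR was re-exported with a batch-1 static shape, for example with ov.convert_model(model, input=[[1, 3, 224, 224]]). The file path and the helper names split_batch and run_async are hypothetical.

```python
def split_batch(batch):
    """Yield batch-1 slices of an array-like whose first axis is the batch."""
    for i in range(len(batch)):
        yield batch[i:i + 1]


def run_async(model_path, batches, device="NPU", hint="THROUGHPUT", jobs=0):
    """Run batch-1 inference through an async queue and return outputs in order.

    hint is "THROUGHPUT" or "LATENCY"; jobs=0 lets OpenVINO choose the
    number of infer requests, any other value sets it explicitly.
    """
    import openvino as ov  # imported lazily; requires the openvino package

    core = ov.Core()
    compiled = core.compile_model(model_path, device, {"PERFORMANCE_HINT": hint})

    results = {}

    def on_done(request, index):
        # Copy the output because the request's buffers are reused.
        results[index] = request.get_output_tensor(0).data.copy()

    queue = ov.AsyncInferQueue(compiled, jobs)
    queue.set_callback(on_done)

    index = 0
    for batch in batches:
        for sample in split_batch(batch):
            queue.start_async({0: sample}, userdata=index)
            index += 1
    queue.wait_all()
    return [results[i] for i in range(index)]


# Intended use with the validation loader from this thread (hypothetical path):
#   outputs = run_async("Models/efficientnet_b0_1_static.xml",
#                       (images.numpy() for images, _ in val_loader))
```

Timing the same call with hint="LATENCY" and with a few explicit values of jobs gives the comparison the reply asks for.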

 

 

Regards,

Wan


Wan_Intel
Moderator
3,456 Views

Hi Pablo_BR,

Thank you for your question.


If you need additional information from Intel, please submit a new question as this thread will no longer be monitored.



Regards,

Wan

