Issue
When I compile the exact same model for different devices, I am seeing what I believe are non-trivial differences between the CPU, GPU, and NPU outputs. Operations executed in OpenVINO produce values that differ by up to 1e-4.
Reproducible Steps and Results
Here is some sample code I've run:
import torch
import torch.nn as nn
import torch.nn.functional as F
import openvino as ov
import numpy as np

class Function(nn.Module):
    def __init__(self):
        super(Function, self).__init__()

    def forward(self, x):
        y = F.softmax(x, dim=1)  # replace with any function
        return y

if __name__ == "__main__":
    torch.set_printoptions(precision=8)
    model = Function()
    input_tensor = torch.randn(1, 5)

    # Native inference in PyTorch
    print("Input \t", input_tensor, "\n")
    output = model(input_tensor).numpy()
    print("Native Output -\t", output, "\n")

    # Using OpenVINO conversion
    converted_model = ov.convert_model(model,
                                       input=("x", ov.Shape([1, 5])),
                                       example_input=input_tensor)
    compiled_model_cpu = ov.compile_model(converted_model, device_name="CPU")
    compiled_model_gpu = ov.compile_model(converted_model, device_name="GPU")
    compiled_model_npu = ov.compile_model(converted_model, device_name="NPU")

    # Inference on each device
    output_cpu = compiled_model_cpu(input_tensor)[0]
    output_gpu = compiled_model_gpu(input_tensor)[0]
    output_npu = compiled_model_npu(input_tensor)[0]
    print("CPU Output -\t", output_cpu)
    print("GPU Output -\t", output_gpu)
    print("NPU Output -\t", output_npu, "\n")

    print("Error CPU vs. Native - \t", np.mean(np.abs(output - output_cpu)))
    print("Error GPU vs. Native - \t", np.mean(np.abs(output - output_gpu)))
    print("Error NPU vs. Native - \t", np.mean(np.abs(output - output_npu)))
In this file, I've defined a very simple PyTorch model that executes a single operation, softmax. I run inference natively in PyTorch, then convert the model with OpenVINO and compile it for CPU, GPU, and NPU. For each OpenVINO model, I compute the error relative to the native output as the absolute difference per element, averaged over all elements.
Here is an example output:
Input tensor([[-0.08171148, 1.05321479, 2.15006995, -0.07086628, -2.35258102]])
Native Output - [[0.06876861 0.2139353 0.6406792 0.06951848 0.00709846]]
CPU Output - [[0.0687686 0.2139353 0.6406791 0.06951846 0.00709846]]
GPU Output - [[0.06872559 0.21386719 0.640625 0.06958008 0.00709915]]
NPU Output - [[0.06872559 0.21374512 0.640625 0.06951904 0.00708771]]
Error CPU vs. Native - 1.6577541e-08
Error GPU vs. Native - 4.552165e-05
Error NPU vs. Native - 5.9740152e-05
I believe 1e-8 is within epsilon, but 1e-5 is a pretty significant difference.
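For a rough point of comparison, the machine epsilons of the two float formats involved can be printed directly with NumPy (a minimal sketch, just for reference alongside the errors above):

import numpy as np

# FP32 eps is roughly 1.19e-07, FP16 eps is roughly 9.77e-04,
# which brackets the ~1e-8 (CPU) and ~5e-5 (GPU/NPU) errors measured above.
print("FP32 eps:", np.finfo(np.float32).eps)
print("FP16 eps:", np.finfo(np.float16).eps)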
I've also replaced the softmax with ReLU, sigmoid, and GeLU, and tested different dtypes (f32, f16, i8, u8) as well. I cross-tested the different combinations, and this is what I found:
ReLU, sigmoid, and GeLU were tested on 10000-dimensional vectors, while softmax was tested on a 5-dimensional vector (or something similar).
Hardware
Intel(R) Core(TM) Ultra 9 185H / Meteor Lake
OS
Ubuntu 22.04 LTS
Kernel version 6.9.3
Drivers
GPU Driver: Intel Compute Runtime 24.17.29377.6
NPU Driver: Linux NPU Driver v1.2.0
Level Zero: 1.16.15
I'm using OpenVINO 2024.1 as well.
Hi ayf7,
Thanks for reaching out to us.
We will further investigate the issue and update you as soon as possible.
Regards,
Wan
Dear ayf7,
Thanks for your patience.
We've received feedback from our developer.
Our developer has responded that the Intel® NPU does not have native FP32 execution support. Therefore, all FP32 models are executed in FP16 on the NPU, and the accuracy difference is expected behavior in this case.
On the other hand, since OpenVINO™ relies on OpenCL kernels for its GPU implementation, the GPU plugin prefers FP16 inference precision over FP32 by default.
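If you need results closer to the native FP32 numbers on the GPU, you can explicitly request FP32 inference precision when compiling. A minimal sketch, reusing the converted_model and input_tensor from your script (the NPU plugin may still reject FP32, as noted above, and FP32 execution on GPU is typically slower):

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()

# Ask the GPU plugin to run in FP32 instead of its default FP16 precision.
compiled_gpu_fp32 = core.compile_model(
    converted_model,
    "GPU",
    {hints.inference_precision: ov.Type.f32},
)

output_gpu_fp32 = compiled_gpu_fp32(input_tensor)[0]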
For more information, please refer to the following links:
- https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html#npu-device
- https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html
Sorry for the inconvenience and thank you for your support.
Regards,
Wan
Hi ayf7,
Thanks for your question.
If you need additional information from Intel, please submit a new question as this thread will no longer be monitored.
Regards,
Wan