Intel® Distribution of OpenVINO™ Toolkit

Why is processing faster when input/output is float16?

WilsonChen0723
Beginner

Could you please tell me why processing is faster when the input/output precision is float16, or how to configure uint8 for faster speed?
We are currently working on a number of models, and all of them run faster when benchmarked with the input/output precision (-ip/-op) set to F16.

Attached as an example is a simple model that contains only a conv2d.
With this non-quantized float32 model, U8 input/output is more than 4 times slower than F16.

example cmd: benchmark_app.exe -m model_conv2d_1080x1920_pad_fp32.xml -nireq 1 -niter 100 -d NPU -ip U8 -op U8
-ip/-op        U8       F16      F32
Median (ms)    78.38    17.00    27.25
Average (ms)   78.24    17.23    27.27
Min (ms)       73.73    15.66    24.08
Max (ms)       82.73    32.77    44.18
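
For reference, the comparison can also be reproduced outside benchmark_app. The following minimal Python sketch (my addition, not from the original post) mimics the -ip/-op flags by setting the model's tensor element types through OpenVINO's PrePostProcessor and timing inference for each precision; it assumes a recent OpenVINO Python API, that the NPU device is available, and it reuses the model file name from the command above.

import time
import numpy as np
import openvino as ov
from openvino.preprocess import PrePostProcessor

core = ov.Core()
NP_DTYPES = {"u8": np.uint8, "f16": np.float16, "f32": np.float32}

def median_latency_ms(io_type, n_iter=100):
    # Rebuild the model with the requested input/output element type,
    # which is roughly what benchmark_app's -ip/-op flags do.
    model = core.read_model("model_conv2d_1080x1920_pad_fp32.xml")
    ppp = PrePostProcessor(model)
    ppp.input().tensor().set_element_type(io_type)
    ppp.output().tensor().set_element_type(io_type)
    compiled = core.compile_model(ppp.build(), "NPU")

    request = compiled.create_infer_request()
    shape = tuple(compiled.input(0).shape)
    data = np.zeros(shape, dtype=NP_DTYPES[io_type.get_type_name()])
    latencies = []
    for _ in range(n_iter):
        start = time.perf_counter()
        request.infer({0: data})
        latencies.append((time.perf_counter() - start) * 1e3)
    return float(np.median(latencies))

for io_type in (ov.Type.u8, ov.Type.f16, ov.Type.f32):
    print(io_type.get_type_name(), f"{median_latency_ms(io_type):.2f} ms")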

F16 input/output is also the fastest with the int8-quantized model.

example cmd: benchmark_app.exe -m model_conv2d_1080x1920_pad_int8.xml -nireq 1 -niter 100 -d NPU -ip U8 -op U8
-ip/-op        U8       F16      F32
Median (ms)    22.25    11.80    16.91
Average (ms)   22.60    12.01    17.22
Min (ms)       20.73    10.12    15.62
Max (ms)       39.51    20.04    28.87

The first example used a non-quantized float32 model, but float16 input/output is the fastest even for the model quantized to int8.
The same trend holds for other models, such as add.
For a model with multiple layers, comparing the profiles produced with the "-report_type detailed_counters" option showed differences, especially in the first and last layers (e.g., FakeQuantize).
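
The same per-layer counters can also be collected programmatically. Below is a minimal sketch of my own (the "PERF_COUNT" config key and an f32 model input are assumptions here) that enables profiling at compile time and prints the timing of each executed node, including boundary layers such as FakeQuantize:

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model_conv2d_1080x1920_pad_int8.xml")
# "PERF_COUNT" asks the plugin to collect per-layer profiling counters.
compiled = core.compile_model(model, "NPU", {"PERF_COUNT": "YES"})

request = compiled.create_infer_request()
data = np.zeros(tuple(compiled.input(0).shape), dtype=np.float32)  # assumes an f32 input
request.infer({0: data})

# One entry per executed node: name, type, and measured execution time.
for info in request.profiling_info:
    print(f"{info.node_name:40s} {info.node_type:15s} {info.real_time}")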


Is the NPU internally optimized for float16?
Or is it possible to change the optimal input/output precision by configuration?
Since uint8 is used for NV12 and other image formats, I would like to know whether there is a setting that can achieve the same speed with uint8.
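
For context, one pattern worth trying (a sketch based on the standard preprocessing API, not a confirmed fix for this case) is to keep the application-facing tensor as u8 but move the cast into the model with the PrePostProcessor, so the u8-to-f16 conversion is compiled into the graph instead of being handled at the tensor boundary:

import openvino as ov
from openvino.preprocess import PrePostProcessor

core = ov.Core()
model = core.read_model("model_conv2d_1080x1920_pad_fp32.xml")

ppp = PrePostProcessor(model)
# The application keeps feeding u8 data ...
ppp.input().tensor().set_element_type(ov.Type.u8)
# ... but the u8 -> f16 conversion becomes a node inside the compiled graph.
ppp.input().preprocess().convert_element_type(ov.Type.f16)
model = ppp.build()

compiled = core.compile_model(model, "NPU")

Whether this recovers the F16 speed depends on how the NPU plugin compiles the inserted Convert node, which is essentially the question above.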

Wan_Intel
Moderator

Hi WilsonChen0723,

Thanks for reaching out to us.

 

For your information, I've run the Benchmark C++ Tool using the face-detection-adas-0001 model with FP16 and INT8 on the NPU plugin. I also observed that the FPS of the FP16 model is higher than that of the INT8 model, as shown in the attachments below:

 

FP16 model: 82 FPS (attachment: fp16 npu.jpg)

INT8 model: 62 FPS (attachment: int8 npu.jpg)

 

Let me check with the relevant team and we'll update you as soon as possible.

 

 

Regards,

Wan

 

Wan_Intel
Moderator

Hi WilsonChen0723,

Thanks for your patience.

 

For your information, I've run the Benchmark C++ Tool to infer your FP32 and INT8 models with the NPU plugin on an Ubuntu machine using the latest version of the OpenVINO toolkit. The FPS when inferencing the INT8 model is greater than with the FP32 model. Could you please infer the models with the latest version of the OpenVINO toolkit and see if the issue is resolved? A quick way to confirm the installed runtime version is shown below.
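
As a small sketch (assuming the OpenVINO Python package is installed alongside the C++ tools), the runtime version can be printed like this:

import openvino as ov

# Prints the OpenVINO runtime version string of the installed package.
print(ov.get_version())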

 

 

Regards,

Wan

 

Wan_Intel
Moderator

Hello WilsonChen0723,

Thanks for your question.

 

If you need any additional information from Intel, please submit a new question, as this thread will no longer be monitored.

 

 

Regards,

Wan

 
