Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Difficulty optimizing performance using Model and Inference Precision

rm-kozgun
Beginner
663 Views

I'm performing inference in C++ using OpenVINO 2023.3. I currently have an f32-precision model, and I compile it with f32 inference precision and ExecutionMode::PERFORMANCE. Using my GPU, I see a good performance boost: roughly a 60% runtime reduction over the CPU.

I'd like to further optimize runtime, so I've produced a comparable model using f16 precision. I've made three observations:

  • Using f16 model precision does not yield a runtime boost over f32 model precision, for either inference precision. (expected result)
  • Using f16 inference precision does not yield a runtime boost over f32 inference precision, for either model precision. (unexpected result)
  • Using f16 inference precision on the GPU yields incorrect results, though the same configuration runs accurately on the CPU. (unexpected result)

Am I implementing something wrong?

In my code, I'm adjusting these settings:
compiled_model = core_.compile_model(model, "CPU",
    ov::hint::execution_mode(ov::hint::ExecutionMode::ACCURACY),
    ov::hint::inference_precision(ov::element::f32));
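
For the GPU performance runs I swap those arguments; a minimal sketch of that variant (assuming core_ and model are the same objects as above, with only the device string and the two hints changed):

// GPU / f16 / PERFORMANCE combination being compared against the call above.
compiled_model = core_.compile_model(model, "GPU",
    ov::hint::execution_mode(ov::hint::ExecutionMode::PERFORMANCE),
    ov::hint::inference_precision(ov::element::f16));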

Thank you!

5 Replies
Aznie_Intel
Moderator
617 Views

Hi Rm-kozgun,

 

Thanks for reaching out. May I know how you tested the results, and can you share all of the results from your observations? For your information, the floating-point inference precisions supported on GPU are f32 and f16, and on CPU they are f32 and bf16.

 

If you are using a dynamic-shape model on GPU, significant performance drops are expected. I would suggest testing your model with the Benchmark App and comparing the output for every precision with both the CPU and GPU plugins.
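
For example, something along these lines lets you compare devices and inference precisions directly (the model path is a placeholder, and the -infer_precision option is available in recent benchmark_app versions):

benchmark_app -m model.xml -d CPU -infer_precision f32
benchmark_app -m model.xml -d GPU -infer_precision f16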

 

 

Regards,

Aznie

 

Rm-jstanton
Beginner
566 Views

Hi Aznie, 

 

Thanks for your reply. Our goal is to apply post-training quantization to a semantic segmentation model trained in TensorFlow at f32 model precision; ultimately, we want to convert the model precision to f16 and use f16 inference precision at runtime on a GPU. In our attempts to achieve this, we've been testing all combinations of f16, bf16, and f32 model and inference precisions. As Rm-kozgun described above, we are experiencing two main issues:

 

(1)  Runtime performance (in frames per second, fps) is not always improved with f16 inference precision (IP) compared to f32, regardless of model precision (MP) (see the System 2 CPU results below). We have also seen that performance is not always improved on the GPU versus the CPU (see the System 1 results below).

 

System 1 | F16 IP CPU | F32 IP CPU | F16 IP GPU               | F32 IP GPU
F16 MP   | 30.4 fps   | 27.4 fps   | 30.9 fps (poor accuracy) | 20.0 fps
F32 MP   | 31.9 fps   | 25.2 fps   | 31.1 fps (poor accuracy) | 20.1 fps

 

System 2 | F16 IP CPU | F32 IP CPU | F16 IP GPU               | F32 IP GPU
F16 MP   | 15.8 fps   | 16.9 fps   | 52.8 fps (poor accuracy) | 37.2 fps
F32 MP   | 16.8 fps   | 16.9 fps   | 48.7 fps (poor accuracy) | 35.3 fps

 


(2)  Model accuracy plummets with f16 inference precision on the GPU, regardless of model precision. As Rm-kozgun outlined in the third observation above, accuracy is severely impacted when we use f16 inference precision on the GPU with a model converted to f16 model precision. However, accuracy is maintained with f16 model precision when using f16 or f32 inference precision on the CPU, or f32 inference precision on the GPU. We see similar results with bf16 model precision (i.e., accuracy plummets with bf16 inference precision on the GPU, which perhaps is expected based on your comment above, but is maintained with bf16 inference precision on the CPU).

 

I've attached an image that details our exact workflow in Python: (1) convert the f32 Keras model to an f16 TF Lite model, (2) convert the f16 TF Lite model to an f16 OV model, (3) compile the OV model for inference, specifying GPU and f16 inference precision, and (4) run the inference step on example image data. In practice, we perform (1) and (2) in Python and (3) and (4) in C++ (a rough sketch of our C++ side of steps (3) and (4) is included below), but for the purposes of this QC we've kept it all in Python.
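
For reference, the C++ side of steps (3) and (4) looks roughly like this (a minimal sketch; the model path, input handling, and variable names are illustrative, not our exact code):

#include <openvino/openvino.hpp>

// (3) Read the converted f16 model and compile it for GPU with f16 inference precision.
ov::Core core;
std::shared_ptr<ov::Model> model = core.read_model("model_f16.xml");
ov::CompiledModel compiled_model = core.compile_model(model, "GPU",
    ov::hint::execution_mode(ov::hint::ExecutionMode::PERFORMANCE),
    ov::hint::inference_precision(ov::element::f16));

// (4) Run inference on example image data.
ov::InferRequest request = compiled_model.create_infer_request();
ov::Tensor input = request.get_input_tensor();
// ... fill the input tensor with a preprocessed example image ...
request.infer();
ov::Tensor output = request.get_output_tensor();  // segmentation output compared against the f32 baseline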

 

Please let me know if we can send any additional information that would be helpful for debugging. 

Aznie_Intel
Moderator
530 Views

Hi Rm-jstanton,

 

Are you able to share your model files (XML and BIN)? Post-training quantization can be done with NNCF; you can perform 8-bit quantization using mainly two flows:

 

Basic quantization (simple):

Requires only a representative calibration dataset.

 

Accuracy-aware Quantization (advanced):

Ensures the accuracy of the resulting model does not drop below a certain value. To do so, it requires both calibration and validation datasets, as well as a validation function to calculate the accuracy metric.

 

Can you reproduce the behavior with the OpenVINO benchmark_app using -d GPU?

 

 

Regards,

Aznie


Aznie_Intel
Moderator
423 Views

Hi RM-jstanton,

 

Do you still need help with this issue? Please share your model files for us to further check this.

 


Regards,

Aznie


Aznie_Intel
Moderator
213 Views

Hi RM-jstanton,


Thank you for your question. If you need any additional information from Intel, please submit a new question as this thread is no longer being monitored.



Regards,

Aznie

