Intel® Distribution of OpenVINO™ Toolkit

Performance Issue about quantization from FP32 to INT8

spartazhc_
Beginner
727 Views

I am trying to do quantization, referring to 302-pytorch-quantization-aware-training.ipynb, and I am testing throughput for now.
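For reference, my quantization roughly follows the NNCF flow from that notebook. Below is a minimal sketch assuming the nncf.torch API the notebook uses; the model, data loader, and input shape are placeholders, not my actual network:

# Minimal sketch of the NNCF quantization-aware-training flow from the 302 notebook.
# The model, data and input shape below are placeholders; substitute the real ones.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# Placeholder FP32 model and training data.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 224 * 224, 10))
train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                        torch.zeros(8, dtype=torch.long)),
                          batch_size=4)

# INT8 quantization config; sample_size must match the real input shape.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})
# Let NNCF use the training data to initialize quantization ranges.
nncf_config = register_default_init_args(nncf_config, train_loader)

# Insert fake-quantization ops; the wrapped model is then fine-tuned as usual (QAT).
compression_ctrl, model = create_compressed_model(model, nncf_config)
# ... short fine-tuning loop goes here ...

# Export to ONNX, then convert to IR with Model Optimizer (e.g. mo --input_model net_int8.onnx).
compression_ctrl.export_model("net_int8.onnx")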

It looks good on my server (Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz): about a 3.35x speedup.

 

Benchmark FP32 model (IR)
Count: 385 iterations
Duration: 10012.54 ms
Latency:
Throughput: 45.04 FPS

Benchmark INT8 model (IR)
Count: 993 iterations
Duration: 10001.73 ms
Latency:
Throughput: 151.35 FPS

 

However, when I downloaded the models and ran them on my laptop (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), I got less than a 2x speedup.

benchmark_app.exe -m .\net_r2c32s_int8.xml -d CPU -api sync -t 20

...

Latency:
Median: 234.69 ms
AVG: 229.48 ms
MIN: 98.20 ms
MAX: 317.67 ms
Throughput: 4.26 FPS

benchmark_app.exe -m .\net_r2c32s_fp32.xml -d CPU -api sync -t 20

...
Latency:
Median: 372.04 ms
AVG: 326.20 ms
MIN: 154.71 ms
MAX: 677.04 ms
Throughput: 2.69 FPS

 

What is the reason for this performance gap?

 

BR,

Spartazhc.

5 Replies
Peh_Intel
Moderator
688 Views

Hi Spartazhc,


Thanks for reaching out to us.


Inferencing a model on different platforms (hardware) is the main reason for the difference in performance.


Quantizing an FP32 model into an INT8 model improves performance (higher FPS) on the same platform, but the speedup ratio is not expected to be the same across different platforms.


You can refer to the Intel® Distribution of OpenVINO™ toolkit Benchmark Results to observe the performance (throughput) on various platforms. You will notice that the speedup from inferencing an FP32 model to an INT8 model differs across platforms.
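As a side note, you can check what the CPU plugin reports as its supported optimization capabilities on each machine. Below is a minimal sketch assuming the 2021.x Inference Engine Python API; the exact list returned varies with the OpenVINO version and the hardware:

# Query the device name and the optimization capabilities the CPU plugin reports.
# Assumes the 2021.x Inference Engine Python API; output varies by hardware/version.
from openvino.inference_engine import IECore

ie = IECore()
print(ie.get_metric("CPU", "FULL_DEVICE_NAME"))
print(ie.get_metric("CPU", "OPTIMIZATION_CAPABILITIES"))
# e.g. ['WINOGRAD', 'FP32', 'FP16', 'INT8', 'BIN'] -- what is listed (and how fast
# INT8 actually runs) depends on the CPU's instruction set support.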



Regards,

Peh


spartazhc_
Beginner
678 Views

Thanks for your reply!

 

So I would like to be clear: the reason for the different speedup ratio is just platform-level optimization, and not that I did the quantization on a Xeon but benchmarked it on a Core CPU?

 

BR,

Spartazhc

Peh_Intel
Moderator
662 Views

Hi Spartazhc,


Yes, you are correct.


For your information, when you download Intel's Pre-Trained Models without specifying a precision, you get models in three precisions: FP32, FP16 and INT8. Hence, an INT8 model is not limited to the specific platform that was used to quantize it; it is usable on any supported platform.



Regards,

Peh


Peh_Intel
Moderator
620 Views

Hi Spartazhc,


This thread will no longer be monitored since your question has been answered. If you need any additional information from Intel, please submit a new question.



Regards,

Peh

