Intel® Distribution of OpenVINO™ Toolkit

Performance Issue about quantization from FP32 to INT8

spartazhc_
Beginner
727 Views

I am trying to do quantization, referring to 302-pytorch-quantization-aware-training.ipynb, and I am testing throughput for now.
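For reference, my quantization roughly follows the NNCF flow from that notebook. Below is a minimal sketch assuming the nncf.torch API the notebook uses; the model, data loader, and input shape are placeholders, not my actual network:

# Minimal sketch of the NNCF quantization-aware-training flow from the 302 notebook.
# The model, data and input shape below are placeholders; substitute the real ones.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# Placeholder FP32 model and training data.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 224 * 224, 10))
train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                        torch.zeros(8, dtype=torch.long)),
                          batch_size=4)

# INT8 quantization config; sample_size must match the real input shape.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})
# Let NNCF use the training data to initialize quantization ranges.
nncf_config = register_default_init_args(nncf_config, train_loader)

# Insert fake-quantization ops; the wrapped model is then fine-tuned as usual (QAT).
compression_ctrl, model = create_compressed_model(model, nncf_config)
# ... short fine-tuning loop goes here ...

# Export to ONNX, then convert to IR with Model Optimizer (e.g. mo --input_model net_int8.onnx).
compression_ctrl.export_model("net_int8.onnx")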

It looks good on my server (Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz): about a 3.35x speedup.

 

Benchmark FP32 model (IR)
Count: 385 iterations
Duration: 10012.54 ms
Latency:
Throughput: 45.04 FPS

Benchmark INT8 model (IR)
Count: 993 iterations
Duration: 10001.73 ms
Latency:
Throughput: 151.35 FPS

 

However, when I downloaded the models and ran them on my laptop (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), I got less than a 2x speedup.

benchmark_app.exe -m .\net_r2c32s_int8.xml -d CPU -api sync -t 20

...

Latency:
Median: 234.69 ms
AVG: 229.48 ms
MIN: 98.20 ms
MAX: 317.67 ms
Throughput: 4.26 FPS

benchmark_app.exe -m .\net_r2c32s_fp32.xml -d CPU -api sync -t 20

...
Latency:
Median: 372.04 ms
AVG: 326.20 ms
MIN: 154.71 ms
MAX: 677.04 ms
Throughput: 2.69 FPS

 

What is the reason for this performance gap?

 

BR,

Spartazhc.

5 Replies
Peh_Intel
Moderator
688 Views

Hi Spartazhc,


Thanks for reaching out to us.


Inferencing a model on different platforms (hardware) is the main reason for the difference in performance.


Quantizing an FP32 model into an INT8 model improves performance (higher FPS) on the same platform, but the speedup ratio is not expected to be the same across different platforms.


You can refer to the Intel® Distribution of OpenVINO™ toolkit Benchmark Results to observe the performance (throughput) on various platforms. You will notice that the speedup from inferencing an FP32 model to an INT8 model differs across platforms.
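As a side note, you can check what the CPU plugin reports as its supported optimization capabilities on each machine. Below is a minimal sketch assuming the 2021.x Inference Engine Python API; the exact list returned varies with the OpenVINO version and the hardware:

# Query the device name and the optimization capabilities the CPU plugin reports.
# Assumes the 2021.x Inference Engine Python API; output varies by hardware/version.
from openvino.inference_engine import IECore

ie = IECore()
print(ie.get_metric("CPU", "FULL_DEVICE_NAME"))
print(ie.get_metric("CPU", "OPTIMIZATION_CAPABILITIES"))
# e.g. ['WINOGRAD', 'FP32', 'FP16', 'INT8', 'BIN'] -- what is listed (and how fast
# INT8 actually runs) depends on the CPU's instruction set support.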



Regards,

Peh


spartazhc_
Beginner
678 Views

Thanks for your reply!

 

So I would like to be clear: the reason for the different speedup ratio is just platform-level optimization, and not that I did the quantization on a Xeon but benchmarked it on a Core CPU?

 

BR,

Spartazhc

Peh_Intel
Moderator
662 Views

Hi Spartazhc,


Yes, you are correct.


For your information, when you download Intel's Pre-Trained Models without specifying a precision, you get models in three precisions: FP32, FP16 and INT8. Hence, an INT8 model is not limited to the specific platform that was used to quantize it; it is usable on any supported platform.



Regards,

Peh


Peh_Intel
Moderator
620 Views

Hi Spartazhc,


This thread will no longer be monitored since your question has been answered. If you need any additional information from Intel, please submit a new question.



Regards,

Peh

