I am trying to do quantization referring to 302-pytorch-quantization-aware-training.ipynb and am testing throughput for now.
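For context, my workflow roughly follows the notebook; this is only a minimal sketch, assuming NNCF's PyTorch QAT API, where model, train_loader, and the output file name are placeholders for my own:

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# Quantization config for QAT (the input shape here is illustrative)
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})
# Use the training data loader to initialize quantization ranges
nncf_config = register_default_init_args(nncf_config, train_loader)

# Wrap the FP32 model with fake-quantization operations
compression_ctrl, quantized_model = create_compressed_model(model, nncf_config)

# ... fine-tune quantized_model for a few epochs (quantization-aware training) ...

# Export to ONNX; the ONNX file is then converted to IR with Model Optimizer
compression_ctrl.export_model("net_r2c32s_int8.onnx")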
It looks good on my server (Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz): about a 3.35x speed-up.
Benchmark FP32 model (IR)
Count: 385 iterations
Duration: 10012.54 ms
Latency:
Throughput: 45.04 FPS
Benchmark INT8 model (IR)
Count: 993 iterations
Duration: 10001.73 ms
Latency:
Throughput: 151.35 FPS
However, when I downloaded the model and ran it on my laptop (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz), less than a 2x speed-up was achieved.
benchmark_app.exe -m .\net_r2c32s_int8.xml -d CPU -api sync -t 20
...
Latency:
Median: 234.69 ms
AVG: 229.48 ms
MIN: 98.20 ms
MAX: 317.67 ms
Throughput: 4.26 FPS
benchmark_app.exe -m .\net_r2c32s_fp32.xml -d CPU -api sync -t 20
...
Latency:
Median: 372.04 ms
AVG: 326.20 ms
MIN: 154.71 ms
MAX: 677.04 ms
Throughput: 2.69 FPS
What is the reason for this performance difference?
BR,
Spartazhc.
Hi Spartazhc,
Thanks for reaching out to us.
Running inference on different platforms (hardware) is the main reason for the difference in performance.
Quantizing an FP32 model into an INT8 model improves performance (higher FPS) on the same platform, but the speed-up ratio is not expected to be the same across different platforms.
You can refer to the Intel® Distribution of OpenVINO™ toolkit Benchmark Results to observe the performance (throughput) of various platforms. You will notice that the speed-up from inferencing an FP32 model versus an INT8 model differs from platform to platform.
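As a quick way to see what the CPU plugin reports on a given machine, you can query its optimization capabilities. This is only a small sketch, assuming the 2021-era OpenVINO Python API (IECore); the exact list returned depends on the hardware and release:

from openvino.inference_engine import IECore

ie = IECore()
# Lists the data types the CPU plugin can accelerate on this machine,
# e.g. something like ['FP32', 'INT8', 'BIN', ...] depending on the CPU
print(ie.get_metric("CPU", "OPTIMIZATION_CAPABILITIES"))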
Regards,
Peh
Thanks for your reply!
So I would like to make sure I understand: the reason for the different speed-up ratio is just the platform's hardware optimization, not the fact that I did the quantization on a Xeon but benchmarked it on a Core CPU?
BR,
Spartazhc
This may help explain it: AVX-512 VNNI, for example. The Xeon Platinum 8280M supports AVX-512 VNNI (Intel DL Boost), which accelerates INT8 inference, whereas the Core i7-8565U does not.
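If you want to check a machine quickly, one option is the third-party py-cpuinfo package; this is only a rough sketch, and the flag naming can vary by OS (on Linux you could equally inspect /proc/cpuinfo):

# pip install py-cpuinfo
import cpuinfo

flags = cpuinfo.get_cpu_info().get("flags", [])
# Cascade Lake Xeons report this flag; the i7-8565U will not
print("avx512_vnni" in flags)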
Hi Spartazhc,
Yes, you are correct.
For your information, when downloading Intel's Pre-Trained Models without specifying a precision, you will get models in three precisions: FP32, FP16, and INT8. Hence, an INT8 model is not limited to the specific platform that was used to quantize it; it is usable on any supported platform.
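Loading the same INT8 IR works the same way on any supported device; a minimal sketch with the 2021-era Python API, reusing the file names from your benchmark commands:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="net_r2c32s_int8.xml", weights="net_r2c32s_int8.bin")
# The same IR can be loaded on any supported device, e.g. "CPU" or "GPU"
exec_net = ie.load_network(network=net, device_name="CPU")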
Regards,
Peh
Hi Spartazhc,
This thread will no longer be monitored since your question has been answered. If you need any additional information from Intel, please submit a new question.
Regards,
Peh