<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Int8 quantized model slower than unquantized one in Intel® Distribution of OpenVINO™ Toolkit</title>
    <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1210157#M20688</link>
    <description>&lt;P&gt;Hi Alexey,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for reaching out.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tested your XML file for both the quantized and the unquantized model, and I am getting the same result as you.&lt;/P&gt;&lt;P&gt;OpenVINO quantization performance depends on the specific libraries and devices involved. The slowdown is most likely caused by layers in your model that are not supported in 8-bit integer computation mode; those layers fall back to floating point, and the extra precision conversions around them can outweigh the INT8 speedup.&lt;/P&gt;&lt;P&gt;You can refer here for more details: &lt;A href="https://github.com/intel/webml-polyfill/issues/1239" rel="noopener noreferrer" target="_blank"&gt;https://github.com/intel/webml-polyfill/issues/1239&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, please check the topologies that have been validated for the 8-bit inference feature &lt;A href="https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_Int8Inference.html" rel="noopener noreferrer" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Aznie&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Thu, 17 Sep 2020 11:07:24 GMT</pubDate>
    <dc:creator>IntelSupport</dc:creator>
    <dc:date>2020-09-17T11:07:24Z</dc:date>
    <item>
      <title>Int8 quantized model slower than unquantized one</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1209808#M20665</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;
&lt;P&gt;I'm trying to quantize the FaceMesh model with the POT tool, using the following config (based on the default config example):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;{
    /* Model parameters */

    "model": {
        "model_name": "facemesh", // Model name
        "model": "./facemesh.xml", // Path to model (.xml format)
        "weights": "./facemesh.bin" // Path to weights (.bin format)
    },

    /* Parameters of the engine used for model inference */
    "engine": {
        /* Simplified mode */
        "type": "simplified", 
        "data_source": "./data" 
    },

    /* Optimization hyperparameters */
    "compression": {
        "target_device": "CPU", 
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "preset": "performance",
                    "stat_subset_size": 300,
                    "shuffle_data": false
                }
            }
        ]
    }
}&lt;/LI-CODE&gt;
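&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For completeness, the quantization run itself is just the standard POT CLI invocation; the sketch below assumes the config above is saved as ./facemesh_int8.json:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Assumption: the JSON config above is saved as ./facemesh_int8.json.
# Runs DefaultQuantization and writes the INT8 IR under ./results.
pot -c ./facemesh_int8.json --output-dir ./results&lt;/LI-CODE&gt;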
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The quantized model becomes ~4 times smaller, but its inference time increases by ~37%.&lt;/P&gt;
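&lt;P&gt;Both logs below come from benchmark_app. The exact command line isn't shown here, but an illustrative invocation (identical for both runs except for the .xml path; -t 60 matches the 60000 ms duration in the logs, and the script path depends on where OpenVINO is installed) would be:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Illustrative command; paths depend on the OpenVINO install location.
python3 /opt/intel/openvino_2020.4.287/deployment_tools/tools/benchmark_tool/benchmark_app.py \
    -m ./facemesh.xml -d CPU -t 60&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;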
&lt;P&gt;Unquantized model benchmark log:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;[Step 1/11] Parsing and validating input arguments
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/main.py:29: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(" -nstreams default value is determined automatically for a device. "
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.4.0-359-21e092122f4-releases/2020/4
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.4.0-359-21e092122f4-releases/2020/4

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance,but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading the Intermediate Representation network
[ INFO ] Read network took 31.38 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 199.60 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image' precision U8, dimensions (NCHW): 1 3 192 192
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/utils/inputs_filling.py:71: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn("No input files were given: all inputs will be filled with random values!")
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Infer Request 0 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 1 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 2 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 3 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
[Step 11/11] Dumping statistics report
Count:      64424 iterations
Duration:   60006.06 ms
Latency:    3.60 ms
Throughput: 1073.62 FPS&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Quantized model benchmark log:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;[Step 1/11] Parsing and validating input arguments
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/main.py:29: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(" -nstreams default value is determined automatically for a device. "
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.4.0-359-21e092122f4-releases/2020/4
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.4.0-359-21e092122f4-releases/2020/4

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance,but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading the Intermediate Representation network
[ INFO ] Read network took 67.49 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 294.29 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image' precision U8, dimensions (NCHW): 1 3 192 192
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/utils/inputs_filling.py:71: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn("No input files were given: all inputs will be filled with random values!")
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Infer Request 0 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 1 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 2 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 3 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
[Step 11/11] Dumping statistics report
Count:      48160 iterations
Duration:   60007.22 ms
Latency:    4.93 ms
Throughput: 802.57 FPS&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Could you please check whether this is the expected result for such a model?&lt;/P&gt;
&lt;P&gt;BR,&lt;BR /&gt;Alexey.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Sep 2020 08:27:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1209808#M20665</guid>
      <dc:creator>a99user</dc:creator>
      <dc:date>2020-09-16T08:27:21Z</dc:date>
    </item>
    <item>
      <title>Re: Int8 quantized model slower than unquantized one</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1209863#M20669</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;
&lt;P&gt;I'm having the same issue with exactly the same config file.&lt;/P&gt;
&lt;P&gt;Waiting for an answer from Intel.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Sep 2020 14:18:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1209863#M20669</guid>
      <dc:creator>isomov</dc:creator>
      <dc:date>2020-09-16T14:18:26Z</dc:date>
    </item>
    <item>
      <title>Re: Int8 quantized model slower than unquantized one</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1210157#M20688</link>
      <description>&lt;P&gt;Hi Alexey,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for reaching out.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tested your XML file for both the quantized and the unquantized model, and I am getting the same result as you.&lt;/P&gt;&lt;P&gt;OpenVINO quantization performance depends on the specific libraries and devices involved. The slowdown is most likely caused by layers in your model that are not supported in 8-bit integer computation mode; those layers fall back to floating point, and the extra precision conversions around them can outweigh the INT8 speedup.&lt;/P&gt;&lt;P&gt;You can refer here for more details: &lt;A href="https://github.com/intel/webml-polyfill/issues/1239" rel="noopener noreferrer" target="_blank"&gt;https://github.com/intel/webml-polyfill/issues/1239&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, please check the topologies that have been validated for the 8-bit inference feature &lt;A href="https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_Int8Inference.html" rel="noopener noreferrer" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;
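&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To check which layers actually execute in INT8 on your machine, you can dump per-layer performance counters with benchmark_app. The command below is an illustrative sketch (the script path depends on your install); the execType reported for each layer shows the precision it really ran in, e.g. an I8 suffix for 8-bit kernels versus FP32 for fallback layers:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;# Illustrative command, assuming a default 2020.4 install layout and
# that the quantized IR is saved as ./facemesh_quantized.xml.
# -pc prints per-layer performance counters, including execType.
python3 /opt/intel/openvino_2020.4.287/deployment_tools/tools/benchmark_tool/benchmark_app.py \
    -m ./facemesh_quantized.xml -d CPU -pc&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Aznie&lt;/P&gt;&lt;BR /&gt;</description>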
      <pubDate>Thu, 17 Sep 2020 11:07:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1210157#M20688</guid>
      <dc:creator>IntelSupport</dc:creator>
      <dc:date>2020-09-17T11:07:24Z</dc:date>
    </item>
    <item>
      <title>Re: Int8 quantized model slower than unquantized one</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1211478#M20767</link>
      <description>&lt;P&gt;Hi Alexey,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;This thread will no longer be monitored since this issue has been resolved.&amp;nbsp;If you need any additional information from Intel, please submit a new question.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Best Regards,&lt;/P&gt;&lt;P&gt;Aznie&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 22 Sep 2020 13:17:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Int8-quantized-model-slower-than-unquantized-one/m-p/1211478#M20767</guid>
      <dc:creator>IntelSupport</dc:creator>
      <dc:date>2020-09-22T13:17:57Z</dc:date>
    </item>
  </channel>
</rss>

