Intel® Distribution of OpenVINO™ Toolkit
Community assistance with the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Int8 quantized model slower than unquantized one

a99user
Beginner

Hi!

I'm trying to quantize the FaceMesh model with the POT tool using the following config (based on the default config example):

{
    /* Model parameters */

    "model": {
        "model_name": "facemesh", // Model name
        "model": "./facemesh.xml", // Path to model (.xml format)
        "weights": "./facemesh.bin" // Path to weights (.bin format)
    },

    /* Parameters of the engine used for model inference */
    "engine": {
        /* Simplified mode */
        "type": "simplified", 
        "data_source": "./data" 
    },

    /* Optimization hyperparameters */
    "compression": {
        "target_device": "CPU", 
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "preset": "performance",
                    "stat_subset_size": 300,
                    "shuffle_data": false
                }
            }
        ]
    }
}
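
For reference, a config like this is passed to the POT command-line entry point roughly as follows (a minimal sketch; the config file name facemesh_int8.json is a placeholder):

pot -c facemesh_int8.json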

The quantized model is ~4× smaller, but its inference latency increases by ~37% (3.60 ms → 4.93 ms) and throughput drops from ~1074 to ~803 FPS.
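
Both logs below come from the Python benchmark tool shipped with OpenVINO 2020.4; an invocation along these lines (paths and flags are a sketch, not the exact command) reproduces the 60-second CPU runs shown:

python3 benchmark_app.py -m ./facemesh.xml -d CPU -t 60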

Unquantized model benchmark log:

[Step 1/11] Parsing and validating input arguments
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/main.py:29: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(" -nstreams default value is determined automatically for a device. "
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.4.0-359-21e092122f4-releases/2020/4
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.4.0-359-21e092122f4-releases/2020/4

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance,but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading the Intermediate Representation network
[ INFO ] Read network took 31.38 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 199.60 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image' precision U8, dimensions (NCHW): 1 3 192 192
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/utils/inputs_filling.py:71: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn("No input files were given: all inputs will be filled with random values!")
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Infer Request 0 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 1 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 2 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 3 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
[Step 11/11] Dumping statistics report
Count:      64424 iterations
Duration:   60006.06 ms
Latency:    3.60 ms
Throughput: 1073.62 FPS

Quantized model benchmark log:

[Step 1/11] Parsing and validating input arguments
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/main.py:29: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(" -nstreams default value is determined automatically for a device. "
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading Inference Engine
[ INFO ] InferenceEngine:
         API version............. 2.1.2020.4.0-359-21e092122f4-releases/2020/4
[ INFO ] Device info
         CPU
         MKLDNNPlugin............ version 2.1
         Build................... 2020.4.0-359-21e092122f4-releases/2020/4

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance,but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading the Intermediate Representation network
[ INFO ] Read network took 67.49 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 294.29 ms
[Step 8/11] Setting optimal runtime parameters
[Step 9/11] Creating infer requests and filling input blobs with images
[ INFO ] Network input 'image' precision U8, dimensions (NCHW): 1 3 192 192
/opt/intel/openvino_2020.4.287/python/python3.6/openvino/tools/benchmark/utils/inputs_filling.py:71: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn("No input files were given: all inputs will be filled with random values!")
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Infer Request 0 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 1 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 2 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[ INFO ] Infer Request 3 filling
[ INFO ] Fill input 'image' with random values (image is expected)
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
[Step 11/11] Dumping statistics report
Count:      48160 iterations
Duration:   60007.22 ms
Latency:    4.93 ms
Throughput: 802.57 FPS

Could you please check whether this is the expected result for this model?

BR,
Alexey.

3 Replies
isomov
Beginner

Hi!

I'm having the same issue with exactly the same config file.

Waiting for an answer from Intel.

IntelSupport
Community Manager

Hi Alexey,

Thanks for reaching out.

I tested your XML files, both quantized and unquantized, and I am getting the same result as you.

OpenVINO quantization performance depends on the specific libraries and target device. The slowdown is most likely caused by layers in your model that are not supported in 8-bit integer computation mode: they fall back to floating point, and the extra precision conversions around them can erase the INT8 speedup.
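
One way to confirm this is to inspect the per-layer performance counters after a run: the exec_type reported for each executed layer shows whether it ran with an INT8 kernel (exec types ending in _I8) or fell back to FP32. A minimal sketch with the Inference Engine Python API (model file names are placeholders):

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="facemesh_int8.xml", weights="facemesh_int8.bin")

# PERF_COUNT must be enabled for get_perf_counts() to return data
exec_net = ie.load_network(network=net, device_name="CPU",
                           config={"PERF_COUNT": "YES"})

# Run one inference with random data so the counters are populated
input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape
exec_net.infer({input_name: np.random.rand(*shape).astype(np.float32)})

# FP32 exec types on heavy layers indicate fallbacks that cost performance
for name, stats in exec_net.requests[0].get_perf_counts().items():
    print(name, stats["layer_type"], stats["exec_type"], stats["real_time"], "us")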

You can refer here for more details: https://github.com/intel/webml-polyfill/issues/1239

 

Please also check the topologies that have been validated for the 8-bit inference feature here.

 

Regards,

Aznie


IntelSupport
Community Manager

Hi Alexey,


This thread will no longer be monitored since this issue has been resolved. If you need any additional information from Intel, please submit a new question.


Best Regards,

Aznie

