Intel® Distribution of OpenVINO™ Toolkit

mixed precision quantization, but onnx size does not change...

timosy
New Contributor I

Because I'd like to get faster inference performance, I performed mixed precision quantization with INT8 + INT4 while referring to this web page: https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#mixed_precision_quantization

However, when I compare the size of the FP32 ONNX model and the mixed-precision ONNX model, both are the same... The following is the code I used for the mixed precision quantization. Did I make a mistake somewhere?

 

    # NNCF 2.x import paths (the PyTorch backend lives under nncf.torch)
    import torch
    from torch.utils.data import DataLoader

    from nncf import NNCFConfig
    from nncf.torch import create_compressed_model, register_default_init_args

    train_dataset = ....
    model = ....
    criterion = torch.nn.CrossEntropyLoss().to(device)

    # Export the original FP32 model for the size comparison
    dummy_input = torch.randn(1, 3, image_size, image_size).to(device)
    torch.onnx.export(model, dummy_input, str(outdir)+"model_fp32.onnx", opset_version=10)

    train_dataloader = DataLoader(train_dataset, batch_size=batch_size,
                  shuffle=True, num_workers=workers, pin_memory=True)

    nncf_config_mpq_dict = {
        "model": "network",
        "pretrained": 1,
        "input_info": {"sample_size": [1, 3, image_size, image_size]},
        "num_classes": classes,
        "batch_size": g_batch_size,
        "log_dir": str(outdir),
        "optimizer": {
            "base_lr": 3.1e-4,
            "schedule_type": "plateau",
            "type": "Adam",
            "schedule_params": {
                "threshold": 0.1,
                "cooldown": 3
            },
            "weight_decay": 1e-05
        },
        "compression": {
            "algorithm": "quantization",
            "initializer": {
                # HAWQ picks a per-layer bit width from the "bits" list
                "precision": {
                    "type": "hawq",
                    "bits": [4, 8],
                    #"bits": [4],
                    "compression_ratio": 1.5,
                }
            }
        }
    }
    nncf_config = NNCFConfig.from_dict(nncf_config_mpq_dict)
    # The HAWQ precision initializer needs a data loader and a loss criterion
    nncf_config = register_default_init_args(
        nncf_config, train_dataloader, criterion
    )
    # Create a quantized model from a pre-trained FP32 model and configuration object.
    compress_ctrl, compress_model = create_compressed_model(
        model, nncf_config
    )
    compress_ctrl.export_model(str(outdir)+"model_int8.onnx")

 

 

-rwxrwxrwx 1 user user 242031227 Aug 30 19:19 model_test/model_fp32.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 02:10 model_test/model_int8.onnx

 

Also, is it possible to confirm whether the model is actually 4-bit if I use "bits": [4]?

 

11 Replies
Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

Referring to this thread, the model size will decrease after converting the ONNX model into Intermediate Representation. Could you please convert your model into Intermediate Representation and see if that resolves your problem?
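For example, a minimal Python sketch of the conversion, assuming the OpenVINO 2022.x runtime API (the usual route is the Model Optimizer command line, e.g. mo --input_model model_int8.onnx; the file paths below are placeholders taken from your listing):

    from openvino.runtime import Core, serialize

    core = Core()
    # read_model accepts the ONNX file directly
    ov_model = core.read_model("model_test/model_int8.onnx")
    # serialize writes the IR pair: .xml (topology) + .bin (weights)
    serialize(ov_model, "model_test/model_int8.xml", "model_test/model_int8.bin")

The .bin file holds the weights, so that is where the size reduction from quantization should show up.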

 

 

Regards,

Wan


timosy
New Contributor I

Thanks for your comments. I changed the compression configuration as below, compressed the model, and converted it to IR.

 

        "optimizer": {
            "base_lr": 3.1e-4,
            "schedule_type": "plateau",
            "type": "Adam",
            "schedule_params": {
                "threshold": 0.1,
                "cooldown": 3
            },
            "weight_decay": 1e-05
        },
        "compression": {
            "algorithm": "quantization",
            "weights": {
                "mode": "asymmetric",
                #"per_channel": True,
                "bits": 4
            },
            "activations": {
                "mode": "asymmetric",
                #"per_channel": True,
                "bits": 4
            },
            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4,8],
                    #"bits": [4,4],
                    "compression_ratio": 2.0,
                }
            }
        }

 

and the files I got are

 

-rwxrwxrwx 1 user user 242031227 Aug 31 11:33 model_fp32.onnx
-rwxrwxrwx 1 user user 242035308 Aug 31 14:06 model_quant.int4.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 13:09 model_quant.int8.onnx
-rwxrwxrwx 1 user user 242035308 Aug 31 13:10 model_quant.mix48.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 15:18 model_quant.mix48_test2.onnx

-rwxrwxrwx 1 user user 225552 Aug 31 14:59 model_quant.int4.bin
-rwxrwxrwx 1 user user 451082 Aug 31 15:00 model_quant.int8.bin
-rwxrwxrwx 1 user user 225552 Aug 31 15:01 model_quant.mix48.bin
-rwxrwxrwx 1 user user 451082 Aug 31 15:25 model_quant.mix48_test2.bin

-rwxrwxrwx 1 user user 31642 Aug 31 14:59 model_quant.int4.xml
-rwxrwxrwx 1 user user 28488 Aug 31 15:00 model_quant.int8.xml
-rwxrwxrwx 1 user user 31644 Aug 31 15:01 model_quant.mix48.xml
-rwxrwxrwx 1 user user 28502 Aug 31 15:25 model_quant.mix48_test2.xml

 

It seems I got an INT4 model; however, the mixed mode seems to have failed.

It seems the automatic precision selection with "type": "hawq" does not work.

If I want to increase the INT4 precision, should I increase or decrease "compression_ratio"?

Or should I change the configuration of the optimizer part?

In addition, the inference time of the INT4 model above is similar to FP32 (not INT8), so I may have made a mistake somewhere... though I can confirm that the data type is certainly "i4".

 

            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4,8],
                    #"bits": [4,4],
                    "compression_ratio": 2.0,
                }
            }

 

 

        <layer id="10" name="97" type="Const" version="opset1">
            <data element_type="i4" shape="96, 3, 14, 14" offset="4" size="28224"/>
            <output>
                <port id="0" precision="I4">
                    <dim>96</dim>
                    <dim>3</dim>
                    <dim>14</dim>
                    <dim>14</dim>

 

 

 

Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

You can check inference performance with the Benchmark C++ Tool.
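If you prefer a quick check from Python, a rough latency measurement along the lines below can also work (a sketch only, assuming the OpenVINO 2022.x runtime; the benchmark_app tool gives more rigorous numbers, e.g. benchmark_app -m model_quant.int4.xml -d CPU):

    import time
    import numpy as np
    from openvino.runtime import Core

    core = Core()
    compiled = core.compile_model("model_quant.int4.xml", "CPU")
    request = compiled.create_infer_request()
    shape = tuple(compiled.input(0).shape)      # static shape, e.g. (1, 3, H, W)
    dummy = np.random.rand(*shape).astype(np.float32)

    for _ in range(10):                         # warm-up runs
        request.infer({0: dummy})

    n_iter = 100
    start = time.perf_counter()
    for _ in range(n_iter):
        request.infer({0: dummy})
    print(f"average latency: {(time.perf_counter() - start) / n_iter * 1000:.2f} ms")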


On the other hand, referring to HAWQ in Uniform Quantization with Fine-Tuning, you can lower the compression ratio to avoid a large accuracy drop.
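For example, the "compression" section could be relaxed roughly like this (a sketch only, reusing the fields from your config; the exact value has to be tuned):

    compression_section = {
        "algorithm": "quantization",
        "initializer": {
            "precision": {
                "type": "hawq",
                "bits": [4, 8],
                # lower than the 1.5 / 2.0 you tried, so HAWQ keeps more layers
                # at 8 bit and the accuracy drop should stay smaller
                "compression_ratio": 1.2,
            }
        }
    }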

 

Hope it helps.

 

 

Regards,

Wan


timosy
New Contributor I

The reason I got "I4" below is that I set "target_device": "TRIAL" in my config file.

This may be the reason why the INT4 model is as slow as the FP32 model.

<data element_type="i4" shape="96, 3, 14, 14" offset="4" size="28224"/> ...

<output> <port id="0" precision="I4">

 

However, even though I set "target_device": "CPU" and the INT4 parameters in the config,

the output IR model (converted from ONNX) is still

<data element_type="i8" ...

<port id="0" precision="I8">

 

I do not know why the precision is still I8 even though I set 4 bits...

It's difficult... Maybe my PC does not satisfy some requirement? According to the error I got in my terminal:

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: AlexNet/Sequential[features]/NNCFConv2d[0]/conv2d_0

Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

The following error may be due to quantizer parameter settings that are inconsistent with the constraints of the CPU configuration.

 

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: Alexnet/Sequential[features]/NNCFConv2d[0]

 

Could you please set your target device to “NONE” and see if it’s able to resolve your issue? You may refer to this GitHub thread for more information.

 

 

Regards,

Wan


timosy
New Contributor I

I appreciate your additional help, 

When I run it with "target_device": "NONE", I got the following error:

 

jsonschema.exceptions.ValidationError: 'NONE' is not one of ['ANY', 'CPU', 'GPU', 'VPU', 'TRIAL']. See documentation or /mnt/c/Users/221344/mywork/deep/openvino/venv_py39_ovino22.1/lib/python3.9/site-packages/nncf/config/schema.py for an NNCF configuration file JSON schema definition

 

With the option "target_device": "ANY", I got:

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: AlexNet/Sequential[features]/NNCFConv2d[0]/conv2d_0

 

So, to generate an INT4 model, do I have to change the CPU config file?

Wan_Intel
Moderator

Hi Timosy,

Thanks for sharing your information with us.

 

Referring to this thread, our developer mentioned that the CPU supports only INT8 quantization; therefore, the error you encountered is expected.


For a mixed precision configuration, as you have successfully converted your model before, you should specify the VPU or TRIAL device.
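For reference, the device constraint is a top-level field of the NNCF config, roughly like this (a sketch only, reusing the fields from your script; "TRIAL" is the value that already gave you i4 weights):

    nncf_config_mpq_dict = {
        "input_info": {"sample_size": [1, 3, image_size, image_size]},
        # lifts the CPU INT8-only constraint so 4-bit quantizers are allowed
        "target_device": "TRIAL",
        "compression": {
            "algorithm": "quantization",
            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4, 8],
                    "compression_ratio": 1.5,
                }
            }
        }
    }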

 

Perhaps you can open a feature request here so that our developers can provide a workaround for your issue. Hope it helps.

 

 

Regards,

Wan


timosy
New Contributor I

Thanks for the additional information.

The fact that mixed quantization is still a trial feature and the CPU is not yet supported is sad news for me, actually. I expected that I could use mixed quantization because I found that "mixed precision" is listed as supported for PyTorch on this page: https://docs.openvino.ai/latest/docs_nncf_introduction.html#neural-network-compression-framework

 

Anyway, I appreciate your information on the status of mixed precision!

Without your help, I would have spent a few days checking how to use it.

You saved me time!

 

Wan_Intel
Moderator

Hi Timosy,

Let us check with our engineering team, and we will update you once we've obtained feedback from them.

 

 

Regards,

Wan


Wan_Intel
Moderator

Hi Timosy,

Thanks for your patience.

 

We've got feedback from our development team. Currently, Mixed-Precision quantization is supported for VPU and iGPU, but it is not supported for CPU. Our development team has captured this feature in their product roadmap, but we cannot confirm the actual version releases.

 

Hope this clarifies.

 

 

Regards,

Wan


Wan_Intel
Moderator

Hi Timosy,

Thanks for your question.

This thread will no longer be monitored since we have provided information. 

If you need any additional information from Intel, please submit a new question.

 

 

Best regards,

Wan

