Intel® Distribution of OpenVINO™ Toolkit

mixed precision quantization, but onnx size does not change...

timosy
New Contributor I

Because I'd like to get faster inference performance, I performed mixed precision quantization with INT8 + INT4 while referring to this web page: https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#mixed_precision_quantization

However, when I compare the size of the FP32 ONNX model and the mixed-precision ONNX model, both are the same... The following is the code I used for the mixed precision quantization. Did I make a mistake somewhere?

 

    # NNCF 2.x import paths (the PyTorch backend lives under nncf.torch)
    import torch
    from torch.utils.data import DataLoader

    from nncf import NNCFConfig
    from nncf.torch import create_compressed_model, register_default_init_args

    train_dataset = ....
    model = ....
    criterion = torch.nn.CrossEntropyLoss().to(device)

    # Export the original FP32 model for the size comparison
    dummy_input = torch.randn(1, 3, image_size, image_size).to(device)
    torch.onnx.export(model, dummy_input, str(outdir)+"model_fp32.onnx", opset_version=10)

    train_dataloader = DataLoader(train_dataset, batch_size=batch_size,
                  shuffle=True, num_workers=workers, pin_memory=True)

    nncf_config_mpq_dict = {
        "model": "network",
        "pretrained": 1,
        "input_info": {"sample_size": [1, 3, image_size, image_size]},
        "num_classes": classes,
        "batch_size": g_batch_size,
        "log_dir": str(outdir),
        "optimizer": {
            "base_lr": 3.1e-4,
            "schedule_type": "plateau",
            "type": "Adam",
            "schedule_params": {
                "threshold": 0.1,
                "cooldown": 3
            },
            "weight_decay": 1e-05
        },
        "compression": {
            "algorithm": "quantization",
            "initializer": {
                # HAWQ picks a per-layer bit width from the "bits" list
                "precision": {
                    "type": "hawq",
                    "bits": [4, 8],
                    #"bits": [4],
                    "compression_ratio": 1.5,
                }
            }
        }
    }
    nncf_config = NNCFConfig.from_dict(nncf_config_mpq_dict)
    # The HAWQ precision initializer needs a data loader and a loss criterion
    nncf_config = register_default_init_args(
        nncf_config, train_dataloader, criterion
    )
    # Create a quantized model from a pre-trained FP32 model and configuration object.
    compress_ctrl, compress_model = create_compressed_model(
        model, nncf_config
    )
    compress_ctrl.export_model(str(outdir)+"model_int8.onnx")

 

 

-rwxrwxrwx 1 user user 242031227 Aug 30 19:19 model_test/model_fp32.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 02:10 model_test/model_int8.onnx

 

Also, is it possible to confirm whether the model is actually 4-bit if I use "bits": [4]?

 

11 Replies
Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

Referring to this thread, the model size will decrease after converting the ONNX model into Intermediate Representation. Could you please convert your model into Intermediate Representation and see if that resolves your problem?
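For example, a minimal Python sketch of the conversion, assuming the OpenVINO 2022.x runtime API (the usual route is the Model Optimizer command line, e.g. mo --input_model model_int8.onnx; the file paths below are placeholders taken from your listing):

    from openvino.runtime import Core, serialize

    core = Core()
    # read_model accepts the ONNX file directly
    ov_model = core.read_model("model_test/model_int8.onnx")
    # serialize writes the IR pair: .xml (topology) + .bin (weights)
    serialize(ov_model, "model_test/model_int8.xml", "model_test/model_int8.bin")

The .bin file holds the weights, so that is where the size reduction from quantization should show up.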

 

 

Regards,

Wan


timosy
New Contributor I

Thanks for your comments. I changed the compression configuration as below, compressed the model, and converted it to IR.

 

        "optimizer": {
            "base_lr": 3.1e-4,
            "schedule_type": "plateau",
            "type": "Adam",
            "schedule_params": {
                "threshold": 0.1,
                "cooldown": 3
            },
            "weight_decay": 1e-05
        },
        "compression": {
            "algorithm": "quantization",
            "weights": {
                "mode": "asymmetric",
                #"per_channel": True,
                "bits": 4
            },
            "activations": {
                "mode": "asymmetric",
                #"per_channel": True,
                "bits": 4
            },
            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4,8],
                    #"bits": [4,4],
                    "compression_ratio": 2.0,
                }
            }
        }

 

and the files I got are

 

-rwxrwxrwx 1 user user 242031227 Aug 31 11:33 model_fp32.onnx
-rwxrwxrwx 1 user user 242035308 Aug 31 14:06 model_quant.int4.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 13:09 model_quant.int8.onnx
-rwxrwxrwx 1 user user 242035308 Aug 31 13:10 model_quant.mix48.onnx
-rwxrwxrwx 1 user user 242116708 Aug 31 15:18 model_quant.mix48_test2.onnx

-rwxrwxrwx 1 user user 225552 Aug 31 14:59 model_quant.int4.bin
-rwxrwxrwx 1 user user 451082 Aug 31 15:00 model_quant.int8.bin
-rwxrwxrwx 1 user user 225552 Aug 31 15:01 model_quant.mix48.bin
-rwxrwxrwx 1 user user 451082 Aug 31 15:25 model_quant.mix48_test2.bin

-rwxrwxrwx 1 user user 31642 Aug 31 14:59 model_quant.int4.xml
-rwxrwxrwx 1 user user 28488 Aug 31 15:00 model_quant.int8.xml
-rwxrwxrwx 1 user user 31644 Aug 31 15:01 model_quant.mix48.xml
-rwxrwxrwx 1 user user 28502 Aug 31 15:25 model_quant.mix48_test2.xml

 

It seems I got an INT4 model; however, the mixed mode seems to have failed.

It seems the automatic precision selection with "type": "hawq" does not work.

If I want to increase the INT4 precision, should I increase or decrease "compression_ratio"?

Or should I change the configuration of the optimizer part?

In addition, the inference time of the INT4 model above is similar to FP32 (not INT8), so I may have made a mistake somewhere... though I can confirm that the data type is certainly "i4".

 

            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4,8],
                    #"bits": [4,4],
                    "compression_ratio": 2.0,
                }
            }

 

 

        <layer id="10" name="97" type="Const" version="opset1">
            <data element_type="i4" shape="96, 3, 14, 14" offset="4" size="28224"/>
            <output>
                <port id="0" precision="I4">
                    <dim>96</dim>
                    <dim>3</dim>
                    <dim>14</dim>
                    <dim>14</dim>

 

 

 

Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

You can check inference performance with the Benchmark C++ Tool.
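If you prefer a quick check from Python, a rough latency measurement along the lines below can also work (a sketch only, assuming the OpenVINO 2022.x runtime; the benchmark_app tool gives more rigorous numbers, e.g. benchmark_app -m model_quant.int4.xml -d CPU):

    import time
    import numpy as np
    from openvino.runtime import Core

    core = Core()
    compiled = core.compile_model("model_quant.int4.xml", "CPU")
    request = compiled.create_infer_request()
    shape = tuple(compiled.input(0).shape)      # static shape, e.g. (1, 3, H, W)
    dummy = np.random.rand(*shape).astype(np.float32)

    for _ in range(10):                         # warm-up runs
        request.infer({0: dummy})

    n_iter = 100
    start = time.perf_counter()
    for _ in range(n_iter):
        request.infer({0: dummy})
    print(f"average latency: {(time.perf_counter() - start) / n_iter * 1000:.2f} ms")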


On the other hand, referring to HAWQ in Uniform Quantization with Fine-Tuning, you can lower the compression ratio to avoid a large accuracy drop.
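For example, the "compression" section could be relaxed roughly like this (a sketch only, reusing the fields from your config; the exact value has to be tuned):

    compression_section = {
        "algorithm": "quantization",
        "initializer": {
            "precision": {
                "type": "hawq",
                "bits": [4, 8],
                # lower than the 1.5 / 2.0 you tried, so HAWQ keeps more layers
                # at 8 bit and the accuracy drop should stay smaller
                "compression_ratio": 1.2,
            }
        }
    }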

 

Hope it helps.

 

 

Regards,

Wan


timosy
New Contributor I

The reason I got "I4" below is that I set "target_device": "TRIAL" in my config file.

This may be the reason why the INT4 model is as slow as the FP32 model.

<data element_type="i4" shape="96, 3, 14, 14" offset="4" size="28224"/> ...

<output> <port id="0" precision="I4">

 

However, even though I set "target_device": "CPU" and the INT4 parameters in the config,

the output IR model (converted from ONNX) is still

<data element_type="i8" ...

<port id="0" precision="I8">

 

I do not know why the precision is still I8 even though I set 4 bits...

It's difficult... Maybe my PC does not satisfy some requirement? According to the error I got in my terminal:

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: AlexNet/Sequential[features]/NNCFConv2d[0]/conv2d_0

Wan_Intel
Moderator

Hi Timosy,

Thanks for reaching out to us.

 

The following error may be due to quantizer parameter settings that are inconsistent with the constraints of the CPU configuration.

 

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: Alexnet/Sequential[features]/NNCFConv2d[0]

 

Could you please set your target device to “NONE” and see if it’s able to resolve your issue? You may refer to this GitHub thread for more information.

 

 

Regards,

Wan


timosy
New Contributor I

I appreciate your additional help, 

When I run it with "target_device": "NONE", I got the following error:

 

jsonschema.exceptions.ValidationError: 'NONE' is not one of ['ANY', 'CPU', 'GPU', 'VPU', 'TRIAL']. See documentation or /mnt/c/Users/221344/mywork/deep/openvino/venv_py39_ovino22.1/lib/python3.9/site-packages/nncf/config/schema.py for an NNCF configuration file JSON schema definition

 

With the option "target_device": "ANY", I got:

RuntimeError: Quantization parameter constraints specified in NNCF config are incompatible with HW capabilities as specified in HW config type 'CPU'. First conflicting quantizer location: AlexNet/Sequential[features]/NNCFConv2d[0]/conv2d_0

 

So, to generate an INT4 model, do I have to change the CPU config file?

Wan_Intel
Moderator

Hi Timosy,

Thanks for sharing your information with us.

 

Referring to this thread, our developer mentioned that the CPU supports only INT8 quantization; therefore, the error you encountered is expected.


For a mixed precision configuration, as you have successfully converted your model before, you should specify the VPU or TRIAL device.
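For reference, the device constraint is a top-level field of the NNCF config, roughly like this (a sketch only, reusing the fields from your script; "TRIAL" is the value that already gave you i4 weights):

    nncf_config_mpq_dict = {
        "input_info": {"sample_size": [1, 3, image_size, image_size]},
        # lifts the CPU INT8-only constraint so 4-bit quantizers are allowed
        "target_device": "TRIAL",
        "compression": {
            "algorithm": "quantization",
            "initializer": {
                "precision": {
                    "type": "hawq",
                    "bits": [4, 8],
                    "compression_ratio": 1.5,
                }
            }
        }
    }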

 

Perhaps you can open a feature request here so that our developers can provide a workaround for your issue. Hope it helps.

 

 

Regards,

Wan


timosy
New Contributor I

Thanks for the additional information.

The fact that mixed quantization is still a trial feature and the CPU is not yet supported is sad news for me, actually. I expected that I could use mixed quantization because I found that "mixed precision" is listed as supported for PyTorch on this page: https://docs.openvino.ai/latest/docs_nncf_introduction.html#neural-network-compression-framework

 

Anyway, I appreciate your information on the status of mixed precision!

Without your help, I would have spent a few days checking how to use it.

You saved me time!

 

Wan_Intel
Moderator

Hi Timosy,

Let us check with our engineering team, and we will update you once we've obtained feedback from them.

 

 

Regards,

Wan


Wan_Intel
Moderator

Hi Timosy,

Thanks for your patience.

 

We've got feedback from our development team. Currently, Mixed-Precision quantization is supported for VPU and iGPU, but it is not supported for CPU. Our development team has captured this feature in their product roadmap, but we cannot confirm the actual version releases.

 

Hope this clarifies.

 

 

Regards,

Wan


Wan_Intel
Moderator

Hi Timosy,

Thanks for your question.

This thread will no longer be monitored since we have provided information. 

If you need any additional information from Intel, please submit a new question.

 

 

Best regards,

Wan

