Convert model input layer to INT16

Anand_Viswanath

Hi,

 

I have a frozen FP32 TensorFlow model that I want to convert to meet the following requirements:

Input layer: signed 16-bit integer

Matrix weights: signed 8-bit integer

Bias weights: signed 32-bit integer

After quantization, the input layer is still FP32. Is there a way to meet these requirements with Intel Neural Compressor?

Regards,

Anand Viswanath A

Rahila_T_Intel

Hi Anand,

 

Thank you for posting in Intel Communities.

 

Quantization is the replacement of floating-point arithmetic computations (FP32) with integer arithmetic (INT8). Using lower-precision data reduces memory bandwidth requirements and improves performance.

 

8-bit computations (INT8) offer better performance than higher-precision computations (FP32) because more data can be processed by a single processor instruction. Lower-precision data also requires less data movement, which reduces the memory bandwidth needed.
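
To make the FP32-to-integer mapping concrete, here is a minimal sketch of symmetric quantization in plain Python. It is not Neural Compressor's API: the names `quantize_symmetric` and `dequantize` are hypothetical, and a single per-tensor scale is assumed. The `num_bits` parameter shows that the same mapping applies to 8-bit and 16-bit signed integers.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integers of width num_bits using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 32767 for 16 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical FP32 weights, used only for illustration.
weights = [0.5, -1.0, 0.25, 0.0]

q8, s8 = quantize_symmetric(weights, 8)     # signed 8-bit representation
q16, s16 = quantize_symmetric(weights, 16)  # signed 16-bit representation

print("int8 :", q8, "scale", s8)
print("int16:", q16, "scale", s16)
print("int8 round trip:", dequantize(q8, s8))
```

The 16-bit version uses a much smaller scale, so its rounding error is smaller, at the cost of twice the storage per value.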

 

Intel Neural Compressor supports an automatic quantization tuning flow that converts quantizable layers to INT8 and lets users control model accuracy and performance.

 

Intel® Neural Compressor Quantization Working Flow

====================================================

[Image: quantization working flow]

 

 

Quantization methods include the following three types:


 

Post-Training Static Quantization:

---------------------------------

Performs quantization on an already trained model. It requires an additional pass over a calibration dataset, and only the activations need calibration.
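
As a rough illustration of the calibration pass, the sketch below records the largest absolute activation seen in a few calibration batches and fixes the scale from it. This is a simplified min/max calibration in plain Python, not Neural Compressor's implementation. The function names and sample data are hypothetical.

```python
def calibrate_activation_scale(calibration_batches, num_bits=8):
    """Derive one fixed activation scale from the range seen during calibration."""
    qmax = 2 ** (num_bits - 1) - 1
    observed_max = 0.0
    for batch in calibration_batches:
        for value in batch:
            observed_max = max(observed_max, abs(value))
    return observed_max / qmax if observed_max > 0 else 1.0


def quantize_with_fixed_scale(values, scale, num_bits=8):
    """Quantize with a scale chosen ahead of time; out-of-range values saturate."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical activations collected while running the calibration dataset.
calibration_batches = [[0.1, -0.4, 0.9], [2.0, -1.5, 0.3]]
activation_scale = calibrate_activation_scale(calibration_batches)

# At inference time the scale no longer changes, so inputs larger than
# anything seen during calibration are clamped to the integer limits.
quantized = quantize_with_fixed_scale([1.0, 4.0, -4.0], activation_scale)
print(activation_scale, quantized)
```

Because the scale is frozen after calibration, the calibration data should be representative of the inputs expected in production.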


Post-Training Dynamic Quantization:

----------------------------------

Multiplies input values by a scaling factor and rounds the result to the nearest integer. The scale factor for activations is determined dynamically from the data range observed at runtime. Weights are quantized ahead of time, but activations are quantized dynamically during inference.
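
The difference from the static case can be sketched as follows: the weight scale is fixed in advance, while the activation scale is recomputed for every input. This is a plain-Python illustration under simplifying assumptions (symmetric per-tensor scales, a single dot product), not Neural Compressor's implementation. All names and values are hypothetical.

```python
def dynamic_quantize(activations, num_bits=8):
    """Quantize one input using a scale computed from that input alone."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max((abs(a) for a in activations), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(a / scale))) for a in activations]
    return quantized, scale


# Weights are quantized ahead of time (hypothetical values).
WEIGHT_SCALE = 0.01
QUANTIZED_WEIGHTS = [50, -100, 25]  # stand for 0.5, -1.0, 0.25


def dynamic_dot(activations):
    """Integer dot product whose activation scale is chosen at runtime."""
    q_act, act_scale = dynamic_quantize(activations)
    accumulator = sum(qa * qw for qa, qw in zip(q_act, QUANTIZED_WEIGHTS))
    # Convert the integer accumulator back to a float result.
    return accumulator * act_scale * WEIGHT_SCALE


small_result = dynamic_dot([0.1, 0.2, 0.3])
large_result = dynamic_dot([10.0, 20.0, 30.0])
_, small_scale = dynamic_quantize([0.1, 0.2, 0.3])
_, large_scale = dynamic_quantize([10.0, 20.0, 30.0])
print(small_result, large_result, small_scale, large_scale)
```

The two inputs differ by a factor of 100 yet both use the full integer range, because each gets its own scale. No calibration dataset is needed, but the scale has to be computed on every inference call.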

 


Quantization-aware Training (QAT):

---------------------------------

Quantizes the model during training and typically provides higher accuracy than post-training quantization. However, QAT may require additional hyper-parameter tuning and can take longer to reach deployment.
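
One common way to quantize during training is "fake quantization": the forward pass uses the quantize-then-dequantize value of each weight, while the gradient is passed through as if no rounding had happened (a straight-through estimator). The toy example below fits a single weight this way in plain Python. It is a sketch of the general idea, not Neural Compressor's QAT flow, and the function names, data, and hyper-parameters are hypothetical.

```python
def fake_quantize(value, scale, num_bits=8):
    """Quantize then immediately dequantize, so the result stays a float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(value / scale)))
    return q * scale


def train_weight(x, y, scale, steps=200, learning_rate=0.05):
    """Fit w so that fake_quantize(w) * x approximates y."""
    w = 0.0
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # forward pass sees the quantized weight
        error = w_q * x - y
        # Straight-through estimator: treat d(w_q)/dw as 1 for the update.
        gradient = 2 * error * x
        w -= learning_rate * gradient
    return w, fake_quantize(w, scale)


# Hypothetical target: y = 0.73 * x, with weights restricted to multiples of 0.1.
latent_weight, quantized_weight = train_weight(x=1.0, y=0.73, scale=0.1)
print(latent_weight, quantized_weight)
```

The latent float weight keeps receiving small updates, while the value the model actually uses stays on the quantization grid, so training can adapt to the rounding error.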


 

Please refer to the links below:

https://www.intel.com/content/www/us/en/developer/articles/technical/lower-numerical-precision-deep-learning-inference-and-training.html

https://github.com/intel/neural-compressor/blob/master/docs/bench.md

https://github.com/intel/neural-compressor/blob/master/docs/Quantization.md

https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Quantizing-ONNX-Models-using-Intel-Neural-Compressor/post/1355237

 

 

I hope this answers your question.

Thanks

Rahila

 

Rahila_T_Intel

Hi,


We have not heard back from you. Please let me know whether we can go ahead and close this case.


Thanks


Rahila_T_Intel

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question. 


Thanks

