A pruned model can easily be about one tenth the size of the original model. However, there doesn't seem to be much latency improvement when running inference on a pruned model versus the original one. 1. Why would that be the case? 2. Are there any plans to speed up the processing of pruned models? Thank you very much for your help.
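One likely contributor to the lack of speedup (a general observation, not specific to any one toolkit): unstructured pruning zeroes out individual weights but leaves the tensor shapes unchanged, so a dense compute kernel performs exactly the same number of multiply-adds on the zeros. The tiny pure-Python sketch below illustrates this; the function names are hypothetical, for illustration only.

```python
# Illustration: a dense matrix-vector product does the same amount of work
# whether or not individual weights have been pruned (zeroed out).
def dense_matvec(W, x):
    """Naive dense mat-vec; counts every multiply-add it performs."""
    ops = 0
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # executed even when w == 0.0
            ops += 1
        y.append(acc)
    return y, ops

W_dense  = [[1.0, 2.0], [3.0, 4.0]]
W_pruned = [[1.0, 0.0], [0.0, 4.0]]  # 50% of weights zeroed by pruning
x = [1.0, 1.0]

_, ops_dense  = dense_matvec(W_dense, x)
_, ops_pruned = dense_matvec(W_pruned, x)
# Same op count either way: dense kernels don't skip zeros.
```

Realizing a latency win from sparsity generally requires sparse-aware kernels or structured pruning that actually shrinks the tensor dimensions.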
Generally, using the OpenVINO Model Optimizer to convert a native model into Intermediate Representation (IR) improves the model's performance, since the model is optimized during the conversion process.
You may refer to this page for further info: https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.h...
In addition, the OpenVINO Post-Training Optimization Tool (POT) can be used to accelerate the inference of DL models by applying special methods (such as post-training quantization) without retraining the model.
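As a rough sketch of what driving POT looks like, the dictionary below mirrors the shape of a typical POT configuration for the DefaultQuantization algorithm. The exact keys and accepted values (e.g. `preset`, `stat_subset_size`) should be checked against the POT documentation for your OpenVINO version; this is an illustration, not a verified config.

```python
# Hedged sketch of a POT-style compression config (verify keys against
# the POT docs for your OpenVINO release before using).
pot_config = {
    "compression": {
        "target_device": "CPU",
        "algorithms": [
            {
                "name": "DefaultQuantization",   # post-training quantization
                "params": {
                    "preset": "performance",      # favor latency over accuracy
                    "stat_subset_size": 300,      # calibration samples
                },
            }
        ],
    },
}
```

The key point is that quantization happens after training, using only a small calibration subset, which is why no retraining is needed.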
This is the official guide:
Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.