Solved: Concat extremely slow on HD 600 GPU

Gerald · 04-28-2020

Hello,

I am playing with the VoVNet-Architecture which uses concatenation layers in every module:

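    # Excerpt from the model-builder class. Assumes Concatenate comes from
    # keras.layers and that self.conv_bn_relu (not shown) is, judging by its
    # name, a Conv + BatchNorm + ReLU block.
    def OSAModule(self, input_tensor, channel, bottleneck, aggr_times=5):
        x = input_tensor
        aggr = []
        # Chain aggr_times conv blocks and keep every intermediate output.
        for i in range(aggr_times):
            x = self.conv_bn_relu(x, channel)
            aggr.append(x)

        # Concatenate all intermediate outputs, then project with a 1x1 conv.
        x = Concatenate()(aggr)
        x = self.conv_bn_relu(x, bottleneck, kernel=1)
        return x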

Executing this architecture on CPU works fine, but on an HD 600 GPU device it gets horribly slow. So I created a performance report, which shows that the concatenation is "optimized out", but apparently in a way that makes things worse.

    Name              | execStatus        | layerType   | execType                       | realTime (ms)   | cpuTime (ms)
    OSA_10-0_Conv     | EXECUTED          | Convolution | convolution_gpu_bfyx_f1        | 69.320000       | 0.023000
    OSA_10-1_Conv     | EXECUTED          | Convolution | fused_conv_eltwise_gpu_ref     | 293.476000      | 0.024000
    OSA_10-2_Conv     | EXECUTED          | Convolution | fused_conv_eltwise_gpu_ref     | 293.353000      | 0.022000
    OSA_10_Concat     | OPTIMIZED_OUT     | Concat      | undef                          | 0.000000        | 0.000000
    OSA_10_Projection | EXECUTED          | Convolution | convolution_gpu_bfyx_f16_1x1   | 5.434000        | 0.024000
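
To see where the time goes, the rows of the report can be grouped by execType. The following is only a minimal sketch: it hard-codes the five rows quoted in this post, and the helper name `real_time_by_exec_type` is made up for illustration. Reading the `_ref` suffix as "reference (untuned) kernel" is an interpretation of the kernel name, not something the report itself states.

```python
from collections import defaultdict

# Rows copied from the performance report in this post (pipe-separated):
# name | execStatus | layerType | execType | realTime (ms) | cpuTime (ms)
REPORT = """\
OSA_10-0_Conv     | EXECUTED      | Convolution | convolution_gpu_bfyx_f1      | 69.320000  | 0.023000
OSA_10-1_Conv     | EXECUTED      | Convolution | fused_conv_eltwise_gpu_ref   | 293.476000 | 0.024000
OSA_10-2_Conv     | EXECUTED      | Convolution | fused_conv_eltwise_gpu_ref   | 293.353000 | 0.022000
OSA_10_Concat     | OPTIMIZED_OUT | Concat      | undef                        | 0.000000   | 0.000000
OSA_10_Projection | EXECUTED      | Convolution | convolution_gpu_bfyx_f16_1x1 | 5.434000   | 0.024000
"""


def real_time_by_exec_type(report):
    """Sum the realTime column (ms) per execType (hypothetical helper)."""
    totals = defaultdict(float)
    for line in report.splitlines():
        if not line.strip():
            continue
        cells = [cell.strip() for cell in line.split("|")]
        exec_type, real_ms = cells[3], float(cells[4])
        totals[exec_type] += real_ms
    return dict(totals)


totals = real_time_by_exec_type(REPORT)
total_ms = sum(totals.values())

# Print the kernels from most to least expensive, with their share of the total.
for exec_type, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{exec_type:30s} {ms:9.3f} ms  {ms / total_ms:6.1%}")
```

For these five rows, the two `fused_conv_eltwise_gpu_ref` convolutions account for roughly 89% of the measured time, while the Concat itself costs nothing.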

Any ideas on how to "unoptimize" this (avoid this optimization)?

System: Intel Celeron N4000 - GPU Intel® UHD Graphics 600

OS: Ubuntu 18.04

OpenVINO: 2020.2

Best Gerald

Munesh_Intel · 05-12-2020

Hi Gerald,

Greetings to you.

The VoVNet architecture is not currently a supported topology of OpenVINO.

That said, to stop the Concatenate layer from being optimized out, try adding the general (framework-agnostic) parameter --finegrain_fusing to your Model Optimizer launch script.

You can obtain more information at the following pages:

https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_Model_Optimization_Techniques.html#disable_fusing

https://docs.openvinotoolkit.org/2020.2/_docs_MO_DG_prepare_model_convert_model_Converting_Model_General.html

Please share more information about your model: whether it is an object detection or classification model, the layers used if it is a custom model, the command given to Model Optimizer to convert the trained model to Intermediate Representation (IR), sample code to run the model, and environment details (versions of Python, CMake, etc.). If possible, please share the trained model files so that we can reproduce your issue (files can be shared via Private Message).

Also, please tell us how you created the performance report that you posted.

Regards,

Munesh

Gerald · 05-14-2020

Hi Munesh,

Thanks for your answer. I'm not allowed to share the model, but I hope the following details will give you a better picture.

Source model format: Caffe 1.0 (training is done using TF 1.14; the Keras model is then converted to Caffe)

Programs used/tried: benchmark demo with --detailed_counters to create the performance report

Model code: VoV-Net for Keras

I've tried to disable the optimization by using --finegrain_fusing OSA_10_Concat,OSA_20_Concat etc., but the resulting model .xml and .bin files look exactly the same as the ones generated without the flag. The performance report also still shows the concatenation layer as optimized out.
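
One way to check that two IR exports really are the same is to compare the layer lists in the .xml files and the raw bytes of the .bin files. This is a sketch under assumptions: the helper names and the file paths in the usage note are hypothetical, and it relies only on the IR .xml describing each layer as a `<layer>` element with `name` and `type` attributes. Comparing layers rather than whole .xml files avoids false differences from any non-graph content in the file.

```python
import hashlib
import xml.etree.ElementTree as ET


def bin_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ir_layers(xml_path):
    """Return (name, type) for every <layer> element in an IR .xml file."""
    root = ET.parse(xml_path).getroot()
    return [(layer.get("name"), layer.get("type")) for layer in root.iter("layer")]


def compare_ir(xml_a, bin_a, xml_b, bin_b):
    """Compare two IR exports (hypothetical helper, not part of OpenVINO)."""
    return {
        "same_layers": ir_layers(xml_a) == ir_layers(xml_b),
        "same_weights": bin_digest(bin_a) == bin_digest(bin_b),
    }
```

With made-up paths, a call such as `compare_ir("default/model.xml", "default/model.bin", "finegrain/model.xml", "finegrain/model.bin")` returning `True` for both keys would confirm that the flag had no effect on the converted graph.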

Additionally, I tried adding the --disable_fusing flag, but that doesn't change anything either.

If I execute the same model on a MYRIAD (NCS2) device, I can see that the concatenation is being executed.

To me, this looks like some kind of on-the-fly optimization inside the GPU plugin. Could that be the case?
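
The per-device difference can be made explicit by collecting the per-layer status on each device and listing the layers whose status differs. The sketch below assumes the 2020.2-era Inference Engine Python API (`IECore`, `read_network`, `load_network`, `net.inputs`, `get_perf_counts`) and that both plugins honor the `PERF_COUNT` config key. The function names are hypothetical, and layer names reported by a plugin may not match the original names one-to-one after fusing.

```python
def layer_status(xml_path, bin_path, device):
    """Run one inference on `device` and return {layer_name: status}.

    Assumes the OpenVINO 2020.2 Python API; imported lazily so that the
    comparison helper below works without OpenVINO installed.
    """
    import numpy as np
    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model=xml_path, weights=bin_path)
    exec_net = ie.load_network(
        network=net, device_name=device, config={"PERF_COUNT": "YES"}
    )
    # Feed zeros: only the execution status matters here, not the output.
    feed = {
        name: np.zeros(info.shape, dtype=np.float32)
        for name, info in net.inputs.items()
    }
    exec_net.infer(inputs=feed)
    counts = exec_net.requests[0].get_perf_counts()
    return {name: entry["status"] for name, entry in counts.items()}


def status_differences(status_a, status_b):
    """Layers present in both reports whose execution status differs."""
    shared = sorted(status_a.keys() & status_b.keys())
    return {
        name: (status_a[name], status_b[name])
        for name in shared
        if status_a[name] != status_b[name]
    }
```

Calling `status_differences(layer_status(xml, weights, "GPU"), layer_status(xml, weights, "MYRIAD"))` would then list layers such as the Concat that one plugin executes and the other optimizes out, if the behavior described here holds.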

Munesh_Intel · 05-16-2020

Hi Gerald,

Thank you for sharing information about your model and providing the updates.

Optimization-wise, the GPU plugin supports algorithms that fuse several operations into one optimized operation. One of them is ‘Optimizing Layers Out’, in which a Concatenate layer is optimized out under certain conditions.

More information is available at the following link:

https://docs.openvinotoolkit.org/latest/_docs_IE_DG_supported_plugins_CL_DNN.html#optimizing_layers_out

Apart from that, the GPU plugin uses the Intel® Compute Library for Deep Neural Networks (clDNN) to run inference, and clDNN is not optimized for the Intel® UHD Graphics 600 processor.

The list of integrated graphics processors that clDNN is optimized for is available at the following link under the section ‘System Requirements’.

https://github.com/intel/clDNN

The following paper, ‘Accelerate Deep Learning Inference with Integrated Intel® Processor Graphics Rev 2.0’ contains more relevant information, and is available at the following link:

https://software.intel.com/content/www/us/en/develop/articles/accelerate-deep-learning-inference-with-integrated-intel-processor-graphics-rev-2-0.html

Regards,

Munesh