Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision-related on Intel® platforms.

GetPerformanceCounts gives wrong result for custom operation on GPU

Senfter__Thomas
Beginner

Hello,

The GetPerformanceCounts function tells me that my custom layer was optimized out when running on GPU, while on CPU it is reported correctly. The layer is definitely not optimized out: it is a warp affine layer, and the output is the correctly warped image.

I implemented the layer using this guide: https://docs.openvinotoolkit.org/latest/openvino_docs_HOWTO_Custom_Layers_Guide.html
The implementation can be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers

On startup, I configure the Inference Engine like this:

 

// Load the CPU extension implementing the custom layer
auto extension_ptr = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("/opt/iie/libcustom_cpu_extensions.so");
instance_->iie_core_.AddExtension(extension_ptr, "CPU");
// Register the OpenCL kernel configuration for the GPU plugin
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "/opt/iie/custom_layers.xml"}}, "GPU");
// Enable per-layer performance counters on the GPU
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, InferenceEngine::PluginConfigParams::YES}}, "GPU");

 

Am I doing something wrong? (The guide for custom GPU kernels seems to be out of date; for example, there is no "cldnn_global_custom_kernels.xml".)
Or is there a bug?

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


Could you provide your GetPerformanceCounts function's output, for both CPU and GPU?



Sincerely,

Iffa


Senfter__Thomas
Beginner

Hi,

Below is the GetPerformanceCounts output for the interesting part of the model. The custom layer is the WarpAffine layer.

CPU:

image: Input unknown_FP32 0, RealTime: 0, CPUTime: 0
loc_in: WarpAffine unknown_FP32 2, RealTime: 225, CPUTime: 225
ocr_in: WarpAffine unknown_FP32 2, RealTime: 99, CPUTime: 99
ocr_trans: Gemm gemm_any_FP32 2, RealTime: 7, CPUTime: 7
out_loc_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_trans: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_pred: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
pred: Permute unknown_FP32 2, RealTime: 3, CPUTime: 3
rnet_trans: Input unknown_FP32 0, RealTime: 0, CPUTime: 0

GPU:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:rnet_trans_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
loc_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 19, CPUTime: 12
ocr_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 11, CPUTime: 3
ocr_trans: Gemm gemm_tiled_opt 2, RealTime: 7, CPUTime: 3
ocr_trans_cldnn_in0_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_in1_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_out_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_output_postprocess: Reorder reorder_data_fast_b1 2, RealTime: 4, CPUTime: 3
pred: Permute undef 1, RealTime: 0, CPUTime: 0
pred_cldnn_output_postprocess: Reorder permute_ref 2, RealTime: 6, CPUTime: 3
rnet_trans: Input_layout undef 2, RealTime: 0, CPUTime: 0

How I formatted the output:

 

std::cout << layer << ": " << perf_info.layer_type << " " << perf_info.exec_type << " " << perf_info.status << ", RealTime: " << perf_info.realTime_uSec << ", CPUTime: " << perf_info.cpu_uSec << std::endl;

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Based on our findings on the custom_layers.xml you provided, there are several possible causes for the inaccurate result. Here are our recommendations:

 

1) Different format of the configuration file

 

2) Incomplete global WorkSizes

 

Please share your output once you have run the amended XML.


Sincerely,

Iffa


Senfter__Thomas
Beginner

Changes to the XML do not change the output. I made a model containing only the custom kernel, and the result is:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:matrix_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
matrix: Input_layout undef 2, RealTime: 0, CPUTime: 0
warped_image: WarpAffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_output_postprocess: Reorder undef 2, RealTime: 316, CPUTime: 12

So the warp affine's execution time seems to be reported under "warped_image_cldnn_output_postprocess". I guess that's the problem.

Here is the new XML (no defines section, as I don't need any defines in my kernel):

 

<CustomLayer name="WarpAffine" type="SimpleGPU" version="1">
  <Kernel entry="warp_affine">
    <Source filename="warp_affine.cl"/>
  </Kernel>
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0"  format="BFYX"/>
    <Tensor arg-index="1" type="input" port-index="1"  format="BFYX"/>
    <Tensor arg-index="2" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <CompilerOptions options="-cl-mad-enable"/>
  <WorkSizes global="Y,((X+31)/32)*32,B*F" local="1,32,1"/>
</CustomLayer>

 

Thanks,
Thomas

Iffa_Intel
Moderator

Just to clarify, was your problem solved by doing that?

If so, I shall close this thread.



Sincerely,

Iffa


Senfter__Thomas
Beginner

If there is always a similarly named Reorder layer after a custom layer that shows the correct performance counts, then the problem is solved. (Maybe this should be changed in a future release to work the same way as on CPU.)

However I'm not sure if this is the case.

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


The issue has been escalated to our developers for rectification, and they may need some collateral to assist their investigation and root-cause analysis.

 

Hence, could you share your IR files for the network in which the operation is used?

 

Our developers are looking into this issue now, and we will provide updates once available.



Sincerely,

Iffa

 


Senfter__Thomas
Beginner

Hi,

attached are the model, the extension (also available here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers), and the output of the benchmark_app.

For the benchmark_app sample to work (this is another problem), https://github.com/openvinotoolkit/openvino/blob/b3ac14c8d45e4abf4adce56374c23bf0a548a9da/inference-engine/samples/benchmark_app/main.cpp#L174 has to be changed to

 

if (!FLAGS_l.empty()) {

 

the benchmark_app was then run with

 

./benchmark_app -m warp_affine.xml -d GPU -report_folder . -report_type detailed_counters --exec_graph_path /tmp/ -c custom_layers.xml -l libcustom_cpu_extensions.so

 

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Hi,

Our developers have looked into the issue, and a fix is going to be implemented in a future release (targeted for 2022.1).

 

We apologize for the delay in our response, as our developers needed time to replicate the issue and come up with a fix.

 

 

Sincerely,

Iffa

 

Iffa_Intel
Moderator

Greetings,


Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.



Sincerely,

Iffa

