Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision-related on Intel® platforms.

GetPerformanceCounts gives wrong result for custom operation on GPU

Senfter__Thomas
Beginner

Hello,

The GetPerformanceCounts function tells me that my custom layer was optimized out when running on GPU, while on CPU it is reported correctly. The layer is definitely not optimized out: it is a warp affine layer, and the output is the correctly warped image.

I implemented the layer using this guide: https://docs.openvinotoolkit.org/latest/openvino_docs_HOWTO_Custom_Layers_Guide.html
The implementation can be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers

On startup, I configure the Inference Engine like this:

 

// Load the CPU extension implementing the custom layer
auto extension_ptr = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("/opt/iie/libcustom_cpu_extensions.so");
instance_->iie_core_.AddExtension(extension_ptr, "CPU");
// Register the OpenCL kernel configuration for the GPU plugin
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "/opt/iie/custom_layers.xml"}}, "GPU");
// Enable per-layer performance counters on the GPU
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, InferenceEngine::PluginConfigParams::YES}}, "GPU");

 

Am I doing something wrong? (The guide for custom GPU kernels seems to be out of date; for example, there is no "cldnn_global_custom_kernels.xml".)
Or is there a bug?

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


Could you provide your GetPerformanceCounts function's output, for both CPU and GPU?



Sincerely,

Iffa


Senfter__Thomas
Beginner

Hi,

Below is the GetPerformanceCounts output for the interesting part of the model. The custom layer is the WarpAffine layer.

CPU:

image: Input unknown_FP32 0, RealTime: 0, CPUTime: 0
loc_in: WarpAffine unknown_FP32 2, RealTime: 225, CPUTime: 225
ocr_in: WarpAffine unknown_FP32 2, RealTime: 99, CPUTime: 99
ocr_trans: Gemm gemm_any_FP32 2, RealTime: 7, CPUTime: 7
out_loc_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_trans: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_pred: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
pred: Permute unknown_FP32 2, RealTime: 3, CPUTime: 3
rnet_trans: Input unknown_FP32 0, RealTime: 0, CPUTime: 0

GPU:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:rnet_trans_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
loc_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 19, CPUTime: 12
ocr_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 11, CPUTime: 3
ocr_trans: Gemm gemm_tiled_opt 2, RealTime: 7, CPUTime: 3
ocr_trans_cldnn_in0_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_in1_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_out_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_output_postprocess: Reorder reorder_data_fast_b1 2, RealTime: 4, CPUTime: 3
pred: Permute undef 1, RealTime: 0, CPUTime: 0
pred_cldnn_output_postprocess: Reorder permute_ref 2, RealTime: 6, CPUTime: 3
rnet_trans: Input_layout undef 2, RealTime: 0, CPUTime: 0

How I formatted the output:

 

std::cout << layer << ": " << perf_info.layer_type << " " << perf_info.exec_type << " " << perf_info.status << ", RealTime: " << perf_info.realTime_uSec << ", CPUTime: " << perf_info.cpu_uSec << std::endl;

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Based on our findings on the custom_layers.xml you provided, there are several possible causes for the inaccurate result. Here are our recommendations:

 

1) Different format of the configuration file

 

2) Incomplete global WorkSizes

 

Please share your output once you have run the amended XML.


Sincerely,

Iffa


Senfter__Thomas
Beginner

Changes to the XML do not change the output. I made a model containing only the custom kernel, and the result is:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:matrix_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
matrix: Input_layout undef 2, RealTime: 0, CPUTime: 0
warped_image: WarpAffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_output_postprocess: Reorder undef 2, RealTime: 316, CPUTime: 12

So the warp affine's execution time seems to be reported under "warped_image_cldnn_output_postprocess". I guess that's the problem.

Here is the new XML (no defines section, as I don't need any defines in my kernel):

 

<CustomLayer name="WarpAffine" type="SimpleGPU" version="1">
  <Kernel entry="warp_affine">
    <Source filename="warp_affine.cl"/>
  </Kernel>
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0"  format="BFYX"/>
    <Tensor arg-index="1" type="input" port-index="1"  format="BFYX"/>
    <Tensor arg-index="2" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <CompilerOptions options="-cl-mad-enable"/>
  <WorkSizes global="Y,((X+31)/32)*32,B*F" local="1,32,1"/>
</CustomLayer>

 

Thanks,
Thomas

Iffa_Intel
Moderator

Just to clarify, was your problem solved by doing that?

If so, I shall close this thread.



Sincerely,

Iffa


Senfter__Thomas
Beginner

If there is always a similarly named Reorder layer after a custom layer that shows the correct performance counts, then the problem is solved. (Maybe this should be changed in a future release to work the same way as on CPU.)

However I'm not sure if this is the case.

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


The issue has been escalated to our developers for rectification, and they may need some collateral to assist their investigation and root-cause analysis.

 

Hence, could you share your IR files for the network in which the operation is used?

 

Our developers are looking into this issue now, and we will provide updates once available.



Sincerely,

Iffa

 


Senfter__Thomas
Beginner

Hi,

attached are the model, the extension (also available here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers), and the output of the benchmark_app.

For the benchmark_app sample to work (this is another problem), https://github.com/openvinotoolkit/openvino/blob/b3ac14c8d45e4abf4adce56374c23bf0a548a9da/inference-engine/samples/benchmark_app/main.cpp#L174 has to be changed to

 

if (!FLAGS_l.empty()) {

 

the benchmark_app was then run with

 

./benchmark_app -m warp_affine.xml -d GPU -report_folder . -report_type detailed_counters --exec_graph_path /tmp/ -c custom_layers.xml -l libcustom_cpu_extensions.so

 

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Hi,

Our developers have looked into the issue, and a fix is going to be implemented in a future release (targeted for 2022.1).

 

We apologize for the delay in our response, as our developers needed time to replicate the issue and come up with a fix.

 

 

Sincerely,

Iffa

 

Iffa_Intel
Moderator

Greetings,


Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.



Sincerely,

Iffa

