Intel® Distribution of OpenVINO™ Toolkit
Community support and discussions about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all things computer vision-related on Intel® platforms.

GetPerformanceCounts gives wrong result for custom operation on GPU

Senfter__Thomas
Beginner

Hello,

The GetPerformanceCounts function tells me that my custom layer was optimized out when running on GPU, while on CPU it is not optimized out and works correctly. The layer is definitely not optimized out: it is a warp affine layer, and its output is the correctly warped image.

I implemented the layer using this guide: https://docs.openvinotoolkit.org/latest/openvino_docs_HOWTO_Custom_Layers_Guide.html
The implementation can be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers

On startup, I configure the Inference Engine like this:

 

auto extension_ptr = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("/opt/iie/libcustom_cpu_extensions.so");
instance_->iie_core_.AddExtension(extension_ptr, "CPU");
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "/opt/iie/custom_layers.xml"}}, "GPU");
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, InferenceEngine::PluginConfigParams::YES}}, "GPU");

 

Am I doing something wrong? (The guide for custom GPU kernels seems to be out of date; for example, there is no "cldnn_global_custom_kernels.xml".)
Or is there a bug?

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


Could you provide your GetPerformanceCounts function's output for both CPU and GPU?



Sincerely,

Iffa


Senfter__Thomas
Beginner

Hi,

Below is the GetPerformanceCounts output for the interesting part of the model. The custom layer is the WarpAffine layer.

CPU:

image: Input unknown_FP32 0, RealTime: 0, CPUTime: 0
loc_in: WarpAffine unknown_FP32 2, RealTime: 225, CPUTime: 225
ocr_in: WarpAffine unknown_FP32 2, RealTime: 99, CPUTime: 99
ocr_trans: Gemm gemm_any_FP32 2, RealTime: 7, CPUTime: 7
out_loc_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_trans: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_pred: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
pred: Permute unknown_FP32 2, RealTime: 3, CPUTime: 3
rnet_trans: Input unknown_FP32 0, RealTime: 0, CPUTime: 0

 GPU:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:rnet_trans_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
loc_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 19, CPUTime: 12
ocr_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 11, CPUTime: 3
ocr_trans: Gemm gemm_tiled_opt 2, RealTime: 7, CPUTime: 3
ocr_trans_cldnn_in0_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_in1_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_out_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_output_postprocess: Reorder reorder_data_fast_b1 2, RealTime: 4, CPUTime: 3
pred: Permute undef 1, RealTime: 0, CPUTime: 0
pred_cldnn_output_postprocess: Reorder permute_ref 2, RealTime: 6, CPUTime: 3
rnet_trans: Input_layout undef 2, RealTime: 0, CPUTime: 0

 How I formatted the output:

 

std::cout << layer << ": " << perf_info.layer_type << " " << perf_info.exec_type << " " << perf_info.status << ", RealTime: " << perf_info.realTime_uSec << ", CPUTime: " << perf_info.cpu_uSec << std::endl;
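
For reference, the numeric `status` value printed here comes from the Inference Engine `LayerStatus` enum, which (assuming the usual declaration order NOT_RUN, OPTIMIZED_OUT, EXECUTED) maps 0/1/2 to those names; that mapping is why the WarpAffine entries in the GPU output above, with status 1, read as "optimized out". A minimal sketch of a decoder, as a hypothetical helper that is not part of the Inference Engine API:

```cpp
#include <string>

// Hypothetical helper (not part of the Inference Engine API): maps the
// numeric InferenceEngineProfileInfo::status value to its LayerStatus
// name, assuming the enum order NOT_RUN, OPTIMIZED_OUT, EXECUTED.
std::string status_name(int status) {
    switch (status) {
        case 0: return "NOT_RUN";
        case 1: return "OPTIMIZED_OUT";
        case 2: return "EXECUTED";
        default: return "UNKNOWN";
    }
}
```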

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Based on our findings on the custom_layer.xml provided, there are several possible causes that can lead to the inaccurate result. Here are our recommendations:

1) Check for a different format of the configuration file.

2) Check for incomplete global WorkSizes.

Please share your output once you run the amended XML.


Sincerely,

Iffa


Senfter__Thomas
Beginner

Changes in the XML do not change the output. I made a model containing only the custom kernel, and the result is:

image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:matrix_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
matrix: Input_layout undef 2, RealTime: 0, CPUTime: 0
warped_image: WarpAffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_output_postprocess: Reorder undef 2, RealTime: 316, CPUTime: 12

So the warp affine time seems to be reported under "warped_image_cldnn_output_postprocess". I guess that's the problem.

Here is the new XML (no defines section, as I don't need any defines in my kernel):

 

<CustomLayer name="WarpAffine" type="SimpleGPU" version="1">
  <Kernel entry="warp_affine">
    <Source filename="warp_affine.cl"/>
  </Kernel>
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0"  format="BFYX"/>
    <Tensor arg-index="1" type="input" port-index="1"  format="BFYX"/>
    <Tensor arg-index="2" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <CompilerOptions options="-cl-mad-enable"/>
  <WorkSizes global="Y,((X+31)/32)*32,B*F" local="1,32,1"/>
</CustomLayer>
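
As a side note on the WorkSizes line: the ((X+31)/32)*32 expression rounds the global X dimension up to the next multiple of the local size 32, since OpenCL requires each global work size to be divisible by the corresponding local size. A minimal sketch of that rounding, for illustration only:

```cpp
#include <cstddef>

// Round x up to the next multiple of `multiple`; mirrors the
// ((X+31)/32)*32 expression in the <WorkSizes> global attribute,
// which keeps the global work size divisible by the local size.
std::size_t round_up(std::size_t x, std::size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}
```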

 

Thanks,
Thomas

Iffa_Intel
Moderator

Just to clarify, was your problem solved by doing that?

If so, I shall close this thread.



Sincerely,

Iffa


Senfter__Thomas
Beginner

If there is always a similarly named Reorder layer after a custom layer that shows the correct performance counts, then the problem is solved. (Maybe this should be changed in future releases to work the same as on CPU.)

However, I'm not sure whether this is always the case.

Thanks,

Thomas

Iffa_Intel
Moderator

Hi,


The issue has been escalated to our developers for rectification, and they need some collateral to assist with their investigation and root-cause analysis.

 

Hence, could you give us your IR collateral for the network in which the operation is used?

 

Our developers are looking into this issue right now, and we will provide updates once available.



Sincerely,

Iffa

 


Senfter__Thomas
Beginner

Hi,

Attached are the model, the extension (it can also be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers), and the output of the benchmark app.

For the benchmark_app sample to work (this is another problem), https://github.com/openvinotoolkit/openvino/blob/b3ac14c8d45e4abf4adce56374c23bf0a548a9da/inference-... has to be changed to

 

if (!FLAGS_l.empty()) {

 

the benchmark_app was then run with

 

./benchmark_app -m warp_affine.xml -d GPU -report_folder . -report_type detailed_counters --exec_graph_path /tmp/ -c custom_layers.xml -l libcustom_cpu_extensions.so

 

 

Thanks for looking into it,
Thomas

Iffa_Intel
Moderator

Hi,

Our development team has looked into the issue, and the fix is going to be implemented in a future release (the developers are targeting 2022.1).

 

We apologize for the delay in our response, as our developers needed time to replicate the issue and come up with a fix.

 

 

Sincerely,

Iffa

 

Iffa_Intel
Moderator

Greetings,


Intel will no longer monitor this thread, since we have provided a solution. If you need any additional information from Intel, please submit a new question.



Sincerely,

Iffa

