Hello,
The GetPerformanceCounts function tells me that my custom layer was optimized out when running on GPU, while it is not optimized out and works correctly when running on CPU. The layer is definitely not optimized out: it is a warp affine layer, and the output is the correctly warped image.
I implemented the layer using this guide: https://docs.openvinotoolkit.org/latest/openvino_docs_HOWTO_Custom_Layers_Guide.html
The implementation can be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers
On startup I configure the inference engine like this:
// Register the CPU implementation of the custom layer with the CPU plugin
auto extension_ptr = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("/opt/iie/libcustom_cpu_extensions.so");
instance_->iie_core_.AddExtension(extension_ptr, "CPU");
// Point the GPU plugin at the custom kernel configuration and enable performance counters
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "/opt/iie/custom_layers.xml"}}, "GPU");
instance_->iie_core_.SetConfig({{InferenceEngine::PluginConfigParams::KEY_PERF_COUNT, InferenceEngine::PluginConfigParams::YES}}, "GPU");
Am I doing something wrong? (The guide for custom GPU plugins seems to be out of date; for example, there is no "cldnn_global_custom_kernels.xml".)
Or is there a bug?
Thanks,
Thomas
Hi,
Could you provide your GetPerformanceCounts function's output for both CPU and GPU?
Sincerely,
Iffa
Hi,
Following is the GetPerformanceCounts output for the interesting part of the model. The custom layer is the WarpAffine layer.
CPU:
image: Input unknown_FP32 0, RealTime: 0, CPUTime: 0
loc_in: WarpAffine unknown_FP32 2, RealTime: 225, CPUTime: 225
ocr_in: WarpAffine unknown_FP32 2, RealTime: 99, CPUTime: 99
ocr_trans: Gemm gemm_any_FP32 2, RealTime: 7, CPUTime: 7
out_loc_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_in: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_ocr_trans: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
out_pred: Output unknown_FP32 0, RealTime: 0, CPUTime: 0
pred: Permute unknown_FP32 2, RealTime: 3, CPUTime: 3
rnet_trans: Input unknown_FP32 0, RealTime: 0, CPUTime: 0
GPU:
image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:rnet_trans_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
loc_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
loc_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 19, CPUTime: 12
ocr_in: WarpAffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
ocr_in_cldnn_output_postprocess: Reorder undef 2, RealTime: 11, CPUTime: 3
ocr_trans: Gemm gemm_tiled_opt 2, RealTime: 7, CPUTime: 3
ocr_trans_cldnn_in0_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_in1_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_out_reshape: Gemm undef 1, RealTime: 0, CPUTime: 0
ocr_trans_cldnn_output_postprocess: Reorder reorder_data_fast_b1 2, RealTime: 4, CPUTime: 3
pred: Permute undef 1, RealTime: 0, CPUTime: 0
pred_cldnn_output_postprocess: Reorder permute_ref 2, RealTime: 6, CPUTime: 3
rnet_trans: Input_layout undef 2, RealTime: 0, CPUTime: 0
How I formatted the output:
std::cout << layer << ": " << perf_info.layer_type << " " << perf_info.exec_type << " " << perf_info.status << ", RealTime: " << perf_info.realTime_uSec << ", CPUTime: " << perf_info.cpu_uSec << std::endl;
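For context, the surrounding loop looks roughly like this (a sketch; infer_request here stands for our InferenceEngine::InferRequest after an inference has run):
// Iterate the per-layer profiling info returned by the Inference Engine
std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> perf_counts = infer_request.GetPerformanceCounts();
for (const auto& entry : perf_counts) {
    const std::string& layer = entry.first;
    const InferenceEngine::InferenceEngineProfileInfo& perf_info = entry.second;
    // status prints as an integer: 0 = NOT_RUN, 1 = OPTIMIZED_OUT, 2 = EXECUTED
    std::cout << layer << ": " << perf_info.layer_type << " " << perf_info.exec_type << " " << perf_info.status << ", RealTime: " << perf_info.realTime_uSec << ", CPUTime: " << perf_info.cpu_uSec << std::endl;
}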
Thanks for looking into it,
Thomas
Based on our findings on the custom_layers.xml provided, there are several possible causes that can lead to the inaccurate result. Here are our recommendations:
1) Different format of the configuration file
- Your file (custom_layers.xml in accessio-gmbh/arivo_custom_openvino_layers on github.com) uses a different format compared to our documentation (How to Implement Custom GPU Operations - OpenVINO™ Toolkit); it is missing the Define parameter.
- We advise you to follow our proposed format (see the example after this list).
2) Incomplete global WorkSizes
- Your file uses an array of 2 integers, [Y, ((X+31)/32)*32], whereas our documentation describes an array of up to 3 integers (or formulas) for defining the OpenCL work sizes (How to Implement Custom GPU Operations - OpenVINO™ Toolkit).
- We advise you to use an array of 3 integers instead.
- Samples (GPU & VPU): custom_layer_example.xml and customLayerBindings.xml at commit d18073260bc742d7bf14d262d6919a1b660e2b61 in openvinotoolkit/openvino (github.com)
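For reference, the documented format looks roughly like this (this is the ReLU example from the custom GPU operations guide; the kernel, parameter, and file names come from that example, not from your layer):
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
  <Kernel entry="example_relu_kernel">
    <Source filename="custom_layer_kernel.cl"/>
    <Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
  </Kernel>
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
    <Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <CompilerOptions options="-cl-mad-enable"/>
  <WorkSizes global="X,Y,B*F"/>
</CustomLayer>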
Please do share your output once you run the amended xml.
Sincerely,
Iffa
Changes in the xml do not change the output. I made a model containing only the custom kernel, and the result is:
image: Input_layout undef 2, RealTime: 0, CPUTime: 0
input:image_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
input:matrix_cldnn_input_preprocess: Reorder undef 1, RealTime: 0, CPUTime: 0
matrix: Input_layout undef 2, RealTime: 0, CPUTime: 0
warped_image: WarpAffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_custom_postprocess: Warpaffine undef 1, RealTime: 0, CPUTime: 0
warped_image_cldnn_output_postprocess: Reorder undef 2, RealTime: 316, CPUTime: 12
So the warp affine seems to be shown as "warped_image_cldnn_output_postprocess". I guess that's the problem.
Here is the new xml. There is no Define, as I don't need any defines in my kernel:
<CustomLayer name="WarpAffine" type="SimpleGPU" version="1">
<Kernel entry="warp_affine">
<Source filename="warp_affine.cl"/>
</Kernel>
<Buffers>
<Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-index="1" type="input" port-index="1" format="BFYX"/>
<Tensor arg-index="2" type="output" port-index="0" format="BFYX"/>
</Buffers>
<CompilerOptions options="-cl-mad-enable"/>
<WorkSizes global="Y,((X+31)/32)*32,B*F" local="1,32,1"/>
</CustomLayer>
Thanks,
Thomas
Just to clarify, was your problem solved by doing that?
If so, I shall close this thread.
Sincerely,
Iffa
If there is always a similarly named Reorder layer after a custom layer that shows the correct performance counts, then the problem is solved. (Maybe this should be changed in future releases to work the same way as on CPU.)
However, I'm not sure that this is always the case.
Thanks,
Thomas
Hi,
The issue has been escalated to our developers for rectification, and they may need some collaterals to assist in their investigation and root-cause finding.
Hence, could you give us your IR collaterals with the network in which the operation is used?
Our developers are looking into this issue right now, and we will provide any updates once available.
Sincerely,
Iffa
Hi,
Attached are the model, the extension (it can also be found here: https://github.com/accessio-gmbh/arivo_custom_openvino_layers), and the result of the benchmark app.
For the benchmark_app sample to work (this is a separate problem), https://github.com/openvinotoolkit/openvino/blob/b3ac14c8d45e4abf4adce56374c23bf0a548a9da/inference-engine/samples/benchmark_app/main.cpp#L174 has to be changed to
if (!FLAGS_l.empty()) {
The benchmark_app was then run with:
./benchmark_app -m warp_affine.xml -d GPU -report_folder . -report_type detailed_counters --exec_graph_path /tmp/ -c custom_layers.xml -l libcustom_cpu_extensions.so
Thanks for looking into it,
Thomas
Hi,
Our developers have looked into the issue, and the fix is going to be implemented in one of the future releases (the developers are targeting 2022.1).
We apologize for the delay in our response, as our developers needed time to replicate the issue and come up with a fix.
Sincerely,
Iffa
Greetings,
Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.
Sincerely,
Iffa