YOLO and Facenet on CPU (MKL-DNN)

nikos1 · ‎08-31-2018

Thank you for YOLO and Facenet support in R3. Model optimizer runs fine and execution for both FP16 and FP32 is smooth on GPU devices (clDNN).

One issue we are experiencing is with FP32 on CPU device (MKL-DNN plug-in). We get various crashes on both Windows and Linux. Is it a supported configuration?

Severine_H_Intel · ‎09-03-2018

Hi Nikos,

Yes, FP32 is supported on CPU. Which crashes exactly do you experience? Can you report them here?

Best,

Severine

nikos1 · ‎09-04-2018

Hi Severine, thank you for confirming that FP32 CPU inference of YOLO and Facenet is supported. I seem to be having trouble properly linking and using the required intel64/libcpu_extension*so, and loading the YOLO / Facenet networks fails on both Windows and Ubuntu. When I debug the crash I do not get meaningful information because I do not have symbols. Let me investigate a bit more; I will update soon.

nikos1 · ‎09-16-2018

The problem occurs only on the CPU path and is related to output/YoloRegion: if I remove that layer, CPU runs fine too.

The GPU (clDNN) path, selected with -d GPU, works fine. For example:

./object_detection_demo_ssd_async -m  tiny-yolo.xml  -i test.mp4   -d GPU -pc

performance counts:

0-convolutional               EXECUTED       layerType: Convolution        realTime: 1215       cpu: 3              execType: convolution_gpu_bfyx_os_iyx_osv16
11-maxpool                    EXECUTED       layerType: Pooling            realTime: 49         cpu: 2              execType: pooling_gpu_bfyx_block_opt
12-convolutional              EXECUTED       layerType: Convolution        realTime: 1312       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
14-maxpool                    EXECUTED       layerType: Pooling            realTime: 37         cpu: 3              execType: pooling_gpu_bfyx_block_opt
15-convolutional              EXECUTED       layerType: Convolution        realTime: 1299       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
17-maxpool                    EXECUTED       layerType: Pooling            realTime: 34         cpu: 2              execType: pooling_gpu_bfyx_block_opt
18-convolutional              EXECUTED       layerType: Convolution        realTime: 4119       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
2-maxpool                     EXECUTED       layerType: Pooling            realTime: 642        cpu: 2              execType: pooling_gpu_bfyx_block_opt
20-convolutional              EXECUTED       layerType: Convolution        realTime: 8201       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
22-convolutional              EXECUTED       layerType: Convolution        realTime: 102        cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
3-convolutional               EXECUTED       layerType: Convolution        realTime: 1243       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
5-maxpool                     EXECUTED       layerType: Pooling            realTime: 162        cpu: 2              execType: pooling_gpu_bfyx_block_opt
6-convolutional               EXECUTED       layerType: Convolution        realTime: 1164       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
8-maxpool                     EXECUTED       layerType: Pooling            realTime: 90         cpu: 2              execType: pooling_gpu_bfyx_block_opt
9-convolutional               EXECUTED       layerType: Convolution        realTime: 1145       cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
LeakyReLU_                    NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_372                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_373                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_374                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_375                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_376                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_377                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
LeakyReLU_378                 NOT_RUN        layerType: ReLU               realTime: 0          cpu: 0              execType: undef
input_cldnn_input_preprocess  EXECUTED       layerType: Reorder            realTime: 143        cpu: 6              execType: reorder_data
output/YoloRegion             NOT_RUN        layerType: RegionYolo         realTime: 0          cpu: 0              execType: undef
output/YoloRegion_cldnn_ou... EXECUTED       layerType: Reorder            realTime: 152        cpu: 2              execType: region_yolo_gpu_ref
Total time: 21109    microseconds
[ INFO ] Execution successful

The CPU path, however, seems to fail at model load if I keep output/YoloRegion. Is YoloRegion supported in MKLDNNPlugin?

	API version ............ 1.2
	Build .................. lnx_20180510
	Description ....... MKLDNNPlugin
[ INFO ] Loading network files
[ INFO ] Batch size is forced to  1.
[ INFO ] Checking that the inputs are as the sample expects
[ INFO ] Checking that the outputs are as the sample expects
[ INFO ] Loading model to the plugin
[ ERROR ] std::exception

Severine_H_Intel · ‎09-17-2018

Dear Nikos,

I could reproduce your issue. To run the model through the sample, I had to comment out a few lines, for example all the checks on the model and on the output size. Did you do the same? Tell me if not. I need to investigate this first, because it shows that the YOLO model is not completely suited to the sample, which might explain the errors we then see with the CPU plugin.

Best,

Severine

nikos1 · ‎09-17-2018

Hi Severine,

Thank you for confirming the repro with the CPU plug-in. To confirm: I had to slightly modify object_detection_demo_ssd_async, as you suggested. I am sorry I forgot to mention that. This made it possible to load and run tiny YOLO on the GPU device without any issues.

The same code, however, fails on CPU. One workaround would be to edit the generated xml and remove "output/YoloRegion":

                <layer id="28" name="output/YoloRegion" precision="FP32" type="RegionYolo">
                        <data axis="1" classes="20" coords="4" do_softmax="1" end_axis="3" num="3"/>
                        <input>
                                <port id="0">
                                        <dim>1</dim>
                                        <dim>30</dim>
                                        <dim>26</dim>
                                        <dim>26</dim>
                                </port>
                        </input>
                        <output>
                                <port id="1">
                                        <dim>1</dim>
                                        <dim>20280</dim>
                                </port>
                        </output>
                </layer>

and also the edge

                <edge from-layer="27" from-port="3" to-layer="28" to-port="0"/>

Then the network runs fine on CPU too, but YoloRegion has to be implemented separately.
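
For anyone who wants to script this edit rather than do it by hand, here is a minimal Python sketch using only the standard library. It removes a layer by name together with every edge that touches it. The helper name remove_layer and the tiny SAMPLE_IR document are made up for illustration (the second layer and edge in it are placeholders, not taken from the real tiny-yolo.xml), and I am assuming the layer has no weights in the .bin file, which is consistent with the layer definition shown above having no blobs section.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a generated IR file.
SAMPLE_IR = """<net>
  <layers>
    <layer id="27" name="placeholder-previous-layer" precision="FP32" type="Convolution"/>
    <layer id="28" name="output/YoloRegion" precision="FP32" type="RegionYolo"/>
  </layers>
  <edges>
    <edge from-layer="26" from-port="1" to-layer="27" to-port="0"/>
    <edge from-layer="27" from-port="3" to-layer="28" to-port="0"/>
  </edges>
</net>"""


def remove_layer(ir_xml: str, layer_name: str) -> str:
    """Return IR XML text without the named layer and without its edges."""
    root = ET.fromstring(ir_xml)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Find the layer(s) to drop and remember their ids.
    doomed = [el for el in root.iter("layer") if el.get("name") == layer_name]
    if not doomed:
        raise ValueError(f"no layer named {layer_name!r} in the IR")
    ids = {el.get("id") for el in doomed}

    # Remove every edge that starts or ends at one of those layers.
    for edge in list(root.iter("edge")):
        if edge.get("from-layer") in ids or edge.get("to-layer") in ids:
            parent_of[edge].remove(edge)

    # Finally remove the layers themselves.
    for el in doomed:
        parent_of[el].remove(el)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(remove_layer(SAMPLE_IR, "output/YoloRegion"))
```

On a real file you would read the xml from disk, pass the text through remove_layer, and write the result to a new file so the original IR stays untouched. The layer that used to feed output/YoloRegion then becomes the network output.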

In a future SDK release it would be nice to have a new python or C++ sample application that demonstrates end-to-end YOLO detection. Something like a new object_detection_demo_yolo_async would be nice.

Severine_H_Intel · ‎09-19-2018

Hi Nikos,

I analyzed the tiny-yolo output and realized it is not suited to the sample. The sample expects a 4-dimensional output, while the tiny-yolo output is 2-dimensional. Compiling the samples in Debug mode makes this issue more apparent, as they then crash on both CPU and GPU.

In Release mode, the sample shows unexpected behavior and does not crash even when a vector is accessed out of its range. This is what we were experiencing on CPU and GPU (in my case, GPU works about half the time).

The issue is not the model but the sample: it is not suited to this model, and the way it reads the output needs to be changed.
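
To illustrate the kind of check that was commented out, here is a small Python sketch of a shape guard run before the output is parsed. The function name check_ssd_output_shape is made up, and the layout [1, 1, N, 7] for an SSD detection output is my assumption about what the sample expects; the tiny-yolo shape (1, 20280) is the one from the layer definition earlier in this thread.

```python
def check_ssd_output_shape(dims):
    """Return the number of detection slots N for an SSD-style output.

    Assumes the SSD layout [1, 1, N, 7]; raises ValueError for anything
    else instead of letting the parser read past the end of the data.
    """
    dims = tuple(dims)
    if len(dims) != 4:
        raise ValueError(f"expected a 4-D output, got {len(dims)}-D {dims}")
    if dims[-1] != 7:
        raise ValueError(f"expected 7 values per detection, got {dims[-1]}")
    return dims[2]


if __name__ == "__main__":
    print(check_ssd_output_shape([1, 1, 200, 7]))  # SSD-style shape passes
    try:
        check_ssd_output_shape([1, 20280])  # tiny-yolo shape from this thread
    except ValueError as err:
        print("rejected:", err)
```

Failing early like this is the behavior the Debug build exposes; with the check removed, a Release build can keep running on out-of-range reads and give results that differ from run to run.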

Best,

Severine

Kamarol__Amalina1 · ‎09-25-2018

Hi Nikos, would you mind sharing your modified cpp code? I have been trying, but I still have a problem reading the output from YOLO.