I ported some CNNs from TensorFlow to OpenVINO using the model converter. While most of the ported CNNs work fine, one loads very slowly. The code used is shown below. Loading this model (29 layers, 3.4 MB .bin file) takes over a minute, while other CNNs of similar size load in a few seconds. The hardware is a NUC7i3.
plugin = IEPlugin(device="GPU")
net2 = IENetwork(
    model=os.path.join(str(output_dir), str(run_name), "net2.xml"),
    weights=os.path.join(str(output_dir), str(run_name), "net2.bin"),
)
self.ocr_net = plugin.load(network=net2)  # <- this line takes over a minute
What could be the reason that this CNN takes so long to load, and how can I find out why?
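One way to narrow this down is to time each stage of the pipeline separately. Here is a minimal, generic timing helper; the commented-out usage lines are hypothetical and assume the `IENetwork`/`plugin.load` calls from the snippet above:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Hypothetical usage with the calls from the question:
# net2 = timed("IENetwork", IENetwork, model=xml_path, weights=bin_path)
# self.ocr_net = timed("plugin.load", plugin.load, network=net2)
```

If `plugin.load` dominates, the time is going into device-side preparation (e.g. kernel compilation) rather than into parsing the IR files.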
Same here. Most likely this is clDNN compiling OpenCL kernels for the GPU device at load time. You can monitor and profile this activity with an OpenCL profiler. There are a few ways to work around it, but the best solution would be for OpenVINO and clDNN to handle it themselves. Let's submit a feature request.
[ ref https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clBuildProgram.html ]
Yeah, it's clDNN building the kernels, which takes particularly long for convolution layers with large kernels.
Our layer (listed below) with an 11x128 kernel takes about 30 seconds to build, while a layer with a 1x128 kernel takes only about one second.
Is there a way to speed up the kernel building (saving the build kernel, build settings,...)?
<layer id="36" name="ocr_conv1/convolution" precision="FP32" type="Convolution">
    <data dilations="1,1" group="1" kernel="11,128" output="128" pads_begin="2,0" pads_end="2,0" strides="1,1"/>
    <input>
        <port id="0">
            <dim>1</dim>
            <dim>1</dim>
            <dim>38</dim>
            <dim>128</dim>
        </port>
    </input>
    <output>
        <port id="3">
            <dim>1</dim>
            <dim>128</dim>
            <dim>32</dim>
            <dim>1</dim>
        </port>
    </output>
    <blobs>
        <weights offset="3439384" size="720896"/>
        <biases offset="4160280" size="512"/>
    </blobs>
</layer>
> Is there a way to speed up the kernel building (saving the build kernel, build settings,...)?
Yes, the OpenCL spec allows this, but you may need to build your own clDNN library from source: cache the compiled OpenCL kernel binaries (retrieved via clGetProgramInfo with CL_PROGRAM_BINARIES) and reload them with clCreateProgramWithBinary, so that start-up times improve. If you do that, please push your changes and submit a pull request; it will help us all.
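To illustrate the pattern (this is not clDNN's actual code; `compile_fn` is a placeholder for the real OpenCL build via clBuildProgram plus clGetProgramInfo(CL_PROGRAM_BINARIES)): key the cache on the kernel source plus build options, and only compile on a miss:

```python
import hashlib
import os

def build_with_cache(source, options, cache_dir, compile_fn):
    """Return a compiled kernel binary, compiling only on a cache miss.

    compile_fn(source, options) -> bytes stands in for the real OpenCL
    build (clBuildProgram + clGetProgramInfo(CL_PROGRAM_BINARIES)).
    """
    # Both the source and the build options affect the binary, so both go
    # into the cache key.
    key = hashlib.sha256((source + "\0" + options).encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()  # cache hit: skip the slow compile entirely
    binary = compile_fn(source, options)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

On the second run the expensive compile step never executes, which is exactly the effect the driver-level cl_cache discussed below achieves.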
Good find, Thomas! I was not aware of that. I will try it shortly. BTW, this is at a lower level than clDNN, at the OpenCL driver level, and it will most likely only work on Linux with the NEO driver (?). Nevertheless it would help many systems, just not older Linux non-NEO drivers and not Windows, unless of course the Windows driver has the same cl_cache feature.
Just to summarize here for the benefit of other forum users: with the NEO driver, creating a cl_cache directory in the application's working directory makes the driver cache compiled kernel binaries there, so subsequent runs skip most of the kernel compilation.
For example, the start time of object_detection_demo_yolov3_async is almost halved when cl_cache is used:
mkdir cl_cache

# First run - cache binaries: 28 seconds
time ./object_detection_demo_yolov3_async -i test.mp4 -m frozen_darknet_yolov3.xml -d GPU -t 0.3 -pc
real    0m28.044s

# Second run: 14 seconds
time ./object_detection_demo_yolov3_async -i test.mp4 -m frozen_darknet_yolov3.xml -d GPU -t 0.3 -pc
real    0m14.872s
I can confirm it also works on Windows: it sped up loading of our model on the GPU from 36 seconds to 20 seconds.
Any idea how to do this on the NCS2? It takes 2 minutes to load the same model to the NCS2.
May I ask how you set up the caching on Windows?
I'm using OpenVINO with an NCS1 on Windows and a rather simple fully-connected network (a .pb converted from Keras, with only two dense ReLU layers and a linear output layer). I'm using it in a time-sensitive project, so I'm trying to get the fastest possible inference time.
The problem is that loading the model with "plugin.LoadNetwork()" takes almost 300 ms, while the actual inference takes only about 3 ms. I haven't tried it on a Raspberry Pi yet, but I assume loading the model might take even longer there, since, as people have mentioned above, the host CPU may actually be compiling the model into kernels for the device.
I'm on Windows (and not using OpenCL), so I don't know whether the cl_cache method works. May I ask how you set it up on Windows and for the NCS? Thanks a lot!