Inference Engine Network is loading very slow

Senfter__Thomas · 01-07-2019

Hello

I ported some CNNs from TensorFlow to OpenVino using the model converter. While most ported CNNs work fine, one loads very slowly. The code used is shown below. Loading this model (it has 29 layers and the .bin file is 3.4 MB) takes over a minute, while other CNNs of similar size load in a few seconds. The hardware is a NUC7i3.

import os
from openvino.inference_engine import IENetwork, IEPlugin

plugin = IEPlugin(device="GPU")
# Read the IR pair: topology (.xml) and weights (.bin).
net2 = IENetwork(model=os.path.join(str(output_dir), str(run_name), "net2.xml"),
                 weights=os.path.join(str(output_dir), str(run_name), "net2.bin"))
self.ocr_net = plugin.load(network=net2)  # <- this line takes over a minute

What could cause this CNN to take so long to load, and how can I find out why?

Thanks

Thomas
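
One way to narrow down where the time goes is to time each stage separately. The following is only a minimal sketch: the `timed` helper and the stage labels are hypothetical names, and stand-in functions replace the real OpenVino calls so that the snippet runs without the toolkit installed.

```python
import time
from contextlib import contextmanager

timings = {}  # stage label -> elapsed seconds

@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-ins for the real calls (assumption: replace them with
# IENetwork(model=..., weights=...) and plugin.load(network=net2)).
def read_network():
    time.sleep(0.01)
    return "net2"

def load_on_device(net):
    time.sleep(0.05)
    return "executable network"

with timed("read IR"):
    net2 = read_network()
with timed("load on device"):
    exec_net = load_on_device(net2)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

If nearly all of the time lands in the "load on device" stage, the delay is in the device plugin rather than in reading the .xml/.bin files.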

nikos1 · 01-08-2019

Hi Thomas,

Same here. Most likely this is clDNN building the OpenCL kernels for the GPU device. You can monitor and profile this activity with an OpenCL profiler. There are a few ways to work around this, but the best way is for OpenVino and clDNN to handle it. Let's submit a feature request.

Cheers,

Nikos

[ ref https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clBuildProgram.html ]

Senfter__Thomas · 01-16-2019

Yeah, it's clDNN building the kernels, which takes very long for convolution layers with large kernels.

Our layer (listed below) with the 11x128 kernel takes about 30 seconds to build, while a layer with a 1x128 kernel takes only about one second.

Is there a way to speed up the kernel building (saving the built kernel, build settings, ...)?

<layer id="36" name="ocr_conv1/convolution" precision="FP32" type="Convolution">
			<data dilations="1,1" group="1" kernel="11,128" output="128" pads_begin="2,0" pads_end="2,0" strides="1,1"/>
			<input>
				<port id="0">
					<dim>1</dim>
					<dim>1</dim>
					<dim>38</dim>
					<dim>128</dim>
				</port>
			</input>
			<output>
				<port id="3">
					<dim>1</dim>
					<dim>128</dim>
					<dim>32</dim>
					<dim>1</dim>
				</port>
			</output>
			<blobs>
				<weights offset="3439384" size="720896"/>
				<biases offset="4160280" size="512"/>
			</blobs>
		</layer>
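
As a sanity check, the numbers in this layer listing can be reproduced with the standard convolution size formula. This is only a sketch: it assumes the `kernel`, `pads_begin` and `pads_end` attributes are given in height, width order, which is consistent with the listed output dimensions. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(in_size, kernel, pad_begin, pad_end, stride=1, dilation=1):
    """Standard convolution output size along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_size + pad_begin + pad_end - effective_kernel) // stride + 1

# Values copied from the <layer id="36"> listing (height, width order).
in_channels, in_h, in_w = 1, 38, 128
kernel_h, kernel_w = 11, 128
out_channels = 128
bytes_per_value = 4  # FP32

out_h = conv_output_size(in_h, kernel_h, pad_begin=2, pad_end=2)
out_w = conv_output_size(in_w, kernel_w, pad_begin=0, pad_end=0)
weight_bytes = out_channels * in_channels * kernel_h * kernel_w * bytes_per_value
bias_bytes = out_channels * bytes_per_value

print(out_h, out_w)   # 32 1
print(weight_bytes)   # 720896
print(bias_bytes)     # 512
```

The results match the `<dim>` values and the `size` attributes of the weights and biases blobs. Each filter of the 11x128 kernel holds 1408 weights, compared with 128 for a 1x128 kernel.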

nikos1 · 01-16-2019

Hello Thomas,

> Is there a way to speed up the kernel building (saving the built kernel, build settings, ...)?

Yes, the OpenCL spec allows this, but you may need to build your own clDNN library from source, cache the OpenCL kernel binaries, and load them with clCreateProgramWithBinary followed by clBuildProgram to speed up start times. If you do that, please push your changes and submit a pull request; it will help us all.

Thanks,

nikos
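
The caching idea can be sketched independently of clDNN. The snippet below is not clDNN or OpenVino code. It is a generic, hypothetical binary cache keyed by a hash of the kernel source, build options and device name, with a stand-in function in place of the slow build. In a real OpenCL version, the binary would be fetched with clGetProgramInfo (CL_PROGRAM_BINARIES) and reloaded with clCreateProgramWithBinary.

```python
import hashlib
import tempfile
from pathlib import Path

def cache_key(source, build_options, device_name):
    """Hash everything that influences the compiled binary."""
    h = hashlib.sha256()
    for part in (source, build_options, device_name):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent parts cannot run together
    return h.hexdigest()

def get_binary(cache_dir, source, build_options, device_name, compile_fn):
    """Return (binary, cache_hit); compile and store the binary on a miss."""
    path = Path(cache_dir) / (cache_key(source, build_options, device_name) + ".bin")
    if path.exists():
        return path.read_bytes(), True
    binary = compile_fn(source, build_options)
    path.write_bytes(binary)
    return binary, False

# Stand-in for the slow OpenCL build (assumption: a real version would
# build the program and fetch its binary through the OpenCL API).
def fake_compile(source, build_options):
    return ("compiled:" + build_options + ":" + source).encode("utf-8")

cache_dir = tempfile.mkdtemp()
kernel_src = "__kernel void conv() {}"
first, hit_first = get_binary(cache_dir, kernel_src, "-cl-mad-enable", "gpu0", fake_compile)
second, hit_second = get_binary(cache_dir, kernel_src, "-cl-mad-enable", "gpu0", fake_compile)
print(hit_first, hit_second)  # False True
```

Including the build options and device name in the key matters because a binary built for one device or option set is generally not valid for another.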

Senfter__Thomas · 01-18-2019

Hi nikos

Actually, there already is a feature for caching kernel binaries; see cl_cache at https://github.com/intel/compute-runtime/blob/master/documentation/FAQ.md

Cheers,

Thomas

nikos1 · 01-18-2019

Good find, Thomas! I was not aware of that and will try it shortly. BTW, this works at a lower level than clDNN, at the OpenCL driver level, and most likely only on Linux with the NEO driver (?). Nevertheless, it would help many systems, just not those with the older non-NEO Linux driver and not Windows, unless of course the Windows driver has the same cl_cache feature.

Cheers,

Nikos

nikos1 · 01-18-2019

Just to summarize here for the benefit of other forum users:

When using -d GPU, the clDNN backend has to build binaries for its OpenCL kernels when the network is loaded.
Thomas discovered that if we create a cl_cache directory in the application's working directory, the compiled kernel binaries are cached there and loaded from cl_cache on subsequent runs.
We have verified speed gains on Linux when the NEO driver is used.
Not sure if similar functionality exists on Windows.

For example, start times for object_detection_demo_yolov3_async are almost halved when cl_cache is used:

mkdir cl_cache

# First run - cache binaries : 28 seconds

time ./object_detection_demo_yolov3_async -i test.mp4 -m frozen_darknet_yolov3.xml -d GPU -t 0.3 -pc

real    0m28.044s

# Second run : 14 seconds
time ./object_detection_demo_yolov3_async -i test.mp4 -m frozen_darknet_yolov3.xml -d GPU -t 0.3 -pc

real    0m14.872s
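
For a Python application, the directory could be created from the script itself before the network is loaded. This is a minimal sketch, assuming (as in the run above) that the driver looks for `cl_cache` in the working directory. `ensure_cl_cache` is a hypothetical helper, and the last lines only redo the arithmetic on the two timings reported above.

```python
import os
import tempfile

def ensure_cl_cache(workdir):
    """Create workdir/cl_cache if needed and return (path, number of entries)."""
    cache_dir = os.path.join(workdir, "cl_cache")
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir, len(os.listdir(cache_dir))

# Demonstrated in a temporary directory; in a real script, pass the
# directory the application is started from (os.getcwd()).
workdir = tempfile.mkdtemp()
cache_dir, entries = ensure_cl_cache(workdir)
print(entries)  # 0, nothing has been cached yet

# Relative saving between the two runs reported above.
first_run, second_run = 28.044, 14.872
saving = 1 - second_run / first_run
print(f"{saving:.0%}")  # 47%
```

If the entry count is still zero after a GPU run, the driver is probably not writing to the cache, for example because it does not support cl_cache.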

Yaniv · 02-24-2019

Hi,

I can confirm it also works on Windows: it sped up loading of our model on the GPU from 36 sec to 20 sec.

Any idea how to do it on the NCS2? It takes 2 min to load the same model to the NCS2.

Thanks,

Yaniv

Yaniv · 02-25-2019

Hi,

I can confirm it's also working on Windows.

The cache is generated only when using the GPU, but it also speeds up the NCS loading time.

Wong__Mike · 03-07-2019

@Yaniv

May I ask how you set up the caching on Windows?

I'm using OpenVino and an NCS1 on Windows with a rather simple fully-connected network (.pb converted from Keras, with only two dense ReLU layers and a linear output layer). The project is time-sensitive, so I'm trying to get the fastest possible inference time.

The problem is that loading the model with "plugin.LoadNetwork()" takes almost 300 ms, while the actual inference takes only about 3 ms. I haven't tried it on a Raspberry Pi yet, but I assume loading might take even longer there since, as people have mentioned above, the CPU might actually be building the model into a kernel for the device.

I'm on Windows (and not using OpenCL), so I don't know if the cl_cache method works. May I ask how you set it up on Windows and for the NCS? Thanks a lot!