Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.
6571 Discussions

Inferencing large images using NCS' in parallel

idata
Employee
1,340 Views

I'm trying to use 4 NCS devices to run inference over a large image. Currently I break the large image into 32x32-pixel tiles and load one tile onto each NCS using LoadTensor(). Once the tiles are loaded I iterate over the devices and call GetResult(), at which point each result is added to a 2D array of probabilities. I repeat this process until the entire large image has been inferenced, like so:

 

while len(inputs) != 0:
    graph_handle[0].LoadTensor(inputs.pop())
    graph_handle[1].LoadTensor(inputs.pop())
    graph_handle[2].LoadTensor(inputs.pop())
    graph_handle[3].LoadTensor(inputs.pop())
    res1 = graph_handle[0].GetResult(…)
    res2 = graph_handle[1].GetResult(…)
    res3 = graph_handle[2].GetResult(…)
    res4 = graph_handle[3].GetResult(…)
    # process results

 

As you can see, this means the inferencing happens sequentially: only one NCS is in use at any given time. Is there any way to have EACH of these devices managed by a parallel process that continually calls LoadTensor() and GetResult() (or any other way to have each NCS inferencing in parallel)? I would like each NCS making inferences in parallel with the others to reduce processing time, as the main bottleneck is the inferencing (the input data is streamed in at a faster rate than the inferences occur).
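For reference, the 32x32 tiling step described above could be sketched as follows. `tile_image` is an illustrative helper, not part of the NCSDK, and it assumes the image dimensions are exact multiples of the tile size:

```python
import numpy as np

def tile_image(img, patch=32):
    """Split an HxWx3 image into patch x patch float16 tiles (the NCS input format)."""
    h, w, _ = img.shape
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(img[y:y + patch, x:x + patch].astype(np.float16))
    return tiles

inputs = tile_image(np.zeros((64, 64, 3)))
print(len(inputs), inputs[0].shape, inputs[0].dtype)  # 4 (32, 32, 3) float16
```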

 

I've tried using os.fork() and Python's multiprocessing module with pipes so that one process manages one NCS, but each time GetResult() inevitably fails when called. I have tried initialising the devices (loading the compiled graph and opening the device) both before the fork in the parent process and after the fork in the child process, but either way GetResult() still fails when called.

 

My current results on a subset of the large image suggest that the majority of the time is spent inferencing, so ideally this time could be quartered by using 4 NCS devices:

 

Time spent:

- evaluating (total): 0:04:09.721723
- receiving & formatting input: 0:00:31.952764
- loading tensors: 0:00:15.504957
- inferencing: 0:03:22.264002

 

Is this possible? Any advice would be much appreciated.

5 Replies
idata
Employee
1,064 Views

@Isaac You can take a look at the examples in the Ncappzoo (https://github.com/movidius/ncappzoo/tree/master/apps). Some of the examples, like our MultiStick GoogLeNet example (https://github.com/movidius/ncappzoo/blob/master/apps/MultiStick_GoogLeNet/MultiStick_GoogLeNet.py), use threading to run multiple inferences from multiple sticks.

idata
Employee
1,064 Views

@Tome_at_Intel I am able to successfully run the examples you've provided, thanks.

 

I was initially unable to run my own implementation; every time I called GetResult() in any parallel worker I'd receive the following error:

 

File "/usr/local/lib/python3.5/dist-packages/mvnc/mvncapi.py", line 264, in GetResult
    raise Exception(Status(status))
Exception: mvncStatus.ERROR

 

which is followed by the device(s) disconnecting and not reconnecting (dmesg -w output):

 

[95834.145350] usb 2-2.2: USB disconnect, device number 60
[95834.224190] usb 1-2.1.3: USB disconnect, device number 74

 

However, the problem seems to be fixed by introducing a small wait between LoadTensor() and GetResult():

 

GRAPH_HANDLES[device_number].LoadTensor(image_input, image_loc_string)
time.sleep(0.08)
output, image_loc_string = GRAPH_HANDLES[device_number].GetResult()

 

The custom TensorFlow graph I'm using is very simple (3 conv layers, 2 dense layers), and the images I'm loading are 32x32x3 float16s; I'm not sure if that has anything to do with it. Note that if I reduce the wait to less than 0.06 seconds the original error recurs, and 0.08 seconds seems to be the minimum.

 

Would you know of any other workaround? I'll be running inferences in batches of 100,000+ images, so a 0.08-second wait per inference will begin to add up. Perhaps I can make the graph more complex and see if that makes any difference.
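To put a rough number on that concern (a back-of-envelope estimate, assuming the 0.08 s wait is paid serially for every inference on a single stick):

```python
# Approximate total time spent sleeping over a large batch.
images = 100_000
sleep_per_inference = 0.08  # seconds

total_sleep_s = images * sleep_per_inference
print(total_sleep_s)         # 8000.0 seconds
print(total_sleep_s / 3600)  # ~2.22 hours of added latency
```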

idata
Employee
1,064 Views

By switching from multiprocessing's Process to threading's Thread, I was able to get the NCS parallelism working.
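The working arrangement can be sketched roughly as follows: one Thread per stick, all pulling tiles from a shared queue. The `infer` stub below stands in for the real per-stick LoadTensor()/GetResult() calls, and the names (`infer`, `worker`, the queue layout) are illustrative rather than taken from the NCSDK:

```python
import queue
import threading

NUM_DEVICES = 4

def infer(device_number, tile):
    # Stand-in for the real per-stick calls, e.g.:
    #   GRAPH_HANDLES[device_number].LoadTensor(tile, user_obj)
    #   output, user_obj = GRAPH_HANDLES[device_number].GetResult()
    return (device_number, tile)  # placeholder result

def worker(device_number, tiles, results):
    # Each thread owns one stick and keeps it busy until the queue drains.
    while True:
        try:
            tile = tiles.get_nowait()
        except queue.Empty:
            return
        results.put(infer(device_number, tile))

tiles = queue.Queue()
for t in range(100):
    tiles.put(t)
results = queue.Queue()

threads = [threading.Thread(target=worker, args=(d, tiles, results))
           for d in range(NUM_DEVICES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.qsize())  # 100
```

Opening the devices and allocating the graphs in the main thread before starting the workers, then sharing the handles with the threads, matches what worked here; with fork-based processes the child's copy of the device handle evidently did not survive.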

 

I am still unsure why time.sleep(0.08) was needed to prevent the GetResult() error when using multiprocessing's Process.

idata
Employee
1,064 Views

@Isaac I think the problem is occurring because you may not have set the graph option to block. After the script runs LoadTensor(), it calls GetResult() right away, but because LoadTensor() takes a little time to process, the result may not actually be available yet; that may be why GetResult() is giving you errors. You can read more about the Python API and setting the graph options at https://movidius.github.io/ncsdk/py_api/, https://movidius.github.io/ncsdk/py_api/GraphOption.html, and https://movidius.github.io/ncsdk/py_api/Graph.html.

idata
Employee
1,064 Views

@Tome_at_Intel I can confirm that the graphs were explicitly set to block (though I believe that is the default behaviour) in the following manner:

 

# Set graph option so that calls to LoadTensor() block until complete
graph_handles[device_index].SetGraphOption(mvnc.GraphOption.DONT_BLOCK, 0)

 

Nonetheless, the implementation has been working since switching to threading. Thank you for your help.
