Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Queueing multiple input tensors

idata
Employee

The TensorFlow implementation of our network can process (inference only) over 10,000 images in ~1.5 seconds running on an NVidia GPU. On my MacBook, the same implementation can do it in ~45 seconds.

 

I am currently in the process of benchmarking the performance of the same graph after being ingested into the NCSDK and running on an Up Board with an AI Core Neural Accelerator. The performance is significantly worse, taking about 5 minutes to process 10,000 images. As far as I can tell, the bottleneck seems to be related to how many images can be queued for processing simultaneously.

 

In TensorFlow it is possible to queue multiple inputs. In our network we queue batches of images via:

 

    with tf.Session() as sess:
        ...
        sess.run(output, feed_dict={input: input_tensor[0:batch_size]})
        ...

 

In NCSDK, this does not seem to be possible. In the documentation for TensorDescriptor (https://movidius.github.io/ncsdk/ncapi/ncapi2/py_api/TensorDescriptor.html) there is indeed a field for the number of tensors, i.e. the batch size:

 

n (int): The number of tensors in the batch.

 

and Fifo.allocate() takes an n_elem argument for the number of elements that the Fifo can contain. (I am unclear on the relationship between these two fields; can anybody clarify?)
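
For concreteness, here is how I currently read the relationship (a sketch only; device and input_desc are assumed to exist, with input_desc standing in for the graph's input TensorDescriptor):

    from mvnc import mvncapi

    # 'input_desc.n' would be the batch size of a single FIFO element,
    # while 'n_elem' (the last argument to allocate()) is how many such
    # elements the FIFO can hold, i.e. how many inputs can be queued.
    input_fifo = mvncapi.Fifo('input', mvncapi.FifoType.HOST_WO)
    input_fifo.allocate(device, input_desc, 10)  # n_elem = 10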

 

However, in the documentation for the GraphOption class (https://movidius.github.io/ncsdk/ncapi/ncapi2/py_api/GraphOption.html) the RO_INPUT_COUNT and RO_OUTPUT_COUNT descriptions state that only one input and output are currently supported.

 

Can someone confirm or deny that it is only possible to feed one tensor at a time? Can I simply set the Graph's TensorDescriptors to contain more tensors? This bottleneck currently makes the Up Board and NCSDK unusable so I am hoping I am just missing/misunderstanding something in the documentation.

 

     

  • Matt

idata
Employee

@MWright You can only process one inference at a time with the NCSDK; however, with NCSDK v2.04.xx.xx you can queue up multiple inferences with the fifo write_elem() call or the queue_inference_with_fifo_elem() call. You can then get the results using the fifo read_elem() API call.
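
For example, a minimal sketch of that flow with the mvncapi 2.x Python API (the graph file name, input_tensors, and any preprocessing are placeholders):

    from mvnc import mvncapi

    # Open the first NCS device and load a compiled graph file.
    device = mvncapi.Device(mvncapi.enumerate_devices()[0])
    device.open()

    with open('our_network.graph', 'rb') as f:   # placeholder graph file name
        graph_buffer = f.read()

    graph = mvncapi.Graph('our_network')
    input_fifo, output_fifo = graph.allocate_with_fifos(device, graph_buffer)

    # Keep a couple of inferences in flight; the number queued ahead of the
    # reads has to fit within the FIFOs' capacity (see the capacity note below).
    in_flight = min(2, len(input_tensors))
    results = []
    for tensor in input_tensors[:in_flight]:     # preprocessed float32 images
        graph.queue_inference_with_fifo_elem(input_fifo, output_fifo, tensor, None)
    for tensor in input_tensors[in_flight:]:
        output, _ = output_fifo.read_elem()      # blocks until a result is ready
        results.append(output)
        graph.queue_inference_with_fifo_elem(input_fifo, output_fifo, tensor, None)
    for _ in range(in_flight):
        output, _ = output_fifo.read_elem()
        results.append(output)

    input_fifo.destroy()
    output_fifo.destroy()
    graph.destroy()
    device.close()
    device.destroy()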

 

As for the tensor descriptor option you mentioned, it is used to determine the maximum capacity of the fifo itself. That capacity can be set through the Fifo allocate() call (see the example code at the bottom of that page) or the Graph allocate_with_fifos() call. These options will increase the number of items you can queue up for inference, although inference processing is still limited to one at a time, and there will be limits based on available memory on the device.
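
Roughly, something like this raises the queue depth at allocation time (the keyword names follow the Python allocate_with_fifos() documentation; treat the exact spellings as assumptions):

    # Larger FIFOs let more inferences sit in the queue at once; the device
    # still processes them one at a time.
    input_fifo, output_fifo = graph.allocate_with_fifos(
        device, graph_buffer,
        input_fifo_num_elem=100,    # input FIFO capacity (n_elem)
        output_fifo_num_elem=100)   # output FIFO capacity (n_elem)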

 

If you are using threads, you can queue up inferences with one thread and read results with another; that way you won't need a large fifo capacity to achieve higher throughput. Here is an example that uses threading and multiple NCS devices.
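
A rough sketch of that producer/consumer pattern with a single device (setup as in the earlier snippet; this is not the linked example itself):

    import threading

    def writer(graph, input_fifo, output_fifo, tensors):
        # Queue inferences; this blocks when the input FIFO is full.
        for tensor in tensors:
            graph.queue_inference_with_fifo_elem(input_fifo, output_fifo, tensor, None)

    def reader(output_fifo, count, results):
        # Drain results; read_elem() blocks until an output is available.
        for _ in range(count):
            output, _ = output_fifo.read_elem()
            results.append(output)

    results = []
    t_write = threading.Thread(target=writer,
                               args=(graph, input_fifo, output_fifo, input_tensors))
    t_read = threading.Thread(target=reader,
                              args=(output_fifo, len(input_tensors), results))
    t_write.start(); t_read.start()
    t_write.join(); t_read.join()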

 

Batch inference, where multiple inferences are performed at the same time, is not supported at the moment with the current NCSDK (2.04.00.06).

idata
Employee

Thanks for following up. I did try using write_elem(), queue_inference(), and read_elem(), but I needed one call to queue_inference() for each call to write_elem(), which seemed to defeat the purpose: it is still serialized and takes the same amount of time; I'm just making the API calls myself. I am not clear on why I need to queue the inference multiple times, since the Fifo I'm queueing already contains all the tensors; I just want to queue the whole Fifo at once. Threading would help to write and read elements more quickly, but the bottleneck is still there.
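
For reference, the loop I ended up with looks roughly like this (simplified; preprocessing and setup omitted):

    # One queue_inference() per write_elem() -- still one inference at a time.
    # I had hoped a single queue_inference() could consume everything that
    # was already written into the input Fifo.
    for tensor in input_tensors:
        input_fifo.write_elem(tensor, None)
        graph.queue_inference(input_fifo, output_fifo)
        output, _ = output_fifo.read_elem()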

 

In the documentation you linked it says, regarding the number of tensors in the batch: _Only 1 currently supported._ What is the timeline on implementing more than one tensor per batch?

 

     

  • Matt

idata
Employee

@MWright I understand what you're saying, and you're correct about the bottleneck for the workload/benchmark procedure you have in mind. We don't have a timetable for batch inference processing right now.

idata
Employee

Gotcha, thanks for the follow-up.
