NCSDK2 ncFifoReadElem too slow

idata · ‎01-29-2019

I am currently using an NCS1 with NCSDK2. I found that the largest time cost lies in the function that reads the final result, ncFifoReadElem(bufferOut...).

So I traced the call chain: ncFifoReadElem -> XLinkReadData -> dispatcherWaitEventComplete -> sem_wait. It appears to be waiting for a thread to return, but I don't know anything more about that thread because I couldn't bear to dig any deeper into the code.

Is that thread, as I guess, handling the computation inside the stick, or something else? Going by the name ncFifoReadElem, I instinctively thought it just reads the final output over USB. But that would be odd, because the time cost of ncFifoReadElem is much bigger than that of ncFifoWriteElem, given the relative sizes of the input image and the output result.

I would really appreciate it if someone could help me with this issue. So far I can only achieve 4 FPS at most with YOLOv2-tiny.

idata · ‎01-29-2019

Hi @BenjaminLiu

What command are you using to compile your graph file? Compiling with the -s 12 option lets the NCS device use all 12 SHAVE processors, which should speed up inference. Tiny Yolo v1 & v2 both require some post-processing that is done on the CPU. They also have larger input sizes than most of the other models in the NCAPPZOO, which is why they can take a little longer to compute the inference result.

I hope this information was helpful.

Best Regards,

Sahira

idata · ‎01-30-2019

Hi @Sahira_at_Intel

Thanks for your reply. I did use mvNCCompile $yolocfgcaffe -w $yoloweightcaffe -s 12 as you suggested, and the time cost does not include any post-processing. I do understand that the input size for Tiny Yolo v2 is large. I just got confused by the name of the function ncFifoReadElem, which is called after ncGraphQueueInference, because it sounds like ncFifoReadElem only reads the results over USB.

Is ncFifoReadElem actually a blocking function that waits until the inference completes? If so, is the time cost really the inference time inside the Movidius stick, which is bounded by the network size and computation speed and which I cannot change?

Thank you!

idata · ‎01-31-2019

Hi @BenjaminLiu,

Yes, ncFifoReadElem() is a blocking function. It blocks and waits for the NCS to finish processing the inference so that it can retrieve the result. When ncGraphQueueInference() is called, one input tensor is removed from the input FIFO and sent to the NCS.
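To make the blocking behaviour concrete, here is a minimal simulation in plain Python. It does not use the NCSDK at all: the class FakeDevice, its methods, and the 50 ms latency are hypothetical stand-ins, with two queues playing the input and output FIFOs and a worker thread playing the stick. It only illustrates why nearly all of the elapsed time shows up in the read call.

```python
import queue
import threading
import time

# Hypothetical stand-ins, NOT the NCSDK API: two queues play the role of the
# input and output FIFOs, and a worker thread plays the role of the NCS.
INFERENCE_SECONDS = 0.05  # made-up inference latency for the simulation


class FakeDevice:
    def __init__(self):
        self.fifo_in = queue.Queue()
        self.fifo_out = queue.Queue()
        self._pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def write_elem(self, tensor):
        # Only buffers the input; no inference starts here.
        self.fifo_in.put(tensor)

    def queue_inference(self):
        # Takes one tensor off the input FIFO and hands it to the "device".
        self._pending.put(self.fifo_in.get())

    def read_elem(self):
        # Blocks until the worker thread has produced a result.
        return self.fifo_out.get()

    def _run(self):
        while True:
            tensor = self._pending.get()
            time.sleep(INFERENCE_SECONDS)  # simulated on-device compute
            self.fifo_out.put(("result for", tensor))


def measure():
    """Time each of the three calls for a single simulated frame."""
    dev = FakeDevice()
    t0 = time.perf_counter()
    dev.write_elem("frame-0")
    t1 = time.perf_counter()
    dev.queue_inference()
    t2 = time.perf_counter()
    result = dev.read_elem()
    t3 = time.perf_counter()
    return {"write": t1 - t0, "queue": t2 - t1, "read": t3 - t2, "result": result}


if __name__ == "__main__":
    print(measure())
```

In this toy model the write and queue calls return almost immediately, while the read call absorbs the simulated inference latency, which matches the pattern described in the question.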

Tiny Yolo v1 & v2 are slow because their native input size is large compared to other networks (448x448 for v1 and 416x416 for v2, compared to SSD MobileNet Caffe, which is 300x300). These networks were designed with a higher overall resolution for better accuracy, at the cost of lower speed.
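As a rough feel for the size difference, the short calculation below compares pixel counts, taking Tiny Yolo v1 as 448x448. Pixel count is only a crude proxy and an assumption of this sketch; actual inference time also depends on the layers in each network.

```python
# Input resolutions as (width, height); Tiny YOLO v1 is taken as 448x448.
input_sizes = {
    "Tiny YOLO v1": (448, 448),
    "Tiny YOLO v2": (416, 416),
    "SSD MobileNet (Caffe)": (300, 300),
}


def pixel_count(size):
    width, height = size
    return width * height


# Express every input size relative to the 300x300 SSD MobileNet input.
baseline = pixel_count(input_sizes["SSD MobileNet (Caffe)"])
relative = {name: pixel_count(size) / baseline for name, size in input_sizes.items()}

if __name__ == "__main__":
    for name, ratio in relative.items():
        pixels = pixel_count(input_sizes[name])
        print(f"{name}: {pixels} pixels, {ratio:.2f}x the 300x300 input")
```

By this measure the Tiny Yolo inputs carry roughly 1.9x (v2) and 2.2x (v1) as many pixels as the 300x300 input.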

Also, if you want to time the inference being done on the NCS a little more accurately, time the ncGraphQueueInference() and ncFifoReadElem() calls together. ncFifoWriteElem() doesn't actually trigger any of the inference processing on the NCS device. Keep in mind that some USB transfer time has to be taken into account, but that only takes up a very small portion of the total inference time. Looking at our old Yolo reports, we were getting about 5-6 fps with both Tiny Yolo v1 & v2, while SSD MobileNet was much faster at 10-12 fps.
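One way to set up that measurement is sketched below. The stopwatch helper, the fps function, and the three fake_* functions are hypothetical names made up for this example; the fake functions just sleep, and you would replace them with the real SDK calls in your own application.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(timings, label):
    """Accumulate wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


def fps(seconds_per_frame):
    """Convert a per-frame latency in seconds to frames per second."""
    return 1.0 / seconds_per_frame


# Hypothetical stand-ins for the three SDK calls; swap in the real ones.
def fake_write_elem():
    time.sleep(0.001)


def fake_queue_inference():
    time.sleep(0.001)


def fake_read_elem():
    time.sleep(0.02)  # blocking read: includes the simulated inference


def time_one_frame():
    timings = {}
    with stopwatch(timings, "write"):
        fake_write_elem()
    # Queueing and the blocking read are timed together as the inference cost.
    with stopwatch(timings, "inference"):
        fake_queue_inference()
        fake_read_elem()
    return timings


if __name__ == "__main__":
    t = time_one_frame()
    print(t, "approx FPS:", fps(t["write"] + t["inference"]))
```

For reference, fps(0.25) gives 4.0, so the 4 FPS mentioned at the top of the thread corresponds to about 250 ms per frame.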

Please let me know if this was helpful.

Best Regards,

Sahira