So I compiled a Caffe network with mvNCCompile, set the number of SHAVE cores to 12, and tested the network running on the NCS in some Python code. I timed it with time.clock() and was getting inference times on the order of 150 ms/image. However, when I run the mvNCProfile tool on the same network, also with "-s 12", it tells me that inference is taking about 1280 ms. I don't understand how there could be an order-of-magnitude difference between my Python tests and the results from mvNCProfile.
The Python code is something like this:
for test_image in inference_images:
    try:
        t0 = time.clock()
        graph.LoadTensor(test_image.astype(np.float16), 'user object')
    except:
        print("Error Loading Tensor into Movidius Graph")
    try:
        output, userobj = graph.GetResult()
        t1 = time.clock()
    except:
        print("Error getting result from NCS")
    times.append(t1 - t0)
Printing times and the average of times yields:
[0.1364160000000001, 0.14857699999999996, 0.15344800000000003, 0.14042500000000002, 0.15225599999999995, 0.16213199999999972, 0.1494970000000002, 0.16063800000000006, 0.14944800000000003, 0.1511720000000003]
0.15040090000000003
According to the documentation these values are seconds represented as floats, so I'm right around 150 ms/image.
However, here is what I see from mvNCProfile:
mvNCProfile v02.00, Copyright @ Movidius Ltd 2016
/usr/local/bin/ncsdk/Controllers/FileIO.py:52: UserWarning: You are using a large type. Consider reducing your data sizes for best performance
"Consider reducing your data sizes for best performance\033[0m")
USB: Transferring Data...
Time to Execute : 1358.95 ms
USB: Myriad Execution Finished
Time to Execute : 1286.72 ms
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Network Summary
Detailed Per Layer Profile
#    Name                 MFLOPs    Bandwidth (MB/s)    time (ms)
===============================================================================
0    data                 0.0       102011.9            0.005
1 conv1_1 311.0 298.2 15.556
2 conv1_2 6635.5 668.4 148.043
3 pool1 5.8 956.9 11.481
4 conv2_1 3317.8 419.5 59.264
5 conv2_2 6635.5 482.0 103.162
6 pool2 2.9 944.1 5.819
7 conv3_1 3317.8 168.7 76.599
8 conv3_2 6635.5 176.2 146.712
.
.
.
71 conv9_2_mbox_priorbox 0.0 20.4 0.024
72 mbox_conf_reshape 0.0 171.0 0.390
73 mbox_conf_softmax 0.1 526.9 0.126
74 mbox_conf_flatten 0.0 374.0 0.178
75 detection_out 0.0 38.3 1.738
Total inference time 1281.14
Generating Profile Report 'output_report.html'...
Can anyone give me some insight as to why using the Profile tool yields an inference time about 10x larger than what I see when I actually test the function calls in Python?
One thought I had was data type - what data type does mvNCProfile use? FP16, FP32, or FP64? Is there any way to change this and test inference times with different data types?
I'm using FP16 in my Python code, if that's relevant.
Thanks!
@sgiardi3 The time that the mvNCProfile tool reports is strictly the time spent processing on the chip. It should typically be less than the time for LoadTensor() + GetResult(). The issue you are seeing could be due to a bug. I would like to reproduce this issue on my bench if you can provide a link to your network. mvNCProfile uses FP16 data. Also, by the way, in your code the LoadTensor() line is actually timing the numpy FP16 conversion as well; if you move the FP16 conversion outside of the timed code, you may get an even lower result. Thanks for reporting this. I hope you can provide a network for testing purposes.
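Something like this, as a rough sketch (keeping your time.clock() calls and variable names), is what I mean:
# Do the numpy FP16 conversion outside the timed region so that
# only the NCS calls (LoadTensor + GetResult) are measured
fp16_image = test_image.astype(np.float16)
t0 = time.clock()
graph.LoadTensor(fp16_image, 'user object')
output, userobj = graph.GetResult()
t1 = time.clock()
times.append(t1 - t0)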
@Tome_at_Intel Thanks for the response! It seems as though I am having this issue with the SSD networks/weights available on GitHub as well as the ones we fine-tuned. For testing you can download the Pascal VOC model from the SSD GitHub repo:
https://github.com/weiliu89/caffe/tree/ssd
If you scroll down to the bottom, I just verified this issue on the 07+12+COCO: SSD300* model/weights.
This was the Download Link
All I did was delete the following lines from deploy.prototxt to drop the dependency on the label map:
save_output_param {
  label_map_file: "data/VOC0712/labelmap_voc.prototxt"
}
I then compiled with:
mvNCCompile deploy.prototxt -w VGG_VOC0712_SSD_300x300_ft_iter_120000.caffemodel -s 12 -in conv1_1 -on detection_out -o mov_graph_coco
Profiled with:
mvNCProfile deploy.prototxt -w VGG_VOC0712_SSD_300x300_ft_iter_120000.caffemodel -s 12 -in conv1_1 -on detection_out
Profile Result:
mvNCProfile v02.00, Copyright @ Movidius Ltd 2016
/usr/local/bin/ncsdk/Controllers/FileIO.py:52: UserWarning: You are using a large type. Consider reducing your data sizes for best performance
"Consider reducing your data sizes for best performance\033[0m")
USB: Transferring Data...
Time to Execute : 1412.68 ms
USB: Myriad Execution Finished
Time to Execute : 1336.27 ms
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Network Summary
Detailed Per Layer Profile
#    Name                 MFLOPs    Bandwidth (MB/s)    time (ms)
===============================================================================
0 data 0.0 98093.3 0.005
1 conv1_1 311.0 298.0 15.569
2 conv1_2 6635.5 668.3 148.056
3 pool1 5.8 956.8 11.483
4 conv2_1 3317.8 420.0 59.199
5 conv2_2 6635.5 482.3 103.094
6 pool2 2.9 945.4 5.811
7 conv3_1 3317.8 169.4 76.295
.
.
.
69 conv9_2_mbox_conf_perm 0.0 7.0 0.023
70 conv9_2_mbox_conf_flat 0.0 22.4 0.014
71 conv9_2_mbox_priorbox 0.0 4.6 0.107
72 mbox_conf_reshape 0.0 152.7 2.290
73 mbox_conf_softmax 0.6 600.7 0.582
74 mbox_conf_flatten 0.0 196.3 1.782
75 detection_out 0.0 19.3 3.468
-------------------------------------------------------------------------------
Total inference time 1330.09
-------------------------------------------------------------------------------
Python Testing code:
import mvnc.mvncapi as fx
import os
import cv2
import numpy as np
import time

fx.SetGlobalOption(fx.GlobalOption.LOG_LEVEL, 2)

dim = (300, 300)

ncs_names = fx.EnumerateDevices()
if len(ncs_names) < 1:
    print('Error: No NCS devices detected!')
    quit()
ncs_device = fx.Device(ncs_names[0])

network_graph = #Path to "mov_graph_coco"
test_image_path = #Path to test images

inference_images = []
for image_name in os.listdir(test_image_path):
    image = cv2.imread(os.path.join(test_image_path, image_name))
    image = cv2.resize(image, dim)
    image = image.astype(np.float32)
    inference_images.append(image)
    print('Image')

try:
    ncs_device.OpenDevice()
except:
    print('Error: NCS Device could not be opened')
    quit()
print('Device Opened Normally!')

with open(network_graph, mode='rb') as f:
    blob = f.read()
graph = ncs_device.AllocateGraph(blob)

img_num = 0
times = []
for test_image in inference_images:
    try:
        t0 = time.clock()
        graph.LoadTensor(test_image.astype(np.float16), 'user object')
    except:
        print("Error Loading Tensor into Movidius Graph")
    try:
        output, userobj = graph.GetResult()
        t1 = time.clock()
    except:
        print("Error getting result from NCS")
    times.append(t1 - t0)
    try:
        detection = []
        detections = []
        for i in range(len(output)):
            detection.append(output[i])
            if i > 0 and i % 7 == 0:
                print("Detection ", int(i/7), " for Image: ", img_num, "-------------------")
                print(detection)
                detections.append(detection)
                detection = []
    except:
        print("Detection Processing Error")

try:
    ncs_device.CloseDevice()
except:
    print('Error: Could not close NCS Device!')
    quit()
print('Closed NCS Device! Everything worked properly')

print(times)
print(np.average(times))
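For reference, the grouping-by-7 in the parsing loop above assumes each detection comes back as 7 floats (image_id, label, confidence, xmin, ymin, xmax, ymax); if that assumption holds, an equivalent numpy-based unpacking would just be a reshape:
# Sketch: unpack the output assuming 7 floats per detection,
# in the order (image_id, label, confidence, xmin, ymin, xmax, ymax)
num_detections = len(output) // 7
detections = np.reshape(output[:num_detections * 7], (num_detections, 7))
for image_id, label, confidence, xmin, ymin, xmax, ymax in detections:
    print(int(label), confidence, (xmin, ymin, xmax, ymax))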
Python Script Results:
[0.16148699999999971, 0.15405300000000022, 0.15758600000000023, 0.1580459999999997, 0.16678099999999985, 0.16459499999999982, 0.16467399999999977, 0.170118, 0.15936899999999987, 0.16317100000000018]
0.16198799999999994
And again the mvNCProfile tool is returning an inference time almost 10x larger than my Python test.
Thanks for your help!
@sgiardi3 Thanks for providing the network and the thorough instructions. We were able to reproduce the issue and are investigating. Thank you for your patience.
@Tome_at_Intel Just wanted to check in and see if there has been any update on this.
@sgiardi3 Sorry, no update on this particular issue yet. As soon as I find something out, I will let you know. Thanks.
I'm having the same issue. The profile tool reports 540 ms, but I'm measuring 45 ms in Python (and that includes the time to send and receive the data from the compute stick).
An update on this would be very helpful.
@sgiardi3 - try time.time(). My results match the profiler now.
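That would explain the gap: on Linux, time.clock() measures CPU time of the Python process, so the time spent blocked waiting on the USB transfer and the Myriad barely counts, while time.time() (or time.perf_counter()) measures wall-clock time. A minimal sketch of the corrected timing, keeping the same NCSDK calls as the code above:
# Wall-clock timing around the NCS calls; time.clock() only counts process CPU
# time on Linux, so it misses the time the process spends blocked on the NCS.
fp16_image = test_image.astype(np.float16)  # conversion kept outside the timed region
t0 = time.time()  # or time.perf_counter()
graph.LoadTensor(fp16_image, 'user object')
output, userobj = graph.GetResult()
t1 = time.time()
times.append(t1 - t0)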
