Hi, I successfully ran the ncs-fullcheck example and used it to run inference on several pictures. AlexNet takes around 200 ms and GoogLeNet around 550 ms. However, when I ran the profiler from the toolkit (make example), it showed inference around 90 ms for both AlexNet and GoogLeNet. There seems to be a gap between the profiled figures and the real inference time. Does anyone know where this gap comes from (e.g., transferring the image to the stick and retrieving the result), and how do I get the same performance as profiled?
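One way to locate a gap like this is to time each stage of the pipeline separately instead of only the end-to-end call. A minimal, API-agnostic sketch (the `load` and `infer` stages below are placeholders, not part of the NCS API):

```python
from timeit import default_timer as timer

def timed(label, fn, *args):
    # Run one pipeline stage and report how long it took in milliseconds.
    t0 = timer()
    result = fn(*args)
    print('%-10s %.1f ms' % (label, (timer() - t0) * 1000.0))
    return result

# Placeholder stages; substitute the real image-load, transfer,
# inference, and result-readback calls to see where the time goes.
image = timed('load', lambda: [0.0] * 3 * 227 * 227)
probs = timed('infer', lambda img: img[:5], image)
```

Wrapping the tensor transfer and the result readback this way shows how much of the wall-clock time is USB traffic rather than on-device compute.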
Another question: the inference results differ from Caffe running the same caffemodel (using the C++ classifier). How do I get the same results as with Caffe?
Caffe: AlexNet
0.3094 - "n02124075 Egyptian cat"
0.1761 - "n02123159 tiger cat"
0.1221 - "n02123045 tabby, tabby cat"
0.1132 - "n02119022 red fox, Vulpes vulpes"
0.0421 - "n02085620 Chihuahua"
NCS
AlexNet
Egyptian cat (69.19%)
tabby, tabby cat (6.59%)
grey fox, gray fox, Urocyon cinereoargenteus (5.42%)
tiger cat (3.93%)
hare (3.52%)
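Some divergence between the two result sets is expected: the NCS computes in fp16 while Caffe on the CPU uses fp32, and any difference in preprocessing (mean subtraction, resize/crop) shifts the scores further. A small sketch of the fp16 effect alone, using made-up logits (not taken from either run):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical fp32 logits; round-trip through fp16 as the stick would store them.
logits = np.array([2.13, 1.46, 1.08, 0.97, 0.21], dtype=np.float32)
p32 = softmax(logits)
p16 = softmax(logits.astype(np.float16).astype(np.float32))
print('max probability shift: %.2e' % np.abs(p32 - p16).max())
```

Rounding a single set of logits to fp16 barely moves the probabilities, which suggests most of the gap comes from preprocessing differences and from fp16 error accumulating across many layers rather than from the final quantization step.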
Hi akey,
We found an issue with our "ncapi/tools/convert_models.sh" script: you need to add the argument "-s 12" to mvNCCompile.pyc to enable all 12 vector (SHAVE) engines. Please re-run that script to regenerate the graph files, and you should see performance similar to what you saw with "make example01".
Thank You
Ramana @ Intel
Before the change
ubuntu@ubuntu-UP:~/workspace/MvNC_SDK/ncapi/c_examples$ ./ncs-fullcheck ../networks/GoogLeNet/ ../images/512_Amplifier.jpg
OpenDevice 4 succeeded
Graph allocated
radio, wireless (46.97%) CD player (31.79%) tape player (11.16%) cassette player (6.71%) cassette (1.78%)
Inference time: 569.302185 ms, total time 575.650308 ms
radio, wireless (46.97%) CD player (31.79%) tape player (11.16%) cassette player (6.71%) cassette (1.78%)
Inference time: 556.881409 ms, total time 562.636079 ms
Deallocate graph, rc=0
Device closed, rc=0
Change
cd ../tools
vi convert_models.sh
** Add -s 12 to all the compile commands
#!/bin/sh
NCS_TOOLKIT_ROOT='../../bin'
echo $NCS_TOOLKIT_ROOT
python3 $NCS_TOOLKIT_ROOT/mvNCCompile.pyc ../networks/SqueezeNet/NetworkConfig.prototxt -w ../networks/SqueezeNet/squeezenet_v1.0.caffemodel -o ../networks/SqueezeNet/graph -s 12
python3 $NCS_TOOLKIT_ROOT/mvNCCompile.pyc ../networks/GoogLeNet/NetworkConfig.prototxt -w ../networks/GoogLeNet/bvlc_googlenet.caffemodel -o ../networks/GoogLeNet/graph -s 12
python3 $NCS_TOOLKIT_ROOT/mvNCCompile.pyc ../networks/Gender/NetworkConfig.prototxt -w ../networks/Gender/gender_net.caffemodel -o ../networks/Gender/graph -s 12
python3 $NCS_TOOLKIT_ROOT/mvNCCompile.pyc ../networks/Age/deploy_age.prototxt -w ../networks/Age/age_net.caffemodel -o ../networks/Age/graph -s 12
python3 $NCS_TOOLKIT_ROOT/mvNCCompile.pyc ../networks/AlexNet/NetworkConfig.prototxt -w ../networks/AlexNet/bvlc_alexnet.caffemodel -o ../networks/AlexNet/graph -s 12
Execute the script
./convert_models.sh
cd ../c_examples
After the change
ubuntu@ubuntu-UP:~/workspace/MvNC_SDK/ncapi/c_examples$ ./ncs-fullcheck ../networks/GoogLeNet/ ../images/512_Amplifier.jpg
OpenDevice 4 succeeded
Graph allocated
radio, wireless (46.97%) CD player (31.79%) tape player (11.16%) cassette player (6.71%) cassette (1.78%)
Inference time: 108.950851 ms, total time 115.101073 ms
radio, wireless (46.97%) CD player (31.79%) tape player (11.16%) cassette player (6.71%) cassette (1.78%)
Inference time: 88.571877 ms, total time 95.765275 ms
Deallocate graph, rc=0
Device closed, rc=0
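From the second-run times in the logs above, the `-s 12` change takes GoogLeNet from roughly 557 ms to 89 ms per inference. A quick check of the speedup and the implied device-side frame rate:

```python
# Second-run GoogLeNet inference times from the logs above, in milliseconds.
before_ms = 556.881409
after_ms = 88.571877

print('speedup:    %.1fx' % (before_ms / after_ms))   # ~6.3x
print('throughput: %.1f fps' % (1000.0 / after_ms))   # ~11.3 fps, device-side ceiling
```

The ~11 fps ceiling is inference time only; once camera capture and USB transfer are added, an end-to-end figure somewhat below it is what one would expect.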
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Much faster now. Continuous inference speed from a webcam is about 9.5 FPS for GoogLeNet. Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@akey can you tell me how you calculate the FPS for GoogLeNet, please?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@ibrahimsoliman in Python you can use:
from timeit import default_timer as timer
time_start = timer()
# CODE — the inference call you want to time
time_end = timer()
print('FPS: %.2f fps' % (1.0 / (time_end - time_start)))
Note that default_timer returns seconds, so FPS is 1/elapsed, not 1000/elapsed.
One thing I don't get, though, about NCS speed is why it does not run at the full 100 GOPS as advertised. For example, in the SqueezeNet profile below (and in all the other networks) we can see:
- The MFLOPs estimate is 2x the actual op count. Is that because of fp16?
- The MFLOPs are processed at roughly 1/3 of the 100 GOPS peak, and the ratio varies from 1/4 to 1/2 depending on the tensor and convolution type.
Detailed Per Layer Profile
Layer  Name               MFLOPs  Bandwidth (MB/s)  Time (ms)
…
25     fire9/squeeze1x1   12.845            587.19       0.43
26     fire9/expand1x1     6.423            150.65       0.37
27     fire9/expand3x3    57.803            318.67       1.57
28     conv10            200.704            272.92       4.28
29     pool10              0.392            722.59       0.52
30     prob                0.003             10.49       0.18
Total inference time                                    26.89
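To put numbers on that ratio: effective throughput per layer is just MFLOPs divided by time, since MFLOPs per ms equals GFLOPS. Using the convolution rows from the profile above:

```python
# (name, MFLOPs, time_ms) taken from the per-layer profile rows above.
layers = [
    ('fire9/squeeze1x1',  12.845, 0.43),
    ('fire9/expand1x1',    6.423, 0.37),
    ('fire9/expand3x3',   57.803, 1.57),
    ('conv10',           200.704, 4.28),
]

for name, mflops, ms in layers:
    gflops = mflops / ms  # MFLOPs per millisecond equals GFLOPS
    print('%-18s %5.1f GFLOPS (%2.0f%% of a 100 GOPS peak)' % (name, gflops, gflops))
```

The larger convolutions sit near 30–47 GFLOPS while tiny layers fall well below that, consistent with the observation that small layers pay a fixed per-layer overhead and never approach the advertised peak.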