Intel® Distribution of OpenVINO™ Toolkit

Inference Engine's classification sample batch performance

RSun9 — Wed, 07 Jun 2017 22:00:01 GMT

Hi,

I'd like to ask a question about the classification sample of the Inference Engine. With FP16 precision and batch processing, the average running time looks much worse than expected. Why is that?

So far I have only tested with AlexNet. I first ran the Model Optimizer (MO) to set the precision and batch size, and then ran the classification sample with the XML and weights files generated by MO. See below for details.

Test 1, precision = fp32, batch size = 1.

ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP32 -d ./deploy_alexnet.prototxt -f 1 -b 1 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 1
Precision: FP32
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin

./classification_sample -i ice-creams-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
   API version ............ 1.0
   Build .................. 2778
****
   API version ............ 0.1
   Build .................. manual-01121
   Description ....... clDNNPlugin
Average running time of one iteration: 14 ms

Test 2, precision = fp16, batch size = 1.

ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP16 -d ./deploy_alexnet.prototxt -f 1 -b 1 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 1
Precision: FP16
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin

Test 3, precision = fp32, batch size = 8.

ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP32 -d ./deploy_alexnet.prototxt -f 1 -b 8 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 8
Precision: FP32
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin

./classification_sample -i ice-creams-227x227.bmp -i tiger-eyes-227x227.bmp -i cat.bmp -i tiger-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
   API version ............ 1.0
   Build .................. 2778
****
   API version ............ 0.1
   Build .................. manual-01121
   Description ....... clDNNPlugin
Average running time of one iteration: 52 ms

Test 4, precision = fp16, batch size = 8.

ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP16 -d ./deploy_alexnet.prototxt -f 1 -b 8 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 8
Precision: FP16
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin

The classification top-10 results were consistent and seemingly correct, so I snipped them for clarity. At FP32 precision, the average running time looked normal. At FP16, however, the average running time for batch size 8 was 40 times that for batch size 1, which is much worse than running batch size 1 eight times.

I repeated my tests with different batch sizes and saw a similar performance trend. For FP16, the batch performance looked really bad.

I can think of three possible causes for this behavior:

1. I was doing something wrong.

2. There was something wrong with the classification sample.

3. There was something wrong with the Inference Engine.

Could you look into this issue?

Thanks,

-Robby

Jeffrey_M_Intel1 — Thu, 08 Jun 2017 01:13:00 GMT

Hi Robby,

I've replicated the higher-than-expected execution time for AlexNet classification with FP16 at batch size 8, and we are investigating.

However, with a few more data points you should be able to see that, even with the strange behavior for that single combination, the expected general pattern is there:

Higher batch sizes provide better performance.
Lower batch sizes allow you to trade performance for lower latency.
FP16 on GPU gives roughly 2x the performance of FP32.

For reference, I gathered a snapshot (NOT an official benchmark!) of AlexNet classification rates by running the classification sample on my test machine, which has the CVSDK beta installed. It has an i7-6770HQ processor with Iris Pro Graphics 580.

Do you see similar patterns if you test more batch sizes?

GPU FP16
   Batch  Avg ms  ms/img  imgs/sec
       1      42    42.0      23.8
       2      42    21.0      47.6
       4      80    20.0      50.0
       8     144    18.0      55.6
      16      60     3.8     266.7
      32      65     2.0     492.3
      64     105     1.6     609.5
     128     203     1.6     630.5

GPU FP32
   Batch  Avg ms  ms/img  imgs/sec
       1      18    18.0      55.6
       2      18     9.0     111.1
       4      25     6.3     160.0
       8      32     4.0     250.0
      16      51     3.2     313.7
      32     103     3.2     310.7
      64     198     3.1     323.2
     128     397     3.1     322.4

CPU FP32
   Batch  Avg ms  ms/img  imgs/sec
       1      29    29.0      34.5
       2      28    14.0      71.4
       4      37     9.3     108.1
       8      60     7.5     133.3
      16     101     6.3     158.4
      32     167     5.2     191.6
      64     350     5.5     182.9
     128     693     5.4     184.7
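
The ms/img and imgs/sec columns follow directly from the batch size and the average time per iteration. Here is a minimal Python sketch of that arithmetic, assuming one iteration processes exactly one full batch (the function name `per_image_stats` is only for illustration and is not part of the sample):

```python
def per_image_stats(batch_size, avg_ms):
    """Return (ms per image, images per second) for one batched iteration.

    Assumes avg_ms is the average time of one iteration and that each
    iteration processes exactly batch_size images.
    """
    ms_per_image = avg_ms / batch_size
    images_per_sec = 1000.0 * batch_size / avg_ms
    return ms_per_image, images_per_sec


# A few FP16 rows from the GPU table above: (batch size, average ms per iteration)
for batch_size, avg_ms in [(1, 42), (8, 144), (16, 60), (128, 203)]:
    ms_img, img_s = per_image_stats(batch_size, avg_ms)
    print(f"{batch_size:>5} {avg_ms:>7} {ms_img:>7.1f} {img_s:>9.1f}")
```

Latency per image and throughput are two views of the same measurement, so either column is enough to compare batch sizes.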

RSun9 — Thu, 08 Jun 2017 17:49:52 GMT

Hi Jeffrey, thanks for the confirmation.

I have only tested a few other batch sizes. In my tests, FP16 seemed to underperform FP32 at other batch sizes too, but my data points were limited. I'll see if I have time to run more tests.
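
One way to make the FP16-versus-FP32 comparison concrete is to divide the FP32 time by the FP16 time at each batch size; a ratio above 1 means FP16 is faster. A rough sketch using the GPU averages Jeffrey posted above (the helper name `fp16_speedup` is made up for this example):

```python
# Average ms per iteration on GPU, keyed by batch size (Jeffrey's snapshot above).
FP16_MS = {1: 42, 2: 42, 4: 80, 8: 144, 16: 60, 32: 65, 64: 105, 128: 203}
FP32_MS = {1: 18, 2: 18, 4: 25, 8: 32, 16: 51, 32: 103, 64: 198, 128: 397}


def fp16_speedup(fp16_ms, fp32_ms):
    """Map batch size -> FP32 time / FP16 time (above 1.0 means FP16 is faster)."""
    return {b: fp32_ms[b] / fp16_ms[b] for b in sorted(fp16_ms) if b in fp32_ms}


for batch_size, ratio in fp16_speedup(FP16_MS, FP32_MS).items():
    print(f"batch {batch_size:>3}: FP16 is {ratio:.2f}x the speed of FP32")
```

With those numbers the ratio stays below 1 through batch size 16, is about 1.6 at batch size 32, and is about 1.9 at batch sizes 64 and 128.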

My test platform has a Core i7-6700 (3.4 GHz) with integrated HD Graphics 530.

-Robby

RSun9 — Fri, 09 Jun 2017 00:50:00 GMT

I managed to run tests at the same data points. I'll just post my results, and let you draw the conclusion ;-)

Again, my test platform has a Core i7-6700 (3.4GHz) with an integrated HD Graphics 530.

GPU, FP16

batch-size  average-ms  ms/image  images/sec
         1           9       9.0       111.1
         2          84      42.0        23.8
         4         163      40.8        24.5
         8         322      40.3        24.8
        16         118       7.4       135.6
        32         126       3.9       254.0
        64         217       3.4       294.9
       128         431       3.4       297.0

GPU, FP32

batch-size  average-ms  ms/image  images/sec
         1          15      15.0        66.7
         2          24      12.0        83.3
         4          41      10.3        97.6
         8          52       6.5       153.8
        16          93       5.8       172.0
        32         178       5.6       179.8
        64         351       5.5       182.3
       128         700       5.5       182.9

CPU, FP32

batch-size  average-ms  ms/image  images/sec
         1          13      13.0        76.9
         2          28      14.0        71.4
         4          35       8.8       114.3
         8          50       6.3       160.0
        16          82       5.1       195.1
        32         143       4.5       223.8
        64         260       4.1       246.2
       128         506       4.0       253.0
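
A quick way to tell whether batching helps is to compare one batched iteration against running the batch-size-1 case that many times. The sketch below does this for the batch-size-8 rows above; a value below 1 means the batch is slower than processing the images one by one (`batching_gain` is an illustrative name, not part of the sample):

```python
def batching_gain(single_ms, batch_size, batch_ms):
    """Ratio of batch_size single-image runs to one batched run.

    Above 1.0: batching is faster. Below 1.0: batching is slower than
    processing the images one at a time.
    """
    return (single_ms * batch_size) / batch_ms


# (label, batch-1 average ms, batch-8 average ms) taken from the tables above
rows = [("GPU FP16", 9, 322), ("GPU FP32", 15, 52), ("CPU FP32", 13, 50)]
for label, single_ms, batch8_ms in rows:
    gain = batching_gain(single_ms, 8, batch8_ms)
    print(f"{label}: {gain:.2f}x")
```

For GPU FP16 this comes out to about 0.22, while both FP32 cases come out above 2.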

-Robby

Jeffrey_M_Intel1 — Mon, 12 Jun 2017 19:51:23 GMT

Thanks for the additional data. Fortunately, the dev team has already recognized this behavior from internal testing.

So, going back to your original post, the issue is in the Inference Engine (your option 3 above), but we should expect improvements in the next release.

Will it work for you to proceed with the current performance limitations until the next release is available?

RSun9 — Mon, 12 Jun 2017 20:14:42 GMT

Hi Jeffrey, thanks for the update. I am glad your team was able to find the real cause so fast.

I can work with the current version for now, and wait for the next release.

-Robby