Hi,
I'd like to ask a question about the Inference Engine classification sample. With FP16 precision and batch processing, the average running time looks surprisingly bad. Why is that?
So far I have only tested AlexNet. I first ran the Model Optimizer (MO) to set the precision and batch size, and then ran the classification sample with the XML and weights generated by MO. See below for details.
Test 1, precision = fp32, batch size = 1.
ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP32 -d ./deploy_alexnet.prototxt -f 1 -b 1 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 1
Precision: FP32
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin
./classification_sample -i ice-creams-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
API version ............ 1.0
Build .................. 2778
****
API version ............ 0.1
Build .................. manual-01121
Description ....... clDNNPlugin
Average running time of one iteration: 14 ms
Test 2, precision = fp16, batch size = 1.
ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP16 -d ./deploy_alexnet.prototxt -f 1 -b 1 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 1
Precision: FP16
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin
./classification_sample -i ice-creams-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
API version ............ 1.0
Build .................. 2778
****
API version ............ 0.1
Build .................. manual-01121
Description ....... clDNNPlugin
Average running time of one iteration: 9 ms
Test 3, precision = fp32, batch size = 8.
ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP32 -d ./deploy_alexnet.prototxt -f 1 -b 8 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 8
Precision: FP32
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin
./classification_sample -i ice-creams-227x227.bmp -i tiger-eyes-227x227.bmp -i cat.bmp -i tiger-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
API version ............ 1.0
Build .................. 2778
****
API version ............ 0.1
Build .................. manual-01121
Description ....... clDNNPlugin
Average running time of one iteration: 52 ms
Test 4, precision = fp16, batch size = 8.
ModelOptimizer -w ./bvlc_alexnet.caffemodel -p FP16 -d ./deploy_alexnet.prototxt -f 1 -b 8 --target APLK -i
Start working...
Framework plugin: CAFFE
Target type: APLK
Network type: CLASSIFICATION
Batch size: 8
Precision: FP16
Layer fusion: true
Output directory: Artifacts
Custom kernels directory:
Network input normalization: 1
Writing binary data to: Artifacts/AlexNet/AlexNet.bin
./classification_sample -i ice-creams-227x227.bmp -i tiger-eyes-227x227.bmp -i cat.bmp -i tiger-227x227.bmp -m ./Artifacts/AlexNet/AlexNet.xml -l ./synset_words.txt -d GPU
InferenceEngine:
API version ............ 1.0
Build .................. 2778
****
API version ............ 0.1
Build .................. manual-01121
Description ....... clDNNPlugin
Average running time of one iteration: 321 ms
The top-10 classification results were consistent and seemingly correct, so I snipped them for clarity. At FP32, the average running time looked normal. However, at FP16, the average running time for batch size 8 (321 ms) is roughly 36 times that for batch size 1 (9 ms), far worse than simply running batch size 1 eight times.
I repeated my tests with different batch sizes and saw a similar trend: for FP16, batched performance looked really bad.
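To spell the arithmetic out, here is a quick back-of-the-envelope calculation (plain Python, illustrative only, using the FP16 averages reported above):

# Measured FP16 averages from Test 2 and Test 4 above (GPU, AlexNet).
fp16_batch1_ms = 9    # batch size 1
fp16_batch8_ms = 321  # batch size 8

# Per-image latency and throughput.
print(f"batch 1: {fp16_batch1_ms / 1:.1f} ms/image, {1000 * 1 / fp16_batch1_ms:.1f} images/sec")
print(f"batch 8: {fp16_batch8_ms / 8:.1f} ms/image, {1000 * 8 / fp16_batch8_ms:.1f} images/sec")

# Eight independent batch-1 runs would take about 8 * 9 = 72 ms,
# versus 321 ms for a single batch of 8.
print(f"8 x batch-1 runs: {8 * fp16_batch1_ms} ms vs one batch-8 run: {fp16_batch8_ms} ms")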
I can think of three possible causes for this behavior:
1. I was doing something wrong.
2. There was something wrong with the classification sample.
3. There was something wrong with the Inference Engine.
Could you look into this issue?
Thanks,
-Robby
Hi Robby,
I've replicated the higher-than-expected execution time for FP16 AlexNet classification at batch size 8, and we are investigating.
However, with a few more data points you should be able to see that, even with the strange behavior for that single combination, the expected general pattern is there:
- Higher batch sizes provide better performance.
- Lower batch sizes allow you to trade performance for lower latency.
- FP16 on GPU gives roughly 2x the performance of FP32.
For reference, I gathered a snapshot (NOT an official benchmark!) of AlexNet classification rates by running the classification sample on my test machine, which has the CVSDK beta installed. It has an i7-6770HQ processor with Iris Pro Graphics 580.
Do you see similar patterns if you test more batch sizes?
GPU, FP16
batch   Avg ms   ms/img   imgs/sec
1       42       42.0     23.8
2       42       21.0     47.6
4       80       20.0     50.0
8       144      18.0     55.6
16      60       3.8      266.7
32      65       2.0      492.3
64      105      1.6      609.5
128     203      1.6      630.5

GPU, FP32
batch   Avg ms   ms/img   imgs/sec
1       18       18.0     55.6
2       18       9.0      111.1
4       25       6.3      160.0
8       32       4.0      250.0
16      51       3.2      313.7
32      103      3.2      310.7
64      198      3.1      323.2
128     397      3.1      322.4

CPU, FP32
batch   Avg ms   ms/img   imgs/sec
1       29       29.0     34.5
2       28       14.0     71.4
4       37       9.3      108.1
8       60       7.5      133.3
16      101      6.3      158.4
32      167      5.2      191.6
64      350      5.5      182.9
128     693      5.4      184.7
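In case you want to automate the sweep, below is a rough sketch of how it could be scripted (illustrative only, not an official tool; it reuses the ModelOptimizer and classification_sample command lines from your posts and simply parses the "Average running time of one iteration" line from the sample's output):

import re
import subprocess

# Sweep batch sizes for FP16 AlexNet (sketch; assumes the default Artifacts output directory).
for batch in (1, 2, 4, 8, 16, 32, 64, 128):
    # Regenerate the IR at this batch size, same flags as in the posts above.
    subprocess.run(
        ["ModelOptimizer", "-w", "./bvlc_alexnet.caffemodel", "-p", "FP16",
         "-d", "./deploy_alexnet.prototxt", "-f", "1", "-b", str(batch),
         "--target", "APLK", "-i"],
        check=True)
    # Run the classification sample on GPU; a single -i input is used here,
    # add more -i arguments if needed for larger batches.
    out = subprocess.run(
        ["./classification_sample", "-i", "ice-creams-227x227.bmp",
         "-m", "./Artifacts/AlexNet/AlexNet.xml",
         "-l", "./synset_words.txt", "-d", "GPU"],
        check=True, capture_output=True, text=True).stdout
    match = re.search(r"Average running time of one iteration: (\d+) ms", out)
    if match:
        avg_ms = int(match.group(1))
        print(f"batch {batch}: {avg_ms} ms avg, "
              f"{avg_ms / batch:.1f} ms/image, {1000 * batch / avg_ms:.1f} images/sec")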
Hi Jeffrey, thanks for the confirmation.
I have only tested a few other batch sizes. In my tests, FP16 seemed to underperform FP32 at other batch sizes too, but my data points were limited. I'll see if I have time to run more tests.
My test platform has a Core i7-6700 (3.4GHz) with an integrated HD Graphics 530.
-Robby
I managed to run tests at the same data points. I'll just post my results, and let you draw the conclusion ;-)
Again, my test platform has a Core i7-6700 (3.4GHz) with an integrated HD Graphics 530.
[ Edit: I can't seem to get the table to display properly. Will have to leave it as is. ]
GPU, FP16
batch-size average-ms ms/image images/sec
1 9 9.0 111.1
2 84 42.0 23.8
4 163 40.8 24.5
8 322 40.3 24.8
16 118 7.4 135.6
32 126 3.9 254.0
64 217 3.4 294.9
128 431 3.4 297.0
GPU, FP32
batch-size average-ms ms/image images/sec
1 15 15.0 66.7
2 24 12.0 83.3
4 41 10.3 97.6
8 52 6.5 153.8
16 93 5.8 172.0
32 178 5.6 179.8
64 351 5.5 182.3
128 700 5.5 182.9
CPU, FP32
batch-size average-ms ms/image images/sec
1 13 13.0 76.9
2 28 14.0 71.4
4 35 8.8 114.3
8 50 6.3 160.0
16 82 5.1 195.1
32 143 4.5 223.8
64 260 4.1 246.2
128 506 4.0 253.0
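That said, here is the GPU comparison spelled out anyway (plain Python, illustrative only, with the images/sec numbers copied from my GPU tables above):

# images/sec from my GPU tables above, keyed by batch size.
fp16 = {1: 111.1, 2: 23.8, 4: 24.5, 8: 24.8, 16: 135.6, 32: 254.0, 64: 294.9, 128: 297.0}
fp32 = {1: 66.7, 2: 83.3, 4: 97.6, 8: 153.8, 16: 172.0, 32: 179.8, 64: 182.3, 128: 182.9}

for batch in sorted(fp16):
    print(f"batch {batch:3d}: FP16/FP32 throughput ratio = {fp16[batch] / fp32[batch]:.2f}")

# On my machine the ratio is below 1.0 for batch sizes 2 through 16,
# i.e. FP16 is slower than FP32 there, and only pulls ahead at batch 32 and up.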
-Robby
Thanks for the additional data. Fortunately, the dev team has already recognized this behavior from internal testing.
So, to go back to your original post, the issue is in IE (your option #3 above), but we should expect improvements in the next release.
Will it work for you to proceed with the current performance limitations until the next release is available?
Hi Jeffrey, thanks for the update. I am glad your team was able to find the real cause so fast.
I can work with the current version for now, and wait for the next release.
-Robby