Solved: OpenVINO + IntelGPU performance for larger images

Hamaji__Shinichiro · 07-03-2019

While evaluating OpenVINO with Intel GPUs, we found that efficiency becomes worse for high-resolution input images than for low-resolution ones. Some GFMA/sec numbers we have obtained so far:

ResNet50 (GFMA/sec)    224   384   448   672   896
IrisPlus650_fp16     266.1 265.4 242.4 215.3 155.8
IrisPlus650_fp32     183.6 145.0 151.7 121.5 129.0
UHD630_fp16          193.4 205.5 190.5 170.6 124.2
UHD630_fp32          130.1 124.6 119.7 106.5 102.8
UHD600_fp16           47.0  45.3  33.9  27.6  23.1
UHD600_fp32           30.7  35.2  16.1  19.4  18.2

We ran inference 100 times for each image size (so the GPU was thermally throttled, IIUC), took the median of the elapsed times, and calculated GFMA/sec. The estimated FMA counts are ~4, ~12, ~16, ~36, and ~64 GFMA for resolutions 224, 384, 448, 672, and 896, respectively.
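For readers who want to reproduce the arithmetic, here is a minimal sketch of how such numbers could be derived. It assumes the FMA count scales with the input area relative to the ~4 GFMA quoted for 224x224, which matches the estimates above. The function names and the sample timings are hypothetical and are not taken from the benchmark script.

```python
import statistics

BASE_RESOLUTION = 224
BASE_GFMA = 4.0  # approximate ResNet50 cost at 224x224, as quoted in the post


def estimated_gfma(resolution):
    """Assumption: FMA count grows with the number of input pixels."""
    return BASE_GFMA * (resolution / BASE_RESOLUTION) ** 2


def gfma_per_sec(elapsed_seconds, resolution):
    """Throughput computed from the median elapsed time of repeated runs."""
    return estimated_gfma(resolution) / statistics.median(elapsed_seconds)


if __name__ == "__main__":
    for res in (224, 384, 448, 672, 896):
        print(res, round(estimated_gfma(res), 1))
    # Hypothetical timings in seconds, for illustration only.
    print(gfma_per_sec([0.016, 0.015, 0.017], 224))
```

Under this assumption the estimates come out at 4, about 11.8, 16, 36, and 64 GFMA, which is consistent with the rounded figures quoted above.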

I think this is a bit unusual, since accelerators (e.g., NVIDIA GPUs) usually perform better for large inputs, so I'm guessing this is not the best performance the GPU can deliver and the framework is failing to fully utilize it. I observed similar tendencies for other CV models, too. As nGraph performs similarly, I'm guessing the culprit is libclDNN, but I'm not sure at all.

Here's a shell script to reproduce my experiment:

# Please change this path.

DLDT=/path/to/dldt

curl -O https://raw.githubusercontent.com/pfnet-research/chainer-compiler/eccf127c7bfdd4f16f4e34f4b625d6f5810e7fe0/utils/run_onnx_dldt.py
curl -O https://raw.githubusercontent.com/pfnet-research/chainer-compiler/eccf127c7bfdd4f16f4e34f4b625d6f5810e7fe0/utils/run_onnx_util.py
curl -O http://shinh.skr.jp/t/resnet50.tgz
tar -xvzf resnet50.tgz
# Run every model variant in the archive on the GPU with FP16.
for i in resnet50/resnet50_*; do
  PYTHONPATH="$DLDT/model-optimizer:$DLDT/inference-engine/bin/intel64/Release/lib/python_api/python3.6" \
  LD_LIBRARY_PATH="$DLDT/inference-engine/bin/intel64/Release/lib" \
  python3 run_onnx_dldt.py "$i" -I 6 --device=GPU --data_type=FP16
done

Any comments/suggestions would be really appreciated. Thanks!

Shubha_R_Intel · 08-01-2019

Dear Hamaji, Shinichiro,

I have escalated this issue to an OpenVINO developer. In the meantime, could you kindly download OpenVINO 2019R2 and redo your measurements? Many performance issues have been fixed in this release.

Thanks. Please report your findings here.

Sincerely,

Shubha

Shubha_R_Intel · 08-02-2019

Dear Hamaji, Shinichiro,

After researching this issue further, we have found that a performance improvement is not expected with bigger images in this case. The reason is that, for your topology (resnet50), the GPU plugin chooses fully optimized kernels. These kernels are implemented for the specific case of image_size=224x224, and they already utilize the GPU very well. With such kernels, running inference on bigger images leads to worse performance, so your observations are expected behavior, not a bug.

Hope it helps.

Thanks,

Shubha

Hamaji__Shinichiro · 08-02-2019

Got it. Thanks for your answer!

Shubha_R_Intel · 08-05-2019

Dear Hamaji, Shinichiro,

Of course. Thanks for using OpenVINO!

Shubha