PV__Sharfudheen
Beginner

GPU inference slower than CPU inference

I expected the GPU inference time to be less than the CPU inference time, but GPU inference takes longer than CPU inference. I am running inference on a 30-minute video at 1280x720 resolution in both CPU and GPU modes. The details are below.

I used object_detection_demo_ssd_async.exe and the person-detection-retail-0013.xml (FP32) model.

OpenVINO version: computer_vision_sdk_2018.3.343

System details:

Processor: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz

RAM: 32 GB

Type: 64-bit, Windows 10

GPU 1: NVIDIA Quadro M2000M

GPU 2: Intel(R) HD Graphics P530

1. CPU

-i Input(1).mp4 -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP32\person-detection-retail-0013.xml -d CPU

InferenceEngine:
        API version ............ 1.2
        Build .................. 13911
        [ INFO ] Parsing input parameters
        [ INFO ] Reading input
        [ INFO ] Loading plugin

        API version ............ 1.2
        Build .................. win_20180511
        Description ....... MKLDNNPlugin
        [ INFO ] Loading network files
        [ INFO ] Batch size is forced to  1.
        [ INFO ] Checking that the inputs are as the sample expects
        [ INFO ] Checking that the outputs are as the sample expects
        [ INFO ] Loading model to the plugin

Inference time (per image): 12.25 ms

2. GPU

-i Input(1).mp4 -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP32\person-detection-retail-0013.xml -d GPU

InferenceEngine:
        API version ............ 1.2
        Build .................. 13911
        [ INFO ] Parsing input parameters
        [ INFO ] Reading input
        [ INFO ] Loading plugin

        API version ............ 1.2
        Build .................. cldnn/GEN_GPU_clDNN_ci-main_cldnn-main-03988_artifacts.zip
        Description ....... clDNNPlugin
        [ INFO ] Loading network files
        [ INFO ] Batch size is forced to  1.
        [ INFO ] Checking that the inputs are as the sample expects
        [ INFO ] Checking that the outputs are as the sample expects
        [ INFO ] Loading model to the plugin
        [ INFO ] Start inference

Inference time (per image): 48.95 ms

Please help me configure the system so that GPU inference runs faster than CPU inference. Please find the attached Display Adapter and OpenCL info. Do I need to do any graphics configuration to make it run faster on the GPU? I would appreciate anyone's support in solving this issue.

Thanks

7 Replies
nikos1
Valued Contributor I

Hello Sharfudheen,

Your numbers look good. In some cases a fast CPU can execute inference faster than a slower GPU; I see this too in many configurations on systems in my lab. In your case the P530 does not have as many EUs as, for example, the P580 ( https://www.intel.com/content/dam/www/public/us/en/documents/guides/hd-graphics-p530-p580-performanc...; ,  https://ark.intel.com/products/89608/Intel-Xeon-Processor-E3-1505M-v5-8M-Cache-2-80-GHz- )

I would suggest running the FP16 IR on GPU and FP32 on CPU. In your case you are running FP32 on GPU:

-i Input(1).mp4 -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP32\person-detection-retail-0013.xml -d GPU

Try something like:

-i Input(1).mp4 -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP16\person-detection-retail-0013.xml -d GPU

Is it faster now? It will still be slower than your quad-core CPU, but there are other benefits, like power efficiency. Also, the CPU will not be as loud on the GPU path.

Cheers,

Nikos

PV__Sharfudheen
Beginner

Hi Nikos,

Thanks for the reply. I have tried the FP16 model as well; the results are below.

FP16 - 32.56 ms (GPU)

FP32 - 48.95 ms (GPU)

FP32 - 12.25 ms (CPU)

I am able to handle inference for 4 cameras in real time on CPU and would like to handle more camera streams using the GPU.
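
As a rough sanity check on scaling to more cameras, the measured per-frame latency bounds how many live streams one device can serve. A minimal back-of-the-envelope sketch, assuming frames from all cameras are inferred sequentially on one device and a hypothetical 20 fps per camera (the thread does not state the actual camera frame rate), and ignoring decode/pre/post-processing overhead:

```shell
# Streams a device can sustain = floor((1000 / latency_ms) / camera_fps).
# The 20 fps per camera is an assumption, not a figure from this thread.
latency_check() {
  awk -v lat="$1" -v fps="$2" 'BEGIN { print int((1000 / lat) / fps) }'
}

latency_check 12.25 20   # CPU, FP32 -> 4 streams
latency_check 32.56 20   # GPU, FP16 -> 1 stream
```

By this estimate the GPU at ~32 ms per frame cannot keep up with even two 20 fps streams on its own, which matches the experience described above.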

My system has built-in Intel(R) HD Graphics P530 and an NVIDIA Quadro M2000M, and both are configured with OpenCL 1.2.

How do I ensure inference is fully utilizing the GPU? Please have a look at the images attached above for the system GPU details.

I have tried running the exe (GPU mode) after disabling the NVIDIA Quadro M2000M GPU in Device Manager, but didn't see any improvement in inference time.

It would be a great help if someone could guide me in resolving this issue.

Thanks in advance

Sharaf

nikos1
Valued Contributor I

Hi Sharaf,

> FP16 - 32.56 ms (GPU)

> FP32 - 48.95 ms (GPU)

FP16 is faster than FP32, so you can use FP16 on GPU if you have no accuracy issues.

To analyze your four-camera pipeline you really need to use profilers (like VTune or other CPU or OpenCL / GPU profilers).

Also try connecting a monitor directly to the system HDMI port (not to the NVIDIA GPU) so that the Intel GPU clocks go higher (?). That did not help me here, though. I am not sure where the bottleneck is for 4 cameras - we would need to see end-to-end system profiling traces.

Another idea is to use the HETERO plugin so that you run on both the CPU and GPU devices. Have you tried that already?
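
A sketch of how the HETERO plugin can be selected, reusing the input and FP16 model paths from earlier in the thread; `HETERO:GPU,CPU` asks the Inference Engine to place each layer on the GPU where supported and fall back to the CPU otherwise (the exact placement is decided per layer by the plugin):

```shell
object_detection_demo_ssd_async.exe -i Input(1).mp4 ^
  -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP16\person-detection-retail-0013.xml ^
  -d HETERO:GPU,CPU
```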

To be honest, I am not sure what the real issue is here. It is a fact that for this particular network the GPU runs slower on my HD630 system too: my Core i7 CPU takes 6 ms and my HD630 GPU 29 ms.

mbox_conf1/out/conv/flat/r... NOT_RUN        layerType: Reshape            realTime: 0          cpu: 0              execType: unknown
mbox_conf1/out/conv/flat/s... EXECUTED       layerType: SoftMax            realTime: 108        cpu: 108            execType: ref_any
mbox_conf1/out/conv/flat/s... NOT_RUN        layerType: Flatten            realTime: 0          cpu: 0              execType: unknown
mbox_conf1/out/conv/perm      EXECUTED       layerType: Permute            realTime: 11         cpu: 11             execType: unknown
mbox_loc1/out/conv            EXECUTED       layerType: Convolution        realTime: 313        cpu: 313            execType: jit_avx2
mbox_loc1/out/conv/flat       NOT_RUN        layerType: Flatten            realTime: 0          cpu: 0              execType: unknown
mbox_loc1/out/conv/perm       EXECUTED       layerType: Permute            realTime: 24         cpu: 24             execType: unknown
out_detection_out             NOT_RUN        layerType: Output             realTime: 0          cpu: 0              execType: unknown
Total time: 6365     microseconds
[ INFO ] Execution successful

and the HD 630 is slower:

mbox1/priorbox_cldnn_custo... NOT_RUN        layerType: Reorder            realTime: 0          cpu: 0              execType: undef
mbox_conf1/out/conv           EXECUTED       layerType: Convolution        realTime: 257        cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
mbox_conf1/out/conv/flat      OPTIMIZED_OUT  layerType: Flatten            realTime: 0          cpu: 0              execType: undef
mbox_conf1/out/conv/flat/r... OPTIMIZED_OUT  layerType: Reshape            realTime: 0          cpu: 0              execType: undef
mbox_conf1/out/conv/flat/s... EXECUTED       layerType: SoftMax            realTime: 6          cpu: 2              execType: softmax_gpu_ref
mbox_conf1/out/conv/flat/s... OPTIMIZED_OUT  layerType: Flatten            realTime: 0          cpu: 0              execType: undef
mbox_conf1/out/conv/perm      EXECUTED       layerType: Permute            realTime: 16         cpu: 2              execType: permute_ref
mbox_loc1/out/conv            EXECUTED       layerType: Convolution        realTime: 265        cpu: 2              execType: convolution_gpu_bfyx_os_iyx_osv16
mbox_loc1/out/conv/flat       OPTIMIZED_OUT  layerType: Flatten            realTime: 0          cpu: 0              execType: undef
mbox_loc1/out/conv/perm       EXECUTED       layerType: Permute            realTime: 23         cpu: 2              execType: permute_ref
Total time: 291134   microseconds
[ INFO ] Execution successful

JFTR, if you use -pc you will see an analysis of which layers run slow, and maybe you can think of other optimization options.
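
The per-layer tables above are the kind of output -pc produces. A sketch of such an invocation, reusing the GPU command from the start of the thread:

```shell
object_detection_demo_ssd_async.exe -i Input(1).mp4 ^
  -m C:\Intel\computer_vision_sdk_2018.3.343\deployment_tools\intel_models\person-detection-retail-0013\FP32\person-detection-retail-0013.xml ^
  -d GPU -pc
```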

Cheers,

Nikos

nikos1
Valued Contributor I

HETERO did not help too much in my case. 

The other option is batching; that may be the only way to support four cameras in real time.

PV__Sharfudheen
Beginner

Hi Nikos,

Thanks for the reply. I don't have a problem with accuracy; my concern is performance (inference time).

I have tried a system with a different configuration (Core i7 CPU + Intel(R) HD Graphics 530 + NVIDIA GeForce GTX 950M) and am still facing the same issue (GPU inference slower than CPU inference).

I tried batch processing as well (batch size 5) but couldn't see much improvement in performance. I think there is some problem with GPU utilization for inference, but it's not showing any message or error.

Please have a look at the attached NVIDIA Quadro M2000M.png and Intel HD graphics.png files. There are no current-clock and max-clock values for the Intel HD Graphics.

Is there any software that can be used to see GPU utilization? I am using Windows 10 version 1703 (we can't see GPU utilization in Task Manager in this version).

Regards,

Sharaf

nikos1
Valued Contributor I

VTune ( https://software.intel.com/en-us/vtune ) is the best profiler that can also show you GPU utilization and OpenCL timeline.

Also try https://github.com/openhardwaremonitor, but you may have to compile it from source and change device IDs. It should be able to show GPU load.

You could also move your Windows to 1803 and see GPU load without any tools.

Cheers,

Nikos

PV__Sharfudheen
Beginner

Hi Nikos,

Thanks.

Can someone suggest a GPU / configuration where object_detection_demo_ssd_async.exe runs faster on GPU than on CPU (Windows)?

I would like to hear from you all.

Regards,

Sharaf
