The results for a fixed set of input data vary depending on whether I run the inference on the CPU or GPU. I have observed this on a range of different networks, but as an example I created a very simple network with 3 layers (convolution, pooling and deconvolution). I initialised the weights to Gaussians for the conv and deconv layers and then converted the network to Intel IR (from Caffe).
I added some basic diagnostics to the segmentation sample to print out the min/max of each output channel. Below is the output from running the sample with -d "GPU" and -d "CPU" (all other parameters remain the same). As you can see, the channel min/max values are quite different, and visually the outputs of the two runs are far from similar.
From some other tests with other networks it appears that the CPU version is closer to the truth than the GPU version. I also noted that if I remove the deconvolution and maxpool layers, leaving just a convolution layer, then the channel min/max results are identical. Perhaps there is some issue with the GPU-accelerated deconvolution?
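To quantify how far apart the two backends are (rather than eyeballing the min/max lines), a divergence check along these lines can be used. This is a minimal illustrative sketch, assuming both output blobs have been dumped to flat float arrays; it is not part of the sample itself:

```python
import numpy as np

def max_abs_diff(cpu_out, gpu_out):
    """Worst-case element-wise divergence between two output blobs."""
    cpu = np.asarray(cpu_out, dtype=np.float32)
    gpu = np.asarray(gpu_out, dtype=np.float32)
    return float(np.max(np.abs(cpu - gpu)))
```

Small rounding-level differences between backends are expected; differences on the order of 0.1 against values of magnitude 0.1 to 0.6, as seen below, are not.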
See output below - Intel IR network attached.
```
.\segmentation_sample.exe -i .\img.png -m convolution.xml -d "GPU"
InferenceEngine:
        API version ............ 1.0
        Build .................. 5852
[ INFO ] Parsing input parameters
[ INFO ] No extensions provided
[ INFO ] Loading plugin
        API version ............ 0.1
        Build .................. prod-02709
        Description ....... clDNNPlugin
[ INFO ] Loading network files
[ INFO ] Preparing input blobs
[ INFO ] Batch size is 1
input item.first = input
input channels = 3
width: 100
height: 100
[ INFO ] Preparing output blobs
output item.first = deconv
[ INFO ] Loading model to the plugin
[ INFO ] Start inference (1 iterations)
Average running time of one iteration: 1.17634 ms
[ INFO ] Processing output blobs
Output: W:101 H:101 C:3 N:1
channel: 0 min: -0.152521 max: 0.245878
channel: 1 min: -0.193329 max: 0.589803
channel: 2 min: -0.126119 max: 0.349479
[ INFO ] Output file : out_dat.bmp was created
[ INFO ] Execution successfull
```
```
.\segmentation_sample.exe -i .\img.png -m convolution.xml -d "CPU"
InferenceEngine:
        API version ............ 1.0
        Build .................. 5852
[ INFO ] Parsing input parameters
[ INFO ] No extensions provided
[ INFO ] Loading plugin
        API version ............ 1.0
        Build .................. win_2018.0.20170425
        Description ....... MKLDnnPlugin
[ INFO ] Loading network files
[ INFO ] Preparing input blobs
[ INFO ] Batch size is 1
input item.first = input
input channels = 3
width: 100
height: 100
[ INFO ] Preparing output blobs
output item.first = deconv
[ INFO ] Loading model to the plugin
[ INFO ] Start inference (1 iterations)
Average running time of one iteration: 0.384 ms
[ INFO ] Processing output blobs
Output: W:101 H:101 C:3 N:1
channel: 0 min: -0.267345 max: 0.402487
channel: 1 min: -0.121432 max: 0.610033
channel: 2 min: -0.263773 max: 0.426556
[ INFO ] Output file : out_dat.bmp was created
[ INFO ] Execution successfull
```
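The min/max lines in the logs above come from a diagnostic along these lines. This is a sketch in Python/NumPy for illustration (the actual modification was made to the C++ segmentation sample); it assumes the output blob is read as a flat float buffer in CHW layout:

```python
import numpy as np

def channel_min_max(blob, channels):
    """Per-channel (min, max) of a flat float output buffer in CHW layout."""
    per_channel = np.asarray(blob, dtype=np.float32).reshape(channels, -1)
    return [(float(c.min()), float(c.max())) for c in per_channel]
```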
I am running the following OpenCL devices/platforms:
```
PS D:\Google-Drive\clients\splitmedialabs\code\opencl-info\x64\Release> .\opencl-info.exe
1. Device: Ellesmere
 1.1 Hardware version: OpenCL 2.0 AMD-APP (2442.9)
 1.2 Software version: 2442.9
 1.3 OpenCL C version: OpenCL C 2.0
 1.4 Parallel compute units: 36
2. Device: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
 2.1 Hardware version: OpenCL 1.2 AMD-APP (2442.9)
 2.2 Software version: 2442.9 (sse2,avx)
 2.3 OpenCL C version: OpenCL C 1.2
 2.4 Parallel compute units: 4
1. Device: Intel(R) HD Graphics 630
 1.1 Hardware version: OpenCL 2.1
 1.2 Software version: 23.20.16.4901
 1.3 OpenCL C version: OpenCL C 2.0
 1.4 Parallel compute units: 24
2. Device: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
 2.1 Hardware version: OpenCL 2.1 (Build 2)
 2.2 Software version: 7.5.0.2
 2.3 OpenCL C version: OpenCL C 2.0
 2.4 Parallel compute units: 4
1. Platform
 1.1 Name : AMD Accelerated Parallel Processing
 1.2 Vendor : Advanced Micro Devices, Inc.
 1.3 Version : OpenCL 2.0 AMD-APP (2442.9)
 1.4 Profile : FULL_PROFILE
 1.5 Extensions : cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices
2. Platform
 2.1 Name : Intel(R) OpenCL
 2.2 Vendor : Intel(R) Corporation
 2.3 Version : OpenCL 2.1
 2.4 Profile : FULL_PROFILE
 2.5 Extensions : cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_fp64 cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
```
Any help appreciated.
Hi Julien,
Could you please also share the modified sample?
Best wishes,
Anna
Hi Anna,
Were you able to replicate the issue?
Thanks
Julien.
Hi Julien,
I was able to reproduce this issue with CV SDK R3. I have good news for you: the bug has already been fixed, and the fix will be included in the next release.
Best wishes,
Anna
Hi Anna,
Thanks - that is good news. When is the next release due?
Is there a way to get access to the fix now, e.g. by using the master branch of the clDNN GitHub project and recompiling?
--
Julien.