Based on a sample program posted here by @Rajiv Deodhar (thanks for the sample program!), I ran into some strange results with my Xeon Phis:
HostA: Xeon E5-2695 v2 with a Xeon Phi 7100, Scientific Linux 6.5, MPSS 3.3.2 stack, Intel compiler icc (ICC) 14.0.1 20131008
Physical hardware is a Dell PowerEdge R720.
HostB: Xeon E5-2620 with a Xeon Phi 5100, CentOS 6.5, MPSS 3.3.2 stack, Intel compiler icc (ICC) 15.0.0 20140723
Physical hardware is a Supermicro X9DRG-HF.
I compiled the sample code small_0.c with both icc 14 and icc 15 using:
icc -openmp -o small_0 small_0.c
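For context, the core of the sample is essentially a pair of timed offload_transfer loops over a 2 MiB-aligned buffer. A rough sketch along those lines (my own reconstruction, not the actual small_0.c) is:

/* Rough sketch of an offload bandwidth loop in the spirit of small_0.c
   (a reconstruction, not the original sample); uses Intel's LEO
   offload_transfer pragmas. Error checking omitted for brevity.
   Build: icc -openmp -o bw_sketch bw_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ITER 10

int main(void)
{
    size_t size = 64UL * 1024 * 1024;   /* one example transfer size */
    char *buf = NULL;
    double t, send, recv;
    int i;

    posix_memalign((void **)&buf, 2UL * 1024 * 1024, size);   /* 2 MiB-aligned host buffer */

    /* Allocate a persistent buffer on the coprocessor. */
    #pragma offload_transfer target(mic:0) in(buf : length(size) alloc_if(1) free_if(0))

    t = omp_get_wtime();
    for (i = 0; i < ITER; i++) {
        /* "Send": host -> coprocessor */
        #pragma offload_transfer target(mic:0) in(buf : length(size) alloc_if(0) free_if(0))
    }
    send = (double)size * ITER / (omp_get_wtime() - t) / (1024.0 * 1024.0 * 1024.0);

    t = omp_get_wtime();
    for (i = 0; i < ITER; i++) {
        /* "Receive": coprocessor -> host */
        #pragma offload_transfer target(mic:0) out(buf : length(size) alloc_if(0) free_if(0))
    }
    recv = (double)size * ITER / (omp_get_wtime() - t) / (1024.0 * 1024.0 * 1024.0);

    /* Free the coprocessor-side buffer. */
    #pragma offload_transfer target(mic:0) nocopy(buf : length(size) alloc_if(0) free_if(1))

    printf("%zu bytes: Send %.2f GiB/s, Receive %.2f GiB/s\n", size, send, recv);
    free(buf);
    return 0;
}

In this terminology, "Send" is host-to-coprocessor and "Receive" is coprocessor-to-host, which is the direction that collapses on hostB below.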
When I run the binaries, small_0.icc14 and small_0.icc15, on hostA, I get a nice and consistent 6 GiB/s send and receive speed.
When I run the binaries on hostB, I get mixed results:
- "Send" performance is almost always 6+ GiB/s for buffers that are large enough
- Most of the time the icc14 binary gives good "Receive" performance as well, but sometimes it collapses to the same level as the icc15 binary
- The icc15 binary gives good "Send" performance, but "Receive" performance is usually awful:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.13            0.15
2048          0.28            0.31
4096          0.54            0.62
8192          0.79            0.88
16384         1.24            1.53
32768         2.13            2.50
65536         3.11            3.63
131072        4.18            4.61
262144        5.02            5.34
524288        5.63            5.88
1048576       5.97            6.19
2097152       6.23            6.36
4194304       6.29            6.40
8388608       5.99            1.26
16777216      6.16            0.68
33554432      6.13            0.54
67108864      6.29            0.55
134217728     6.37            0.51
268435456     6.41            0.47
536870912     6.43            0.38
1073741824    6.44            0.33
Setting the environment variable MIC_USE_2MB_BUFFERS=2M sometimes helps, but right now it does not seem to make a difference.
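(As far as I understand the documentation, MIC_USE_2MB_BUFFERS=2M tells the offload runtime to back pointer-based buffers of 2 MB and larger with 2 MB pages on the coprocessor; I set it on the host before launching the binary, e.g. export MIC_USE_2MB_BUFFERS=2M.)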
Running it as root or as a regular user also has no effect. HostB is not loaded; I'm the only user on both the host and the Phis. Switching to the second Phi does not make any difference.
netperf performance on hostB seems OK: 4+ Gbit/s one way, 1.8 Gbit/s the other (compared to 4+/1.9 Gbit/s for hostA).
How can I debug this further? If needed, access to hostB can be provided; contact me offline for that.
Further testing: I installed the Intel OpenCL runtime for the MIC and ran a slightly modified version of the oclBandwidthTest from the NVIDIA OpenCL sample showcase. This sample tests the bandwidth from host to device, device to host, and device to device. For the 'faulty' host (hostB) the results are:
$ ./oclBandwidthTest --memory=pinned
[oclBandwidthTest] starting...
./oclBandwidthTest Starting...
WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!
Running on...
Intel(R) Many Integrated Core Acceleration Card
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory, direct access
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                6356.9
Device to Host Bandwidth, 1 Device(s), Pinned memory, direct access
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1941.2
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                2974.5
[oclBandwidthTest] test results...
PASSED
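For reference, the pinned host-to-device measurement in that sample boils down to timing repeated clEnqueueWriteBuffer calls from a mapped CL_MEM_ALLOC_HOST_PTR buffer. A rough sketch (my own, not the actual oclBandwidthTest code, and assuming the Phi shows up as the first CL_DEVICE_TYPE_ACCELERATOR device) looks like this:

/* Rough sketch of a pinned host-to-device bandwidth measurement, in the
   spirit of the NVIDIA oclBandwidthTest sample (not the actual code).
   Error checking omitted for brevity.
   Build (assumed): icc -o ocl_bw ocl_bw.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

#define SIZE (32 * 1024 * 1024)   /* 33554432 bytes, as in the output above */
#define ITER 10

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_context ctx;
    cl_command_queue q;
    cl_mem pinned, devbuf;
    void *host;
    cl_event ev;
    cl_ulong t0, t1, total = 0;
    int i;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);

    /* "Pinned" host memory: a CL_MEM_ALLOC_HOST_PTR buffer mapped for direct access. */
    pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, SIZE, NULL, NULL);
    devbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, SIZE, NULL, NULL);
    host = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                              0, SIZE, 0, NULL, NULL, NULL);

    for (i = 0; i < ITER; i++) {
        /* Host -> device copy; profiling events give the transfer time in nanoseconds. */
        clEnqueueWriteBuffer(q, devbuf, CL_TRUE, 0, SIZE, host, 0, NULL, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        total += t1 - t0;
        clReleaseEvent(ev);
    }
    printf("Host to Device: %.1f MB/s\n",
           (double)SIZE * ITER / ((double)total / 1e9) / 1e6);
    return 0;
}

The device-to-host direction is the same loop with clEnqueueReadBuffer instead of clEnqueueWriteBuffer.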
On hostA I get 6+ GB/s in both directions with the same test; this also shows that offload performance and OpenCL memory-transfer performance are on par :)
On a host with NVIDIA cards I get 6+ GB/s in both directions as well.
I'm starting to suspect something odd is going on with the PCI Express bus on hostB, but I have no clue how to test or debug this further. Any help would be greatly appreciated.
I was able to reproduce the issue you observed: if I compiled the program small_0.c with the Intel compiler 2013, the bandwidth looked good (~6 GB/sec), but if I recompiled the program with the Intel compiler 2015, there was a bandwidth difference between Send and Receive for large messages. This issue has been reported for further investigation.
For your information, if you use the Intel OpenCL runtime, the latest validated MPSS version is 3.3, not 3.3.2 (see the OpenCL Runtime Release Notes here: https://software.intel.com/en-us/articles/opencl-runtime-release-notes).
I tried your code on our 5110P; here are the results.
For icc 14:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.16            0.17
2048          0.33            0.37
4096          0.67            0.72
8192          0.90            0.94
16384         1.44            1.64
32768         2.28            2.63
65536         3.34            3.72
131072        4.41            4.75
262144        5.22            5.46
524288        5.74            5.93
1048576       6.05            6.19
2097152       5.94            6.04
4194304       6.18            6.28
8388608       6.27            6.31
16777216      6.23            6.44
33554432      6.29            6.41
67108864      6.34            6.45
134217728     6.37            6.48
268435456     6.38            6.49
536870912     6.34            6.46
1073741824    6.34            6.45
For icc 15:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.15            0.18
2048          0.31            0.37
4096          0.62            0.69
8192          0.86            0.98
16384         1.40            1.68
32768         2.25            2.63
65536         3.32            3.64
131072        4.33            4.77
262144        4.99            5.46
524288        5.71            5.87
1048576       5.98            6.21
2097152       6.19            6.34
4194304       6.29            6.43
8388608       6.22            6.11
16777216      6.18            6.18
33554432      6.15            6.27
67108864      6.27            6.38
134217728     6.33            6.43
268435456     6.36            6.46
536870912     6.37            6.47
1073741824    6.34            6.45
I cannot reproduce the results you provided.
Hi all,
As a late follow-up: a BIOS upgrade on the Supermicro server did the trick. Performance is now 6 GiB/s Send & Receive.
Thread closed,
JJK
