Based on a sample program posted here by @Rajiv Deodhar (thanks for the sample program!), I ran into some strange results with my Xeon Phis:
HostA: Xeon E5-2695 v2 with a Xeon Phi 7100, Scientific Linux 6.5, MPSS 3.3.2 stack, Intel compiler icc (ICC) 14.0.1 20131008
Physical hardware is a Dell PowerEdge R720.
HostB: Xeon E5-2620 with a Xeon Phi 5100, CentOS 6.5, MPSS 3.3.2 stack, Intel compiler icc (ICC) 15.0.0 20140723
Physical hardware is a Supermicro X9DRG-HF.
I compiled the sample code small_0.c with both icc 14 and icc 15 using:
icc -openmp -o small_0 small_0.c
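For context, the core of the sample is essentially a pair of timed offload_transfer loops over a 2 MiB-aligned buffer. A rough sketch along those lines (my own reconstruction, not the actual small_0.c) is:

/* Rough sketch of an offload bandwidth loop in the spirit of small_0.c
   (a reconstruction, not the original sample); uses Intel's LEO
   offload_transfer pragmas. Error checking omitted for brevity.
   Build: icc -openmp -o bw_sketch bw_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ITER 10

int main(void)
{
    size_t size = 64UL * 1024 * 1024;   /* one example transfer size */
    char *buf = NULL;
    double t, send, recv;
    int i;

    posix_memalign((void **)&buf, 2UL * 1024 * 1024, size);   /* 2 MiB-aligned host buffer */

    /* Allocate a persistent buffer on the coprocessor. */
    #pragma offload_transfer target(mic:0) in(buf : length(size) alloc_if(1) free_if(0))

    t = omp_get_wtime();
    for (i = 0; i < ITER; i++) {
        /* "Send": host -> coprocessor */
        #pragma offload_transfer target(mic:0) in(buf : length(size) alloc_if(0) free_if(0))
    }
    send = (double)size * ITER / (omp_get_wtime() - t) / (1024.0 * 1024.0 * 1024.0);

    t = omp_get_wtime();
    for (i = 0; i < ITER; i++) {
        /* "Receive": coprocessor -> host */
        #pragma offload_transfer target(mic:0) out(buf : length(size) alloc_if(0) free_if(0))
    }
    recv = (double)size * ITER / (omp_get_wtime() - t) / (1024.0 * 1024.0 * 1024.0);

    /* Free the coprocessor-side buffer. */
    #pragma offload_transfer target(mic:0) nocopy(buf : length(size) alloc_if(0) free_if(1))

    printf("%zu bytes: Send %.2f GiB/s, Receive %.2f GiB/s\n", size, send, recv);
    free(buf);
    return 0;
}

In this terminology, "Send" is host-to-coprocessor and "Receive" is coprocessor-to-host, which is the direction that collapses on hostB below.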
When I run the binaries, small_0.icc14 and small_0.icc15, on hostA, I get a nice and consistent 6 GiB/s send and receive speed.
When I run the binaries on hostB, I get mixed results:
- "Send" performance is almost always 6+ GiB/s for buffers that are large enough
- Most of the time the icc14 binary gives good "Receive" performance as well, but sometimes it collapses to the same level as the icc15 binary
- The icc15 binary gives good "Send" performance, but "Receive" performance is usually awful:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.13            0.15
2048          0.28            0.31
4096          0.54            0.62
8192          0.79            0.88
16384         1.24            1.53
32768         2.13            2.50
65536         3.11            3.63
131072        4.18            4.61
262144        5.02            5.34
524288        5.63            5.88
1048576       5.97            6.19
2097152       6.23            6.36
4194304       6.29            6.40
8388608       5.99            1.26
16777216      6.16            0.68
33554432      6.13            0.54
67108864      6.29            0.55
134217728     6.37            0.51
268435456     6.41            0.47
536870912     6.43            0.38
1073741824    6.44            0.33
Setting the environment variable MIC_USE_2MB_BUFFERS=2M sometimes helps, but right now it does not seem to make a difference.
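(As far as I understand the documentation, MIC_USE_2MB_BUFFERS=2M tells the offload runtime to back pointer-based buffers of 2 MB and larger with 2 MB pages on the coprocessor; I set it on the host before launching the binary, e.g. export MIC_USE_2MB_BUFFERS=2M.)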
Running it as root or as a regular user also has no effect. HostB is not loaded; I'm the only user on both the host and the Phis. Switching to the second Phi does not make any difference.
netperf performance on hostB seems OK: 4+ Gbit/s one way, 1.8 Gbit/s the other (compared to 4+/1.9 Gbit/s for hostA).
How can I debug this further? If needed, access to hostB can be provided; contact me offline for that.
Further testing: I installed the Intel OpenCL runtime for the MIC and ran a slightly modified version of the oclBandwidthTest from the NVIDIA OpenCL sample showcase. This sample tests the bandwidth from host to device, device to host, and device to device. For the 'faulty' host (hostB) the results are:
$ ./oclBandwidthTest --memory=pinned
[oclBandwidthTest] starting...
./oclBandwidthTest Starting...
WARNING: NVIDIA OpenCL platform not found - defaulting to first platform!
Running on...
Intel(R) Many Integrated Core Acceleration Card
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory, direct access
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                6356.9
Device to Host Bandwidth, 1 Device(s), Pinned memory, direct access
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1941.2
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                2974.5
[oclBandwidthTest] test results...
PASSED
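For reference, the pinned host-to-device measurement in that sample boils down to timing repeated clEnqueueWriteBuffer calls from a mapped CL_MEM_ALLOC_HOST_PTR buffer. A rough sketch (my own, not the actual oclBandwidthTest code, and assuming the Phi shows up as the first CL_DEVICE_TYPE_ACCELERATOR device) looks like this:

/* Rough sketch of a pinned host-to-device bandwidth measurement, in the
   spirit of the NVIDIA oclBandwidthTest sample (not the actual code).
   Error checking omitted for brevity.
   Build (assumed): icc -o ocl_bw ocl_bw.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

#define SIZE (32 * 1024 * 1024)   /* 33554432 bytes, as in the output above */
#define ITER 10

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_context ctx;
    cl_command_queue q;
    cl_mem pinned, devbuf;
    void *host;
    cl_event ev;
    cl_ulong t0, t1, total = 0;
    int i;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);
    ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);

    /* "Pinned" host memory: a CL_MEM_ALLOC_HOST_PTR buffer mapped for direct access. */
    pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, SIZE, NULL, NULL);
    devbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, SIZE, NULL, NULL);
    host = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                              0, SIZE, 0, NULL, NULL, NULL);

    for (i = 0; i < ITER; i++) {
        /* Host -> device copy; profiling events give the transfer time in nanoseconds. */
        clEnqueueWriteBuffer(q, devbuf, CL_TRUE, 0, SIZE, host, 0, NULL, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        total += t1 - t0;
        clReleaseEvent(ev);
    }
    printf("Host to Device: %.1f MB/s\n",
           (double)SIZE * ITER / ((double)total / 1e9) / 1e6);
    return 0;
}

The device-to-host direction is the same loop with clEnqueueReadBuffer instead of clEnqueueWriteBuffer.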
On hostA I get 6+ GB/s in both directions with the same test; this also shows that offload performance and OpenCL memory-transfer performance are on par :)
On a host with NVIDIA cards I get 6+ GB/s in both directions as well.
I'm starting to suspect something odd is going on with the PCI Express bus on hostB, but I have no clue how to test or debug this further. Any help would be greatly appreciated.
I was able to reproduce the issue you observed: if I compiled the program small_0.c with the Intel compiler 2013, the bandwidth looked good (~6 GB/sec), but if I recompiled the program with the Intel compiler 2015, there was a bandwidth difference between Send and Receive for large messages. This issue has been reported for further investigation.
For your information, if you use the Intel OpenCL runtime, the latest validated MPSS version is 3.3, not 3.3.2 (see the OpenCL Runtime Release Notes here: https://software.intel.com/en-us/articles/opencl-runtime-release-notes).
I tried your code on our 5110P; here are the results.
For icc 14:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.16            0.17
2048          0.33            0.37
4096          0.67            0.72
8192          0.90            0.94
16384         1.44            1.64
32768         2.28            2.63
65536         3.34            3.72
131072        4.41            4.75
262144        5.22            5.46
524288        5.74            5.93
1048576       6.05            6.19
2097152       5.94            6.04
4194304       6.18            6.28
8388608       6.27            6.31
16777216      6.23            6.44
33554432      6.29            6.41
67108864      6.34            6.45
134217728     6.37            6.48
268435456     6.38            6.49
536870912     6.34            6.46
1073741824    6.34            6.45
For icc 15:
Bandwidth test for pointers. DeviceID: 0. Data alignment: 2097152. Number of iterations: 10.
Size(Bytes)   Send(GiB/sec)   Receive(GiB/sec)
1024          0.15            0.18
2048          0.31            0.37
4096          0.62            0.69
8192          0.86            0.98
16384         1.40            1.68
32768         2.25            2.63
65536         3.32            3.64
131072        4.33            4.77
262144        4.99            5.46
524288        5.71            5.87
1048576       5.98            6.21
2097152       6.19            6.34
4194304       6.29            6.43
8388608       6.22            6.11
16777216      6.18            6.18
33554432      6.15            6.27
67108864      6.27            6.38
134217728     6.33            6.43
268435456     6.36            6.46
536870912     6.37            6.47
1073741824    6.34            6.45
I cannot reproduce the results you provided.
Hi all,
As a late follow-up: a BIOS upgrade on the Supermicro server did the trick. Performance is now 6 GiB/s Send & Receive.
Thread closed,
JJK
