
Performance of Infiniband verbs issued from the MIC processor

Grady_Schofield
Beginner

I wanted to look at using InfiniBand verbs on the MIC card for transfers between two different compute nodes: MIC to MIC, system memory to system memory, or the system memory on one node to the MIC on the other.  You are supposed to be able to do something like this with GPUDirect for Nvidia cards, but the GPUDirect documentation I could find is too sketchy right now to be usable.  I wrote a program to do some transfers using RDMA and found that transfers from host memory to host memory are much faster than transfers from MIC memory to MIC memory.

My own code was excised from a larger program and is hard to understand and build, but I realized that the standard program shipped with the OFED software, ibv_rc_pingpong, can be used to demonstrate the issue.  It is already installed on the compute nodes and on the MIC processors on Stampede, and the data in the attached plot comes from it.  If you try it, you need to specify the device with '-d mlx4_0' because there are two devices, and the other one, 'scif0', produces nonsense when trying to connect two separate compute nodes.  I also increased the maximum transfer unit to 2048 with '-m 2048', which improved the rates somewhat.  Depending on where in the network your two compute nodes are placed, there can be some variability in the rates, but all data in the plot was taken with the same two nodes, and the rates were typical of what I saw after trying several pairs of nodes.
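
For anyone who wants to reproduce this, an invocation along the following lines should work; the message size and hostname below are placeholders rather than the exact values behind the plot:

On one node (the server side):
% ibv_rc_pingpong -d mlx4_0 -m 2048 -s 1048576

On the other node (the client side), pointing at the first:
% ibv_rc_pingpong -d mlx4_0 -m 2048 -s 1048576 server-node-hostname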

From looking at the source code for ibv_rc_pingpong, it doesn't use RDMA: the opcode used in its send work requests is IBV_WR_SEND (receives are simply posted to the receive queue), not IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ.  Regardless, the rates look the same as with my RDMA code, where the best host-to-host transfer rate is around 5.77 GB/s and the best MIC-to-MIC transfer rate is around 0.92 GB/s.
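
For context, the difference between the two modes comes down to the opcode and the remote address information on the work request.  A minimal sketch of posting an RDMA write with libibverbs looks roughly like this (this is not my benchmark code; queue pair setup, memory registration, and the out-of-band exchange of the peer's buffer address and rkey are assumed to have happened already):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a single RDMA WRITE from a locally registered buffer to a remote
 * buffer whose address and rkey were exchanged out of band beforehand.
 * Returns 0 on success, or an errno-style value from ibv_post_send. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof sge);
    sge.addr   = (uintptr_t) local_buf;   /* local registered memory */
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof wr);
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;    /* ibv_rc_pingpong uses IBV_WR_SEND here */
    wr.send_flags = IBV_SEND_SIGNALED;    /* ask for a completion we can poll for */
    wr.wr.rdma.remote_addr = remote_addr; /* peer's registered buffer address */
    wr.wr.rdma.rkey        = rkey;        /* peer's remote key */

    return ibv_post_send(qp, &wr, &bad_wr);
}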

Details of how ibv_rc_pingpong works aside, the important thing is that the program is doing the same thing whether it runs on the host or on the MIC, and whenever the MIC is involved the transfer rates are much lower.  I also have host-to-MIC rates in the plot, which are a little better than MIC-to-MIC.  Surely this is just a driver issue, and there is no limit in the underlying hardware that makes the rates involving the MIC so low.  In fact, judging from the SCIF transfer rates I posted elsewhere in this forum, it would be about 2x faster to do a three-step process: send the data from the MIC to the local host, then out to the remote host, then up to the remote MIC.

I am working on a large sparse eigensolver for a materials science problem where, roughly speaking, an eigenvector is needed for each electron in the system.  A very large collection of vectors is produced, and dense matrix operations are performed on those vectors; however, it is sparse matrix operations that produce the vectors in the first place.  Because system memory is so much larger than accelerator memory, I had been thinking it might make sense to do the dense operations on the host (where the vectors must eventually end up anyway, given the larger host memory) and the sparse operations on the MIC.  This is counterintuitive, since accelerators are usually considered poor at sparse matrix computations.  The point of this test, and of the SCIF tests in my other post, is to decide whether that scheme can actually work.  At 0.92 GB/s, forget about it.  Does anyone at Intel think this is just a driver issue that can be improved through future software updates?

Frances_R_Intel
Employee

I am going to pass this issue on to someone more knowledgeable, but for the sake of completeness, could you tell me which Linux distribution you are using on the host and which version of MPSS?

Grady_Schofield
Beginner

Hey Frances,

This is on Stampede.  The operating system is CentOS 6.3.  The only thing I could find on the MPSS version is in the file /etc/issue on the MIC card.  It said this:

Intel MIC Platform Software Stack release 2.1
Kernel 2.6.34.11-g65c0cd9 on an k1om

Is there another way to check the MPSS version that will get the third number in the version string? 

Loc_N_Intel
Employee

Hi Grady,

You can run the micinfo command to retrieve the MPSS information:

% /opt/intel/mic/bin/micinfo

Grady_Schofield
Beginner

Thanks, Loc.  This is the output.

MicInfo Utility Log

Created Wed Feb 27 15:07:12 2013


    System Info
        Host OS                 : Linux
        OS Version              : 2.6.32-279.el6.x86_64
        Driver Version          : 4346-16
        MPSS Version            : 2.1.4346-16
        Host Physical Memory    : 32836 MB
        CPU Family              :  GenuineIntel  Family  6  Model  45  Stepping  7
        CPU Speed               :  2701.000
        Threads per Core        : 1


Device No: 0,  Device Name: Intel(R) Xeon Phi(TM) coprocessor

    Version
        Flash Version           : 2.1.01.0375
        UOS Version             : 2.6.34.11-g65c0cd9
        Device Serial Number    : ADKC23000348

    Board
        Vendor ID                  : 8086
        Device ID                  : 225c
        SubSystem ID               : 2500
        MIC Processor Stepping ID  : 1
        PCIe Width                 : Insufficient Privileges
        PCIe Speed                 : Insufficient Privileges
        PCIe Max payload size      : Insufficient Privileges
        PCIe Max read req size     : Insufficient Privileges
        MIC Processor Model        : 0x01
        MIC Processor Model Ext    : 0x00
        MIC Processor Type         : 0x00
        MIC Processor Family       : 0x0b
        MIC Processor Family Ext   : 0x00
        MIC Silicon Stepping       : B0
        Board SKU                  : ES2-P1750
        ECC Mode                   : Enabled
        SMC HW Revision            : Product 300W Passive CS

    Core
        Total No of Active Cores: 61
        Voltage                 : 1074000 uV
        Frequency               : 1090909 kHz

    Thermal
        Fan Speed Control       : N/A
        SMC Firmware Version    : 1.6.3983
        FSC Strap               : 14 MHz
        Fan RPM                 : N/A
        Fan PWM                 : N/A
        Die Temp                : 41 C

    GDDR
        GDDR Vendor             : Elpida
        GDDR Version            : 0x1
        GDDR Density            : 2048 Mb
        GDDR Size               : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed              : 5.500000 GT/s
        GDDR Frequency          : 2750000 kHz
        GDDR Voltage            : 1000000 uV

Frances_R_Intel
Employee

I've passed this on - I will let you know when I hear something.

Frances_R_Intel
Employee

Grady,

A few more questions -

The version of MPSS you are using on Stampede is not the latest. Do you have admin privileges on Stampede, or have you been working with an admin there who could try installing the newest version (2.6.38) to see whether the problem is reproducible with that release? Alternatively, is there another system there with the latest release installed?

What version of OFED are you using? Is this the version from the www.openfabrics.org site and is this where your copy of the ping pong code comes from?

Grady_Schofield
Beginner

Hey Frances,

They are going to skip 2.6.38 and install the next version.  I'll try this again when they do.  The version of OFED does come from openfabrics.org.
