MPSS 3.3 + OFED on CentOS 7

Steffen_M_
Beginner
1,874 Views

Hi,

looking at the MPSS 3.3 release notes, I discovered that there is now support for RHEL 7. I wanted to try that and hijacked one of our cluster nodes to install CentOS 7 and the MPSS stack. The installation of MPSS was completely painless (and I like the fact that there are even service files for systemd), but I haven't managed to install OFED. The MPSS User Manual states that the only supported option for RHEL 7 is OFED-3.5-2-mic, but running its installation script after the MPSS installation failed with a compilation error while building the intel-mic-ofed-compat-rdma RPM. The build logs seem to imply that the OFED sources don't support the 3.10-based kernel in RHEL 7:

make -f scripts/Makefile.build obj=/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/compat
  gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/compat/.main.o.d  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include \
        -D__OFED_BUILD__ \
        -DCOMPAT_BASE="\"compat-2012-07-02-13-gde310fa\"" -DCOMPAT_BASE_TREE="\"unknown\"" -DCOMPAT_BASE_TREE_VERSION="\"v3.5\"" -DCOMPAT_PROJECT="\"Compat-rdma\"" -DCOMPAT_VERSION="\"a5bbb76-np\""  \
        -include /lib/modules/3.10.0-123.4.2.el7.x86_64/build/include/generated/autoconf.h \
        -include /var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/include/linux/autoconf.h \
        -include /lib/modules/3.10.0-123.4.2.el7.x86_64/build/include/linux/kconfig.h \
        -include /var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/include/linux/compat-2.6.h \
         \
         \
         \
         \
        -I/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/include \
        -I/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/drivers/infiniband/debug \
        -I/usr/local/include/scst \
        -I/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/drivers/infiniband/ulp/srpt \
        -D__XEN_INTERFACE_VERSION__= \
        -I/usr/src/kernels/3.10.0-123.4.2.el7.x86_64/arch/x86/include/mach-xen \
        -I/usr/src/kernels/3.10.0-123.4.2.el7.x86_64/arch/x86/include \
        -Iarch/x86/include/generated -Iinclude \
         \
        -I/usr/src/kernels/3.10.0-123.4.2.el7.x86_64/arch/x86/include \
         -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mno-sse -mpreferred-stack-boundary=3 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -Wframe-larger-than=2048 -fstack-protector-strong -Wno-unused-but-set-variable -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -mfentry -DCC_USING_FENTRY -fno-inline-functions-called-once -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -DCC_HAVE_ASM_GOTO  -DMODULE  -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(main)"  -D"KBUILD_MODNAME=KBUILD_STR(compat)" -c -o /var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/compat/.tmp_main.o /var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/compat/main.c
In file included from <command-line>:0:0:
/var/tmp/OFED_topdir/BUILD/intel-mic-ofed-compat-rdma-3.5/include/linux/compat-2.6.h:6:27: fatal error: linux/version.h: No such file or directory
 #include <linux/version.h>
                           ^
compilation terminated.

I fixed the version.h problem, but other missing headers keep coming up: asm/types.h, asm/bitsperlong.h,...

As the most recent version of OFED-3.5-2-mic is -beta1, which was released at the beginning of May, is there perhaps a set of patches that need to be applied to the OFED distribution before attempting an installation? For some reason, the MPSS user guide explicitly states that RHEL 7 users should use that OFED stack, and the relevant portion of the guide doesn't list any additional steps apart from a straightforward installation.

Is there anyone out there who has managed to get MIC + Infiniband running on RHEL 7?

Best regards,

Steffen

0 Kudos
14 Replies
TimP
Honored Contributor III
1,877 Views

linux/version.h should be available under /usr/src/ if kernel sources (optional) are installed (and maybe configured) and their path correctly selected.  I don't know if the latter is a step in OFED preparation.  It doesn't look like a gcc version problem.

0 Kudos
Steffen_M_
Beginner
1,877 Views

Hi Tim,

thanks for the comment! I have installed both kernel-devel and kernel-headers for the currently running kernel (to be able to build kernel modules on RHEL, you need kernel-devel).

To me, the problem seems to be that a lot of kernel headers have been moved around, e.g. version.h used to be in /lib/modules/$VERSION/include/linux/version.h, but is now in /lib/modules/$VERSION/include/generated/uapi/linux/version.h. There is a similar story for a bunch of other headers, e.g. I think asm/types.h has to be changed to uapi/asm-generic/types.h.
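
One way to check where a given header actually ended up is to walk the kernel build tree and list every match. The following is only a minimal sketch: `find_header` is a hypothetical helper name (not part of OFED or MPSS), and the build tree location `/lib/modules/<release>/build` is an assumption taken from the compile line in the first post.

```python
import os
import platform


def find_header(root, relpath):
    """Return sorted paths under root that end with relpath, e.g. 'linux/version.h'."""
    suffix = os.sep + os.path.join(*relpath.split("/"))
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Match on whole path components so 'conversion.h' is not mistaken for 'version.h'.
            if full.endswith(suffix):
                matches.append(full)
    return sorted(matches)


if __name__ == "__main__":
    # Assumed location of the build tree for the running kernel.
    root = os.path.realpath(os.path.join("/lib/modules", platform.release(), "build"))
    for header in ("linux/version.h", "asm/types.h", "asm/bitsperlong.h"):
        print(header, "->", find_header(root, header) or "not found")
```

If a header shows up only under a `generated` or `uapi` directory, that would be consistent with the relocation described above.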

Unfortunately, I'm really not up to that task, as there are probably more severe problems lurking around the corner, and I don't know enough about the kernel to fix those...

0 Kudos
Frances_R_Intel
Employee
1,878 Views

I've passed the question on to the OFED developers here. We'll see what they have to say.

Frances

0 Kudos
Frances_R_Intel
Employee
1,877 Views

And the developer got right back to me. There is a slight release timing mismatch here. A new version of OFED-3.5-2-mic, the one that was tested against RHEL 7, is due out shortly. The developer said he expects "shortly" to be a few days. I will add a note when I hear it is out there.

0 Kudos
Steffen_M_
Beginner
1,874 Views

Hi Frances,

thanks for clearing up the issue! I'll just wait for an update from you.

Steffen

0 Kudos
Frances_R_Intel
Employee
1,878 Views

Boy, this was faster than I had expected. The following announcement was sent to the ewg mail list at openfabrics.org this morning:

OFED-3.5-2-MIC-rc1 is available at:
https://www.openfabrics.org/downloads/ofed-mic/ofed-3.5-2-mic/OFED-3.5-2-MIC-rc1.tgz

OFED-3.5-2-MIC requires the Intel(R) MPSS 3.x (YOCTO) release for Linux to be
installed on your system.  MPSS 3.x for Linux can be downloaded from:
http://software.intel.com/mic-developer

Changes from OFED-3.5-2-beta1 include:
- added support for RHEL 7.0
- updated DAPL package to release 2.0.42.2
- updated PSM package to intel-mic-psm-3.3
- updated ib_qib driver for mpss-3.3
- script and documentation updates
- bug fixes

0 Kudos
Steffen_M_
Beginner
1,878 Views

Yes, that really was fast!

Unfortunately, I didn't manage to get my setup up and running. The new version of the OFED-MIC stack installs cleanly, and the initial setup was straightforward - IPoIB works (tried doing NFS over it, no problems there). The basic InfiniBand diagnostic tools are also telling me that my network setup is fine.

But then I tried mounting NFS over RDMA and got strange errors. I can browse the mounted filesystem without any problem, but trying to read any file larger than about 800 bytes results in errors like this:

cat: README: Input/output error

I then tried to verify whether there is a compatibility problem between CentOS 6 and 7, so I installed CentOS 7 on a second node and tried to mount an NFS share from there using RDMA. Unfortunately, I keep getting the same error.

I then tested lower-level connectivity using ib_write_bw (installed as part of the OFED stack). After fiddling around with transfer sizes a little, I noticed that transfers of more than 64k fail for some reason. Here are the outputs (I ran "ib_write_bw -F -R" on the server side):

[root@node01 ~]# ib_write_bw --size=65536 -F -R node02.ib
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx4_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 2048
 Link type       : IB
 Max inline data : 0
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0070 PSN 0x537e81
 remote address: LID 0x02 QPN 0x006b PSN 0xde3ac7
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1680.109000 != 1397.593000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1705.156000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1797.359000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1424.828000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1519.656000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1736.875000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1398.468000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1708.546000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1730.859000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1302.765000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1436.093000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1398.687000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1337.437000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1357.671000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1381.953000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1360.406000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1371.234000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1314.687000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1414.109000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1469.781000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1564.171000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1381.406000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 2800.000000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1499.968000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1499.750000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1532.453000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1704.609000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1494.171000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 2365.343000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1354.171000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1429.203000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1332.296000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1333.390000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1317.640000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1341.375000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1330.875000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1328.687000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1315.781000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 1680.109000 != 1316.109000
Test integrity may be harmed !
Warning: measured timestamp frequency 2800.04 differs from nominal 1680.11 MHz
 65536      5000           3631.36            3631.36		   0.058102
---------------------------------------------------------------------------------------
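
As a sanity check on the successful run, the message rate in the last column should follow from the average bandwidth and the message size. A minimal sketch, assuming that "MB" in the output means 2^20 bytes and "Mpps" means 10^6 messages per second; `msg_rate_mpps` is a hypothetical helper name used only for illustration.

```python
def msg_rate_mpps(bw_mb_per_s, msg_bytes, mb=2**20):
    """Convert a bandwidth in MB/s into a message rate in millions of messages per second."""
    bytes_per_second = bw_mb_per_s * mb
    return bytes_per_second / msg_bytes / 1e6


if __name__ == "__main__":
    # Values taken from the 65536-byte run above.
    print(round(msg_rate_mpps(3631.36, 65536), 6))
```

Under those assumptions the result agrees with the reported 0.058102 to six decimal places, so the numbers of the 65536-byte run are at least internally consistent.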

 

Going to a larger size always fails like this:

[root@node01 ~]# ib_write_bw --size=131072 -F -R node02.ib
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx4_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 2048
 Link type       : IB
 Max inline data : 0
 rdma_cm QPs	 : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0072 PSN 0xd8d988
 remote address: LID 0x02 QPN 0x006d PSN 0xa6f8b6
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Problems with warm up
 Failed to complete run_iter_bw function successfully

Here, I doubled the transfer size, but any value larger than 65536 will trigger the same behavior.
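
To pin down the exact threshold instead of trying sizes by hand, the size range can be bisected. This is only a sketch under stated assumptions: `largest_passing` and `run_ib_write_bw` are hypothetical helper names, failure is assumed to be monotonic (once a size fails, every larger size fails too), and `ib_write_bw` is assumed to exit with a non-zero status on failure, which the output above does not confirm. The flags are the ones used in the runs above.

```python
import subprocess


def largest_passing(ok, lo, hi):
    """Return the largest n in [lo, hi] for which ok(n) is true.

    Assumes ok(lo) is true and that ok is monotonic.
    """
    if not ok(lo):
        raise ValueError("lower bound must pass")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def run_ib_write_bw(size, server="node02.ib"):
    """Run the client side once and report success.

    Assumption: a non-zero exit status signals failure, and the server side
    ("ib_write_bw -F -R") has been started again before each call.
    """
    cmd = ["ib_write_bw", f"--size={size}", "-F", "-R", server]
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

On the client, `largest_passing(run_ib_write_bw, 65536, 131072)` would then need at most 16 test runs to locate the largest working size between the two values tried above.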

At this point, I'm rather lost. Does someone with more experience with InfiniBand have any ideas? Our hardware setup is ConnectX-3 QDR cards (mlx4 driver) with a QDR Mellanox switch. I'd be very grateful!

BTW, I'll be on holidays until the beginning of August, so I won't be able to answer any further questions until then.

0 Kudos
Pavel_Lavrenko
Beginner
1,878 Views

Hi Steffen!

I'm wondering whether you have solved your problem already. It seems I have the same issue with OFED 3.5.2 rc3.

0 Kudos
Steffen_M_
Beginner
1,878 Views

Pavel Lavrenko wrote:

Hi Steffen!

I'm wondering whether you have solved your problem already. It seems I have the same issue with OFED 3.5.2 rc3.

Hi Pavel,

no, I asked on the OFED side and apparently this won't be fixed in OFED-3.5. But according to the OFED mailing list, OFED-3.12 should be released next week or so, and it includes MIC support in the mainline distribution. We'll test whether that works with CentOS 7 and RDMA once it is out.

Steffen

0 Kudos
Eric_B_1
New Contributor I
1,878 Views

So I am also working on this issue and have had no luck with the Mellanox version of OFED.

I got the version.h error; here is a fix, which may not be correct.

Notes: the instructions in the guide are wrong. Step 4 states:

rpmbuild --rebuild --define “MOFED 1”

The correct command is:

rpmbuild --rebuild --define 'MOFED 1' ofed-driver*.src.rpm

The fatal error still occurs:

fatal error: linux/version.h: No such file or directory #include <linux/version.h>

Workaround:

export C_INCLUDE_PATH=/usr/include/

This causes a second error:

fatal error: asm/unistd_64_x32.h: No such file or directory

Second workaround:

export C_INCLUDE_PATH=$C_INCLUDE_PATH:/usr/src/kernels/3.10.0-229.el7.x86_64/arch/x86/include/generated/

Now I get a lot of errors like:

 error: dereferencing pointer to incomplete type   entry->write_proc = ibscif_stats_write;

0 Kudos
Frances_R_Intel
Employee
1,878 Views

I'm a bit surprised that /usr/include wasn't in your default include path. The file /usr/include/linux/version.h was there, so I am taking it as a given that you did install kernel-headers and kernel-devel.

I will put in a documentation bug report about the use of double quotes when single quotes should be used.

As for the other problems, you don't say exactly which Linux release and which MPSS release you are using. If you are using CentOS 7 or later, as the originator of this thread was, then I would suggest using Open Fabrics OFED 3.12 or 3.18 rather than the Mellanox OFED. The MPSS 3.4 documentation said that with RHEL 7.0 and later you should use the Open Fabrics OFED, but the MPSS 3.5 and 3.6 documentation no longer says so explicitly. What the current documentation does say is:

Each OFED distribution supports a subset of the Intel® MPSS supported OS distros; most support SLES* 11 SP3 and RHEL* 6.2/3/4/5/6. Newer distros may not be officially supported by any released OFED (at the time of this writing: RHEL* 6.7, SLES* 11 SP4). Check the respective release notes for the exact supported distros.

So it sounds like another documentation bug report I should put in.

0 Kudos
Eric_B_1
New Contributor I
1,878 Views
0 Kudos
Steffen_M_
Beginner
1,878 Views

The way I got it to run is by using OFED-3.18 from http://downloads.openfabrics.org/OFED/

0 Kudos
Eric_B_1
New Contributor I
1,878 Views

Yes, CentOS 7.1.

I tried OFED-3.18 and no dice. I'll try a clean install and try again.

0 Kudos