CentOS 6.6 and MICs

dotsonml
Beginner

I just upgraded my nodes hosting 4 Xeon Phis to CentOS 6.6. The kernel is 2.6.32-504.el6.x86_64. I also upgraded the OFED drivers to Mellanox OFED 2.3-2.0.0. Is this version compatible with MPSS 3.4.1? I've been having problems running MIC code on more than one of the MICs using mpirun:

mpirun -hosts mic0,mic1,mic2,mic3 -n 4 -ppn 1 ./montecarlo.mic
n01-mic1:MCM:1595:43da8700: 4789 us(4789 us): scif_connect() to port 68, failed with error Connection refused
n01-mic1:MCM:1595:43da8700: 4893 us(104 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic2:MCM:150c:5d260700: 4279 us(4279 us): scif_connect() to port 68, failed with error Connection refused
n01-mic2:MCM:150c:5d260700: 4386 us(107 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n05-mic0:MCM:1343:2be38700: 5445 us(5445 us):  ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 4931
n05-mic0:MCM:1343:2be38700: 5608 us(163 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)

I'm just wondering if I have the wrong version of Mellanox OFED.

Thanks,

 

Mark

 

Frances_R_Intel
Employee

CentOS is not one of the distributions for which the MPSS is officially supported, although it is close enough to RHEL that people generally don't have problems. However, the 2.6.32-504.el6.x86_64 kernel is not one of the default kernels the mic kernel module is pre-compiled for. If you have not already done so, you will want to rebuild the mpss-modules-*.rpm files following the directions in section 2.1 of the readme.txt.
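
One quick way to see whether a module was built for the running kernel is to compare `uname -r` with the module's vermagic string. The sketch below is only an illustration: `vermagic_matches` is a hypothetical helper (not part of MPSS), and the module name `mic` in the usage comment is an assumption you should confirm with `lsmod`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the vermagic string begins with the
# given kernel release, i.e. the module was built for that kernel.
vermagic_matches() {
    kernel="$1"      # e.g. the output of: uname -r
    vermagic="$2"    # e.g. the output of: modinfo -F vermagic mic
    case "$vermagic" in
        "$kernel"|"$kernel "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the host (the module name "mic" is an assumption):
#   if vermagic_matches "$(uname -r)" "$(modinfo -F vermagic mic)"; then
#       echo "mic module matches the running kernel"
#   else
#       echo "mic module was built for a different kernel; rebuild it"
#   fi
```

If the two do not match, rebuilding the mpss-modules packages as described in the readme is the next step.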

Also see if you can log into the coprocessors using ssh without a password and, as root, run micinfo and miccheck to see if they detect any problems.
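
A minimal sketch of the ssh check, assuming the cards are reachable under names such as mic0 through mic3 (adjust to your own hostnames). `check_card` is a hypothetical helper; it relies only on standard OpenSSH client options.

```shell
#!/bin/sh
# Report whether key-based (password-less) ssh works for one host.
check_card() {
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # -n keeps ssh from reading stdin when called inside a loop.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1; then
        echo "$1: passwordless ssh ok"
    else
        echo "$1: passwordless ssh FAILED"
    fi
}

# Example (card names are assumptions):
#   for card in mic0 mic1 mic2 mic3; do check_card "$card"; done
# Then, as root on the host:
#   micinfo
#   miccheck
```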

dotsonml
Beginner

Hi, Frances, and thank you for your reply.

I have built the mpss-modules for my kernel using rpmbuild, along with the OFED RPMs. I am using Mellanox OFED 2.3-2.0. I can run micinfo from the host, but the command is not available on the MICs themselves (I can log in to them using ssh with no problems). I have the libraries and home directories mounted on the MICs with NFS.

Micinfo from the host looks like this:

MicInfo Utility Log
Created Thu Dec  4 12:56:17 2014


    System Info
        HOST OS            : Linux
        OS Version        : 2.6.32-504.el6.x86_64
        Driver Version        : 3.4.1-1
        MPSS Version        : 3.4.1
        Host Physical Memory    : 32840 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version          : 2.1.02.0390
        SMC Firmware Version     : 1.16.5078
        SMC Boot Loader Version     : 1.8.4326
        uOS Version          : 2.6.38.8+mpss3.4.1
        Device Serial Number      : ADKC33400419

    Board
        Vendor ID          : 0x8086
        Device ID          : 0x2250
        Subsystem ID          : 0x2500
        Coprocessor Stepping ID     : 3
        PCIe Width          : x16
        PCIe Speed          : 5 GT/s
        PCIe Max payload size     : 256 bytes
        PCIe Max read req size     : 512 bytes
        Coprocessor Model     : 0x01
        Coprocessor Model Ext     : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family     : 0x0b
        Coprocessor Family Ext     : 0x00
        Coprocessor Stepping      : B1
        Board SKU          : B1PRQ-5110P/5120D
        ECC Mode          : Enabled
        SMC HW Revision      : Product 225W Passive CS

    Cores
        Total No of Active Cores : 60
        Voltage          : 1011000 uV
        Frequency         : 1052631 kHz

    Thermal
        Fan Speed Control      : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 56 C

    GDDR
        GDDR Vendor         : Elpida
        GDDR Version         : 0x1
        GDDR Density         : 2048 Mb
        GDDR Size         : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed         : 5.000000 GT/s
        GDDR Frequency         : 2500000 kHz
        GDDR Voltage         : 1501000 uV

Device No: 1, Device Name: mic1

    Version
        Flash Version          : 2.1.02.0390
        SMC Firmware Version     : 1.16.5078
        SMC Boot Loader Version     : 1.8.4326
        uOS Version          : 2.6.38.8+mpss3.4.1
        Device Serial Number      : ADKC32300211

    Board
        Vendor ID          : 0x8086
        Device ID          : 0x2250
        Subsystem ID          : 0x2500
        Coprocessor Stepping ID     : 3
        PCIe Width          : x16
        PCIe Speed          : 5 GT/s
        PCIe Max payload size     : 256 bytes
        PCIe Max read req size     : 512 bytes
        Coprocessor Model     : 0x01
        Coprocessor Model Ext     : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family     : 0x0b
        Coprocessor Family Ext     : 0x00
        Coprocessor Stepping      : B1
        Board SKU          : B1PRQ-5110P/5120D
        ECC Mode          : Enabled
        SMC HW Revision      : Product 225W Passive CS

    Cores
        Total No of Active Cores : 60
        Voltage          : 930000 uV
        Frequency         : 1052631 kHz

    Thermal
        Fan Speed Control      : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 61 C

    GDDR
        GDDR Vendor         : Elpida
        GDDR Version         : 0x1
        GDDR Density         : 2048 Mb
        GDDR Size         : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed         : 5.000000 GT/s
        GDDR Frequency         : 2500000 kHz
        GDDR Voltage         : 1501000 uV

Device No: 2, Device Name: mic2

    Version
        Flash Version          : 2.1.02.0390
        SMC Firmware Version     : 1.16.5078
        SMC Boot Loader Version     : 1.8.4326
        uOS Version          : 2.6.38.8+mpss3.4.1
        Device Serial Number      : ADKC31900335

    Board
        Vendor ID          : 0x8086
        Device ID          : 0x2250
        Subsystem ID          : 0x2500
        Coprocessor Stepping ID     : 3
        PCIe Width          : x16
        PCIe Speed          : 5 GT/s
        PCIe Max payload size     : 256 bytes
        PCIe Max read req size     : 512 bytes
        Coprocessor Model     : 0x01
        Coprocessor Model Ext     : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family     : 0x0b
        Coprocessor Family Ext     : 0x00
        Coprocessor Stepping      : B1
        Board SKU          : B1PRQ-5110P/5120D
        ECC Mode          : Enabled
        SMC HW Revision      : Product 225W Passive CS

    Cores
        Total No of Active Cores : 60
        Voltage          : 1002000 uV
        Frequency         : 1052631 kHz

    Thermal
        Fan Speed Control      : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 60 C

    GDDR
        GDDR Vendor         : Elpida
        GDDR Version         : 0x1
        GDDR Density         : 2048 Mb
        GDDR Size         : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed         : 5.000000 GT/s
        GDDR Frequency         : 2500000 kHz
        GDDR Voltage         : 1501000 uV

Device No: 3, Device Name: mic3

    Version
        Flash Version          : 2.1.02.0390
        SMC Firmware Version     : 1.16.5078
        SMC Boot Loader Version     : 1.8.4326
        uOS Version          : 2.6.38.8+mpss3.4.1
        Device Serial Number      : ADKC32300310

    Board
        Vendor ID          : 0x8086
        Device ID          : 0x2250
        Subsystem ID          : 0x2500
        Coprocessor Stepping ID     : 3
        PCIe Width          : x16
        PCIe Speed          : 5 GT/s
        PCIe Max payload size     : 256 bytes
        PCIe Max read req size     : 512 bytes
        Coprocessor Model     : 0x01
        Coprocessor Model Ext     : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family     : 0x0b
        Coprocessor Family Ext     : 0x00
        Coprocessor Stepping      : B1
        Board SKU          : B1PRQ-5110P/5120D
        ECC Mode          : Enabled
        SMC HW Revision      : Product 225W Passive CS

    Cores
        Total No of Active Cores : 60
        Voltage          : 935000 uV
        Frequency         : 1052631 kHz

    Thermal
        Fan Speed Control      : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 62 C

    GDDR
        GDDR Vendor         : Elpida
        GDDR Version         : 0x1
        GDDR Density         : 2048 Mb
        GDDR Size         : 7936 MB
        GDDR Technology         : GDDR5
        GDDR Speed         : 5.000000 GT/s
        GDDR Frequency         : 2500000 kHz
        GDDR Voltage         : 1501000 uV

Thanks,

 

Mark

Frances_R_Intel
Employee

Apologies - the first time I read your posting, I missed the part about running Mellanox OFED 2.3. Is it possible for you to pull back to the 2.2 version? I know people sometimes object to doing this because the other nodes on their InfiniBand network are all running 2.3, but 2.1 and 2.2 are the only versions currently supported. Your problems might, indeed, be from the version of OFED you are using.

dotsonml
Beginner

Frances,

Could you tell me if any current OFED or MOFED release will support both MICs and kernel version 2.6.32-504.1.3.el6.x86_64?

Thanks,

Mark

Frances_R_Intel
Employee

I talked to the OFED developers here and they recommend OFED-3.5-2-mic. The directions for getting and installing it are in section 2.4 of the MPSS User's Guide.

dotsonml
Beginner

I have upgraded to the latest MPSS (3.4.2), which supposedly supports Mellanox OFED 2.3. I have 2.3-2.0.0 installed on my host. I tried the following commands with test code:

mpirun -n 4 -host n01-mic0 ./mpi-hello.mic

Result:

Hello world from process 001 out of 004, processor name n01-mic0
Hello world from process 002 out of 004, processor name n01-mic0
Hello world from process 003 out of 004, processor name n01-mic0
Hello world from process 000 out of 004, processor name n01-mic0

Next, I tried running on two mics:

mpirun -n 4 -host n01-mic0 ./mpi-hello.mic : -n 4 -host n01-mic1 ./mpi-hello.mic

Result:

n01-mic0:MCM:13a2:50a05700: 3875 us(3875 us): scif_connect() to port 68, failed with error Connection refused
n01-mic0:MCM:13a2:50a05700: 3980 us(105 us):  open_hca: SCIF init ERR for mlx4_0
n01-mic0:MCM:13a3:207d700: 4060 us(4060 us): scif_connect() to port 68, failed with error Connection refused
n01-mic0:MCM:13a3:207d700: 4122 us(62 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)

Basically the same error I was getting before upgrading MPSS. Can you help?

Mark

 

dotsonml
Beginner

I forgot to mention that I tried OFED-3.5-2-mic, as well, with the same results.

 

Thanks,

Mark

Loc_N_Intel
Employee

For multi-card usage, you need to enable peer-to-peer communication:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

Thank you.

dotsonml
Beginner

Would this be on the host or on the mics themselves?

Thanks.

Loc_N_Intel
Employee

The command should be executed only on the host. Thank you.
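
To confirm the setting took effect on the host, the current value can be read back from /proc. The sketch below is an illustration only; `ip_forward_state` is a hypothetical helper, and its optional path argument exists just so it can be tried against a scratch file.

```shell
#!/bin/sh
# Report the IPv4 forwarding state. The optional argument (a file path)
# is only there so the helper can be exercised without touching /proc.
ip_forward_state() {
    f="${1:-/proc/sys/net/ipv4/ip_forward}"
    case "$(cat "$f" 2>/dev/null)" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}

ip_forward_state

# "sysctl -w" lasts only until the next reboot. To keep the setting on an
# EL6 host, add this line to /etc/sysctl.conf and reload it with
# "sudo /sbin/sysctl -p":
#   net.ipv4.ip_forward = 1
```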

dotsonml
Beginner

I implemented your suggestion and reran the command. The errors did change. I no longer get "connection refused," but still no success.

[dotsonml@n01 ~]$ mpirun -n 4 -host n01-mic0 ./mpi-hello.mic : -n 4 -host n01-mic1 ./mpi-hello.mic
n01-mic0:MCM:139e:6de14700: 6707 us(6707 us):  ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5022
n01-mic0:MCM:139e:6de14700: 6830 us(123 us):  open_hca: SCIF init ERR for mlx4_0
n01-mic0:MCM:139f:9c310700: 8674 us(8674 us):  ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5023
n01-mic0:MCM:139f:9c310700: 8798 us(124 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic0:MCM:13a0:9587d700: 4205 us(4205 us):  ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5024
n01-mic0:MCM:13a0:9587d700: 4345 us(140 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic0:MCM:139d:e0b65700: 5578 us(5578 us):  ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5021
n01-mic0:MCM:139d:e0b65700: 5700 us(122 us):  open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)

Mark

 

dotsonml
Beginner

I ended up adding the environment variable, I_MPI_FABRICS=tcp, and that fixed my problem.

[dotsonml@n01 ~]$ mpirun -n 15 -perhost 5  -f hostfile ./mpi-hello.mic
Hello world from process 005 out of 015, processor name n01-mic1
Hello world from process 006 out of 015, processor name n01-mic1
Hello world from process 012 out of 015, processor name n01-mic2
Hello world from process 007 out of 015, processor name n01-mic1
Hello world from process 013 out of 015, processor name n01-mic2
Hello world from process 014 out of 015, processor name n01-mic2
Hello world from process 001 out of 015, processor name n01-mic0
Hello world from process 008 out of 015, processor name n01-mic1
Hello world from process 010 out of 015, processor name n01-mic2
Hello world from process 011 out of 015, processor name n01-mic2
Hello world from process 002 out of 015, processor name n01-mic0
Hello world from process 003 out of 015, processor name n01-mic0
Hello world from process 009 out of 015, processor name n01-mic1
Hello world from process 000 out of 015, processor name n01-mic0
Hello world from process 004 out of 015, processor name n01-mic0
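
The post does not show the hostfile or how the variable was set, so the following is a sketch under assumptions: the hostfile contents are inferred from the processor names in the output above, and the variable is either exported or passed per run with Intel MPI's `-genv` option. The `HOSTFILE` override is hypothetical.

```shell
#!/bin/sh
# Assumed hostfile: one coprocessor hostname per line, inferred from the
# processor names printed in the output above.
hostfile="${HOSTFILE:-./hostfile}"
cat > "$hostfile" <<'EOF'
n01-mic0
n01-mic1
n01-mic2
EOF

# Option 1: select TCP for every later mpirun started from this shell.
export I_MPI_FABRICS=tcp

# Option 2: select it for a single run with -genv instead:
#   mpirun -genv I_MPI_FABRICS tcp -n 15 -perhost 5 -f "$hostfile" ./mpi-hello.mic
```

With 15 ranks and `-perhost 5`, three hosts are needed, which matches the five ranks per card seen in the output.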

Thanks,

Mark

Loc_N_Intel
Employee

Judging from the errors your program printed, it seems the MPI runtime chose the DAPL fabric. That is because you have OFED installed. To run your program with the DAPL fabric, you need to do some extra steps:

First, start the OFED services:

# service openibd start

# service ofed-mic start

# service mpxyd start

Then select the DAPL fabric and DAPL provider:

# export I_MPI_FABRICS=dapl

# export I_MPI_DAPL_PROVIDER=ofa-v2-scif0

But if you only want to use TCP, you can skip the steps above and just specify I_MPI_FABRICS=tcp as you mentioned.
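
The same fabric settings can also be passed per run with `-genv` rather than exported. As a rough sketch, the hypothetical helper `fabric_args` below prints the arguments for either choice so the full command can be reviewed before it is run; the provider name is the one given above, and nothing here starts the OFED services for you.

```shell
#!/bin/sh
# Hypothetical helper: print the -genv arguments for a chosen fabric.
fabric_args() {
    case "$1" in
        tcp)  echo "-genv I_MPI_FABRICS tcp" ;;
        dapl) echo "-genv I_MPI_FABRICS dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-scif0" ;;
        *)    echo "unknown fabric: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the two-card command from earlier in the thread:
echo "mpirun $(fabric_args dapl) -n 4 -host n01-mic0 ./mpi-hello.mic : -n 4 -host n01-mic1 ./mpi-hello.mic"
```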

dotsonml
Beginner

Thanks a lot.

Is there a difference in the speed and latencies with dapl, ofa, and tcp? Is one better than another?

Mark

Frances_R_Intel
Employee

The performance from highest to lowest is dapl (uses OFED with iWarp across the PCIe interface and with direct memory access), ofa (uses OFED, again iWarp), tcp (uses IP).

Are you running opensm either on the host node or somewhere else on your InfiniBand network? If not, could you start it?

A few things you might look at to further troubleshoot this problem -

Try adding "-genv I_MPI_DEBUG 3" to your mpirun command. It will provide DAPL debug messages as well as the standard MPI messages.

See what 'ibv_status' has to say on the host and coprocessors. You should see a scif0 interface, which is the interface to all the cards, and the mlx4_0 (or whatever version you have) for the actual interface card on the host. I think it is curious that your connection refused messages all say mlx4_0, but I don't know if that is an error or not.

What does 'ip link show' say about the interfaces?  

What does ibnetdiscover on the host say about the coprocessors? (Unfortunately this command is not installed on the coprocessors.)
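
The checks above can be gathered in one pass. This is a sketch, not a definitive procedure: `run_if_present` is a hypothetical helper, and `ibv_devinfo` (a standard libibverbs utility) is used as an assumption for what is called 'ibv_status' above.

```shell
#!/bin/sh
# Run a diagnostic command if it is installed; never abort the whole script.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 || echo "($1 exited with an error)"
    else
        echo "$1: not installed, skipped"
    fi
}

run_if_present ibv_devinfo      # assumption: stands in for 'ibv_status'
run_if_present ip link show     # network interfaces
run_if_present ibnetdiscover    # host only; not installed on the coprocessors
```

Redirecting the output to a file on both the host and a coprocessor makes the two easy to compare.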

And just as a point of interest:

I was talking to Loc about his use of '/sbin/sysctl -w net.ipv4.ip_forward=1'. I prefer to set up a bridge (sections 14.4.5.4 and 14.4.5.5 in the MPSS User's Guide in the MPSS 3.4 release). It is a gentler way to handle routing of IP traffic between the coprocessors, although the sysctl command is much simpler. You don't need a bridge for IB traffic. I was surprised that turning IP forwarding on affected your IB traffic. I guess this is just one more thing I don't know about networking. (So much to learn; so little time.) Do you have IPoIB turned on?

dotsonml
Beginner

Thanks, Frances. I have everything working now. This forum has helped immensely.

Mark
