I just upgraded my nodes hosting 4 Xeon Phis to CentOS 6.6. The kernel is 2.6.32-504.el6.x86_64. I also upgraded the OFED drivers to Mellanox 2.3-2.0.0. Is this version compatible with MPSS 3.4.1? I've been having problems running MIC code on more than one of the mics using mpirun:
mpirun -hosts mic0,mic1,mic2,mic3 -n 4 -ppn 1 ./montecarlo.mic
n01-mic1:MCM:1595:43da8700: 4789 us(4789 us): scif_connect() to port 68, failed with error Connection refused
n01-mic1:MCM:1595:43da8700: 4893 us(104 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic2:MCM:150c:5d260700: 4279 us(4279 us): scif_connect() to port 68, failed with error Connection refused
n01-mic2:MCM:150c:5d260700: 4386 us(107 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n05-mic0:MCM:1343:2be38700: 5445 us(5445 us): ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 4931
n05-mic0:MCM:1343:2be38700: 5608 us(163 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
I'm just wondering if I have the wrong version of Mellanox OFED.
Thanks,
Mark
CentOS is not one of the distributions for which the MPSS is officially supported, although it is close enough to RHEL that people generally don't have problems. However, the 2.6.32-504.el6.x86_64 kernel is not one of the default kernels the mic kernel module is pre-compiled for. If you have not already done so, you will want to rebuild the mpss-modules-*.rpm files following the directions in section 2.1 of the readme.txt.
Also see if you can log into the coprocessors using ssh without a password and, as root, run micinfo and miccheck to see if they detect any problems.
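A rough sketch of the passwordless ssh check, in case it helps. The function names are made up for illustration, and the card names you pass in are whatever your cards are called (this thread uses both mic0 and n01-mic0 style names). BatchMode=yes makes ssh fail instead of prompting for a password, so a card that still needs a password shows up as FAILED.

```shell
#!/bin/sh
# Sketch: report which coprocessors accept passwordless ssh logins.
# check_ssh and report_cards are hypothetical helper names.

# check_ssh HOST: succeeds only if ssh can log in without prompting.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

# report_cards HOST...: print one status line per card.
report_cards() {
    for card in "$@"; do
        if check_ssh "$card"; then
            echo "$card: passwordless ssh OK"
        else
            echo "$card: passwordless ssh FAILED"
        fi
    done
}
```

Call it as `report_cards mic0 mic1 mic2 mic3`, then run micinfo and miccheck as root on the host.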
Hi, Frances, and thank you for your reply.
I have built the mpss-modules for my kernel using rpmbuild, along with the ofed rpms. I am using Mellanox OFED 2.3-2.0. I can do micinfo from the host, but the command is not available on the mics themselves (I can login to them using ssh with no problems). I have the libraries and home directories mounted on the mics with NFS.
Micinfo from the host looks like this:
MicInfo Utility Log
Created Thu Dec 4 12:56:17 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-504.el6.x86_64
Driver Version : 3.4.1-1
MPSS Version : 3.4.1
Host Physical Memory : 32840 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC33400419
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1011000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 56 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC32300211
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 930000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 61 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
Device No: 2, Device Name: mic2
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC31900335
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1002000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 60 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
Device No: 3, Device Name: mic3
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC32300310
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 935000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 62 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
Thanks,
Mark
Apologies - the first time I read your posting, I missed the part about running Mellanox 2.3. Is it possible for you to pull back to the 2.2 version? I know people sometimes object to doing this because the other nodes on their InfiniBand network are all running 2.3, but 2.1 and 2.2 are the only versions currently supported. Your problems might, indeed, come from the version of OFED you are using.
Frances,
Could you tell me if any current OFED or MOFED release will support both MICs and kernel version 2.6.32-504.1.3.el6.x86_64?
Thanks,
Mark
I talked to the OFED developers here and they recommend OFED-3.5-2-mic. The directions for getting and installing it are in section 2.4 of the MPSS User's Guide.
I have upgraded to the latest MPSS (3.4.2), which supposedly supports Mellanox OFED 2.3. I have 2.3-2.0.0 installed on my host. I tried the following commands with test code:
mpirun -n 4 -host n01-mic0 ./mpi-hello.mic
Result:
Hello world from process 001 out of 004, processor name n01-mic0
Hello world from process 002 out of 004, processor name n01-mic0
Hello world from process 003 out of 004, processor name n01-mic0
Hello world from process 000 out of 004, processor name n01-mic0
Next, I tried running on two mics:
mpirun -n 4 -host n01-mic0 ./mpi-hello.mic : -n 4 -host n01-mic1 ./mpi-hello.mic
Result:
n01-mic0:MCM:13a2:50a05700: 3875 us(3875 us): scif_connect() to port 68, failed with error Connection refused
n01-mic0:MCM:13a2:50a05700: 3980 us(105 us): open_hca: SCIF init ERR for mlx4_0
n01-mic0:MCM:13a3:207d700: 4060 us(4060 us): scif_connect() to port 68, failed with error Connection refused
n01-mic0:MCM:13a3:207d700: 4122 us(62 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
Basically the same error I was getting before upgrading MPSS. Can you help?
Mark
I forgot to mention that I tried OFED-3.5-2-mic, as well, with the same results.
Thanks,
Mark
For multi-card usage, you need to enable peer-to-peer communication:
$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1
Thank you.
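If you want to confirm the setting took effect, something like the sketch below may help. The kernel reports the current value (0 or 1) in /proc/sys/net/ipv4/ip_forward. The function name and the IP_FORWARD_FILE override are just illustrative assumptions.

```shell
#!/bin/sh
# Sketch: report whether IPv4 forwarding is currently enabled.
# IP_FORWARD_FILE can be overridden; it defaults to the standard /proc path.
IP_FORWARD_FILE=${IP_FORWARD_FILE:-/proc/sys/net/ipv4/ip_forward}

# forwarding_state: print "enabled" if the file contains 1, else "disabled".
forwarding_state() {
    if [ "$(cat "$IP_FORWARD_FILE" 2>/dev/null)" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

Note that `sysctl -w` only lasts until the next reboot. To keep the setting, put `net.ipv4.ip_forward = 1` in /etc/sysctl.conf and reload it with `sysctl -p`.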
Would this be on the host or on the mics themselves?
Thanks.
The command should be executed only on the host. Thank you.
I implemented your suggestion and reran the command. The errors did change. I no longer get "connection refused," but still no success.
[dotsonml@n01 ~]$ mpirun -n 4 -host n01-mic0 ./mpi-hello.mic : -n 4 -host n01-mic1 ./mpi-hello.mic
n01-mic0:MCM:139e:6de14700: 6707 us(6707 us): ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5022
n01-mic0:MCM:139e:6de14700: 6830 us(123 us): open_hca: SCIF init ERR for mlx4_0
n01-mic0:MCM:139f:9c310700: 8674 us(8674 us): ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5023
n01-mic0:MCM:139f:9c310700: 8798 us(124 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic0:MCM:13a0:9587d700: 4205 us(4205 us): ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5024
n01-mic0:MCM:13a0:9587d700: 4345 us(140 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
n01-mic0:MCM:139d:e0b65700: 5578 us(5578 us): ERR: dev_open ver (exp 5 rcv 5), op IA_OPEN, flgs 2, st 5 dev_id 5021
n01-mic0:MCM:139d:e0b65700: 5700 us(122 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
Mark
I ended up adding the environment variable, I_MPI_FABRICS=tcp, and that fixed my problem.
[dotsonml@n01 ~]$ mpirun -n 15 -perhost 5 -f hostfile ./mpi-hello.mic
Hello world from process 005 out of 015, processor name n01-mic1
Hello world from process 006 out of 015, processor name n01-mic1
Hello world from process 012 out of 015, processor name n01-mic2
Hello world from process 007 out of 015, processor name n01-mic1
Hello world from process 013 out of 015, processor name n01-mic2
Hello world from process 014 out of 015, processor name n01-mic2
Hello world from process 001 out of 015, processor name n01-mic0
Hello world from process 008 out of 015, processor name n01-mic1
Hello world from process 010 out of 015, processor name n01-mic2
Hello world from process 011 out of 015, processor name n01-mic2
Hello world from process 002 out of 015, processor name n01-mic0
Hello world from process 003 out of 015, processor name n01-mic0
Hello world from process 009 out of 015, processor name n01-mic1
Hello world from process 000 out of 015, processor name n01-mic0
Hello world from process 004 out of 015, processor name n01-mic0
Thanks,
Mark
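For anyone landing here later, a minimal sketch of the TCP setup might look like the following. The hostfile contents were not posted, so the three host names used in the usage note are an assumption taken from the output above, and write_hostfile is a hypothetical helper. Intel MPI hostfiles list one host per line.

```shell
#!/bin/sh
# Sketch: select the TCP fabric and write a hostfile for mpirun -f.
# write_hostfile is a hypothetical helper name.

# write_hostfile FILE HOST...: write one host name per line to FILE.
write_hostfile() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# Tell Intel MPI to use TCP instead of DAPL for all communication.
export I_MPI_FABRICS=tcp
```

After sourcing this, `write_hostfile hostfile n01-mic0 n01-mic1 n01-mic2` followed by the mpirun command shown above should reproduce the run. The variable can also be passed per run with `mpirun -genv I_MPI_FABRICS tcp ...`.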
Judging from the errors your program printed, it looks like the MPI runtime chose the DAPL fabric. It did so because you have OFED installed. To run your program over the DAPL fabric, you need a few extra steps.
First, start the OFED services:
# service openibd start
# service ofed-mic start
# service mpxyd start
Then select the DAPL fabric and the DAPL provider:
# export I_MPI_FABRICS=dapl
# export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
If you only want to use TCP, you can skip the steps above and just specify I_MPI_FABRICS=tcp, as you mentioned.
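The DAPL steps could be wrapped up roughly as below. This is only a sketch: it assumes each of the three init scripts supports a `status` action that returns 0 when the service is running, which I have not verified for ofed-mic and mpxyd. The function names are made up.

```shell
#!/bin/sh
# Sketch: check the OFED-related services, then select DAPL over SCIF.
# Assumes "service NAME status" exits 0 when NAME is running.

# check_services: print one line per service; return 1 if any is down.
check_services() {
    rc=0
    for svc in openibd ofed-mic mpxyd; do
        if service "$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: not running"
            rc=1
        fi
    done
    return $rc
}

# use_dapl: export the two variables given in the post above.
use_dapl() {
    export I_MPI_FABRICS=dapl
    export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
}
```

A typical use would be `check_services && use_dapl` as root before calling mpirun, so the DAPL variables are only set when all three services report as running.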
Thanks a lot.
Is there a difference in the speed and latencies with dapl, ofa, and tcp? Is one better than another?
Mark
The performance from highest to lowest is dapl (uses ofed with iWarp across the PCIe interface and with direct memory access), ofa (uses ofed - again iWarp), tcp (uses ip).
Are you running opensm either on the host node or somewhere else on your InfiniBand network? If not, could you start it?
A few things you might look at to further troubleshoot this problem -
Try adding "-genv I_MPI_DEBUG 3" to your mpirun command. It will provide DAPL debug messages as well as the standard MPI messages.
See what 'ibv_status' has to say on the host and coprocessors. You should see a scif0 interface, which is the interface to all the cards, and mlx4_0 (or whatever version you have) for the actual interface card on the host. I think it is curious that your connection refused messages all say mlx4_0, but I don't know if that is an error or not.
What does 'ip link show' say about the interfaces?
What does ibnetdiscover on the host say about the coprocessors? (Unfortunately this command is not installed on the coprocessors.)
And just as a point of interest:
I was talking to Loc about his use of '/sbin/sysctl -w net.ipv4.ip_forward=1'. I prefer to set up a bridge (sections 14.4.5.4 and 14.4.5.5 in the MPSS User's Guide in the MPSS 3.4 release). It is a gentler way to handle routing of IP traffic between the coprocessors, although the sysctl command is much simpler. You don't need a bridge for IB traffic. I was surprised that turning IP forwarding on affected your IB traffic. I guess this is just one more thing I don't know about networking. (So much to learn; so little time.) Do you have IPoIB turned on?
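If it is useful, the checks listed above can be collected in one pass with a small wrapper like this sketch. run_diag is a hypothetical helper. It prints a header, runs the command if it is installed, and otherwise says so, which is handy on the coprocessors where some tools are missing.

```shell
#!/bin/sh
# Sketch: run a diagnostic command under a labeled header.
# run_diag is a hypothetical helper name.

# run_diag CMD [ARGS...]: print "== CMD ARGS ==", then the command's
# output (stdout and stderr), or a note if CMD is not installed.
run_diag() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(not installed: $1)"
    fi
}
```

For example, `run_diag ip link show` and `run_diag ibnetdiscover` on the host, with the output redirected to a file that can be attached to a post.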
Thanks, Frances. I have everything working now. This forum has helped immensely.
Mark
