Rolly_N_
Beginner
409 Views

How to set up 4 Xeon Phi 7120P in a single node via InfiniBand?

Dear all,

I am a new Xeon Phi user, and our budget limits us to the Xeon Phi 7120P, which is less expensive. We have a few Supermicro servers, each with 4 PCIe x16 slots and 2 PCIe x8 slots. According to the server manual, these PCIe slots are shared equally between the two CPU sockets. I am therefore trying to install four 7120P cards in each node, with two Mellanox ConnectX-3 HCAs in the two PCIe x8 slots. That means each CPU socket will be connected to one InfiniBand HCA and two 7120P cards. Is this a supported configuration? The MPSS manual says that each Xeon Phi should be connected to a separate HCA. Our small cluster has 7 nodes in total.

The Xeon Phi cards will be used to run the Quantum Espresso package as described in the link below. The example there appears to be configured with 2 Xeon Phi cards per node. Can I add more cards to a node, and is there any advantage in going to 4 Xeon Phi cards per node?

https://software.intel.com/en-us/articles/explicit-offload-for-quantum-espresso

Also, is the Xeon Phi 7120P supported by MPSS 3.8 or later? Are there any examples of using Mellanox OFED with Xeon Phi?

Thank you,

Rolly

4 Replies
Rolly_N_

Dear all,

I am reporting my progress in getting this to work on a single node with Ubuntu 16.04 LTS Server and Mellanox MLNX_OFED_LINUX-3.4-2.0.0.0-ubuntu16.04-x86_64.

Two ConnectX-3 HCAs are installed, one in a slot belonging to each CPU socket, so when I run

sudo lspci | grep "Mellanox"

It returns,

01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

82:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

sudo lspci | grep "Phi"

It returns,

83:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
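
To check which CPU socket each HCA and coprocessor actually hangs off, the kernel's per-device numa_node entry in sysfs can be listed next to the PCI address. This is only a minimal sketch: the function name pci_numa_map is made up for illustration, and the 0000: domain prefix in the usage comment is an assumption about this machine. A value of -1 means the kernel does not know the node.

```shell
#!/usr/bin/env bash
# Print each PCI device address with the NUMA node the kernel reports for it.
# The sysfs root can be overridden (first argument) to try this on a fake tree.
pci_numa_map() {
    local root="${1:-/sys/bus/pci/devices}"
    local dev addr node
    for dev in "$root"/*; do
        # Skip entries that do not expose a numa_node file.
        [ -r "$dev/numa_node" ] || continue
        addr=$(basename "$dev")
        node=$(cat "$dev/numa_node")
        printf '%s numa_node=%s\n' "$addr" "$node"
    done
}

# Usage (assumes PCI domain 0000 and the bus numbers from the lspci output above):
# pci_numa_map | grep -E '^0000:(01|82|83):00\.0 '
```

If the HCA at 82:00.0 and the coprocessor at 83:00.0 report the same node, they sit on the same socket, which is the pairing described in the original question.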

The following steps are based on

http://arrayfire.com/getting-started-with-the-intel-xeon-phi-on-ubuntu-14-04linux-kernel-3-13-0/

https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/532329#comment-1815945

1) Install Ubuntu 16.04 LTS Server, then run sudo apt-get update, sudo apt-get upgrade and sudo apt-get dist-upgrade to get the latest kernel.

2) sudo apt-get install build-essential linux-headers-generic

3) Download the MPSS 3.8.1 source tarball and unpack it. The kernel module sources are located in mpss-3.8.1/src/mpss-modules-3.8.1. In that directory, run

make
sudo make install
sudo depmod
sudo rmmod mic_host
sudo vi /etc/modprobe.d/blacklist-mic-host.conf (add the line: blacklist mic_host)
sudo modprobe mic
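
The blacklist entry can also be written without opening an editor. A minimal sketch, where the function name write_mic_host_blacklist is hypothetical and the default path is the one used in the step above:

```shell
# Write the single blacklist line for mic_host.
# Pass a different path as the first argument to try it without touching /etc.
write_mic_host_blacklist() {
    local conf="${1:-/etc/modprobe.d/blacklist-mic-host.conf}"
    printf 'blacklist mic_host\n' > "$conf"
}

# Usage from a root shell:
# write_mic_host_blacklist
```

Note that this overwrites the file, which is fine here because the file holds only this one line.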

4) Download the MPSS 3.8.1 RPM package tarball and unpack it. The RPM files are located in mpss-3.8.1. Convert them to .deb packages with alien and install them:

tar xvf mpss-3.8.1-linux.tar
cd mpss-3.8.1
sudo alien --scripts *.rpm
sudo dpkg -i *.deb
sudo vi /etc/ld.so.conf.d/zz_x86_64-compat.conf (add the line: /usr/lib64)
sudo ldconfig

5) Update the firmware. Since my cards are 7120P with C0 stepping, I have to use

sudo /usr/bin/micflash -update -device all

6) Run miccheck. It needs lspci to be reachable under /sbin, so create a link first:

sudo ln -s /usr/bin/lspci  /sbin/lspci
sudo miccheck
sudo micinfo

7) Configure the card. When I run

sudo micctrl --initdefaults -vv

it returns an error:

   [Info] mic0: Using existing /etc/mpss/default.conf
   [Info] mic0: Using existing /etc/mpss/mic0.conf
   [Info] mic0: File System Base /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
   [Info] mic0: MIC Family x100
   [Info] mic0: MPSSVersion 3.x
   [Info] mic0: Common files at /var/mpss/common
   [Info] mic0: Unique files at /var/mpss/mic0
   [Info] mic0: Hostname node09-mic0
[Filesys] mic0: Update /etc/hosts remove mic0
[Filesys] mic0: Update /etc/hosts remove hostmic0
[Filesys] mic0: Created /etc/network/interfaces
[Filesys] mic0: Update /var/mpss/mic0/etc/network/interfaces
[Filesys] mic0: Update /etc/hosts with 172.31.1.1 node09-mic0
   [Info] mic0: Verbose mode Disabled
   [Info] mic0: Linux OS image /usr/share/mpss/boot/bzImage-knightscorner
                System Map /usr/share/mpss/boot/bzImage-knightscorner
   [Info] mic0: Boot On Start Enabled
   [Info] mic0: Shutdown Timeout 300
   [Info] mic0: MIC Crash Dump at /var/crash/mic size 16
  [Error] mic0: Create failed for /etc/ssh/ rsa1 keys: Operation not permitted
   [Info] mic0: ExtraCommandLine 'highres=off noautogroup'
   [Info] mic0: RootDevice RamFS /var/mpss/mic0.image.gz
   [Info] mic0: Console hvc0
   [Info] mic0: PowerManagement cpufreq_on;corec6_on;pc3_on;pc6_on
   [Info] mic0: Cgroup memory=disabled
   [Info] mic0: [Parse] /etc/mpss/mic0.conf
   [Info] mic0: [Parse] Configuration version 1.1
   [Info] mic0: [Parse] /etc/mpss/default.conf
[Filesys] mic0: Update /var/mpss/mic0/etc/hosts
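
In a verbose log like the one above, the single [Error] line is easy to miss among the [Info] and [Filesys] lines. A small filter can pull it out; this is a sketch only, and the function name micctrl_errors is invented for illustration.

```shell
# Read micctrl output on stdin and print only the lines tagged [Error].
# Leading spaces are allowed because micctrl right-aligns its tags.
micctrl_errors() {
    grep -E '^[[:space:]]*\[Error\]'
}

# Usage:
# sudo micctrl --initdefaults -vv 2>&1 | micctrl_errors
```

Applied to the log above, the only line it prints is the rsa1 key creation failure.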

It appears that micctrl is not able to create the SSH key. This may be because Ubuntu does not use su by default, so I tried a root shell:

sudo bash

Now

root@node09:~# ssh mic0
The authenticity of host 'mic0 (172.31.1.1)' can't be established.
ECDSA key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'mic0,172.31.1.1' (ECDSA) to the list of known hosts.

[root@node09-mic0 ~]# uname -a
Linux node09-mic0 2.6.38.8+mpss3.8.1 #1 SMP Thu Jan 12 16:10:30 EST 2017 k1om GNU/Linux

It appears to be working.

I also noticed that the file /etc/mpss/mpss.ubuntu has to be modified in order to start mpssd at boot. The first line has to be changed to

#!/bin/bash -e

Then, 

sudo cp /etc/mpss/mpss.ubuntu /etc/init.d/mpss

sudo update-rc.d mpss defaults 99 10
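
The first-line change described above can be scripted with GNU sed rather than done by hand. A minimal sketch, with a hypothetical function name set_bash_shebang; it only rewrites line 1 if that line already starts with #!, since the original content of that line is not shown in this post.

```shell
# Replace an existing shebang on line 1 with "#!/bin/bash -e" (GNU sed, in place).
set_bash_shebang() {
    local script="$1"
    sed -i '1s|^#!.*$|#!/bin/bash -e|' "$script"
}

# Usage from a root shell, before copying the file to /etc/init.d/mpss:
# set_bash_shebang /etc/mpss/mpss.ubuntu
```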

After a reboot, lspci | grep "mic" now shows that mic is loaded at boot! :)

Now,

qeuser@node09:~$ sudo miccheck
MicCheck 3.8.1-1
Copyright (c) 2016, Intel Corporation.

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass

Status: OK
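
With up to four cards per node and seven nodes, reading every miccheck report by eye gets tedious. The sketch below counts the "Test N" lines and how many end in "pass". The function name miccheck_summary is made up, and it assumes the result is always the last word on the line, as in the output above.

```shell
# Summarise miccheck output read from stdin as "<passed>/<total> tests passed".
miccheck_summary() {
    awk '
        /^[ \t]*Test [0-9]+/ {
            total++
            if ($NF == "pass") passed++
        }
        END { printf "%d/%d tests passed\n", passed, total }
    '
}

# Usage:
# sudo miccheck | miccheck_summary
```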

and,

qeuser@node09:~$ sudo micinfo
MicInfo Utility Log
Created Sat Feb 25 22:49:44 2017


	System Info
		HOST OS			: Linux
		OS Version		: 4.4.0-64-generic
		Driver Version		: 3.8.1-1
		MPSS Version		: NotAvailable
		Host Physical Memory	: 257836 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : 2.1.02.0391
		SMC Firmware Version	 : 1.17.6900
		SMC Boot Loader Version	 : 1.8.4326
		Coprocessor OS Version 	 : 2.6.38.8+mpss3.8.1
		Device Serial Number 	 : ADKC33400518

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x225c
		Subsystem ID 		 : 0x7d95
		Coprocessor Stepping ID	 : 2
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 256 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : C0
		Board SKU 		 : C0PRQ-7120 P/A/X/D
		ECC Mode 		 : Enabled
		SMC HW Revision 	 : Product 300W Passive CS

	Cores
		Total No of Active Cores : 61
		Voltage 		 : 995000 uV
		Frequency		 : 1238095 kHz

	Thermal
		Fan Speed Control 	 : N/A
		Fan RPM 		 : N/A
		Fan PWM 		 : N/A
		Die Temp		 : 51 C

	GDDR
		GDDR Vendor		 : Samsung
		GDDR Version		 : 0x6
		GDDR Density		 : 4096 Mb
		GDDR Size		 : 15872 MB
		GDDR Technology		 : GDDR5 
		GDDR Speed		 : 5.500000 GT/s 
		GDDR Frequency		 : 2750000 kHz
		GDDR Voltage		 : 1501000 uV

Now it appears to be working. My next step is to configure IPoIB with MPSS.
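
When comparing several cards, it helps to pull single values such as the flash version or stepping out of the micinfo report. This is a sketch under the assumption that every field is printed as "name : value" on one line, as in the output above; the function name micinfo_field is hypothetical.

```shell
# Print the value of one named field from micinfo output read on stdin.
# The name must match exactly, so "Coprocessor Stepping" does not match
# "Coprocessor Stepping ID".
micinfo_field() {
    local name="$1"
    awk -F':' -v name="$name" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            if (key == name) {
                # Take everything after the first colon as the value.
                val = substr($0, index($0, ":") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
                exit
            }
        }
    '
}

# Usage:
# sudo micinfo | micinfo_field "Flash Version"
```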

Rolly_N_

Hello all,

Allow me to further explain my situation. 

I was informed by Dr. Fabio, a developer of QE, that future versions of QE on Xeon Phi will implement the native and symmetric modes. So I need to set up passwordless ssh to each 7120P and back to the host. The cluster will have a shared /home directory accessible from all nodes and all 7120P cards (28 of them at most).

I have successfully set up passwordless ssh to the Phi, but NOT back to the host with the same user account. May I know why?
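
For symmetric-mode runs, an MPI machine file listing every card is usually needed. The sketch below generates the 28 names for 7 nodes with 4 cards each, following the nodeNN-micN pattern seen in node09-mic0. The numbering node01 to node07 and the function name gen_mic_hosts are assumptions for illustration; the real node names may differ.

```shell
# Print one coprocessor hostname per line for <nodes> hosts with <cards> cards each.
# Assumes hosts are named node01, node02, ... (hypothetical numbering).
gen_mic_hosts() {
    local nodes="$1" cards="$2" n c
    for n in $(seq 1 "$nodes"); do
        for c in $(seq 0 $((cards - 1))); do
            printf 'node%02d-mic%d\n' "$n" "$c"
        done
    done
}

# Usage:
# gen_mic_hosts 7 4 > mic_hosts.txt
```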

qeuser@node09:~$ ssh mic0
[qeuser@node09-mic0 ~]$ ssh host
qeuser@host's password: 
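
One common reason for a password prompt on the way back is that the host account's own public key is missing from its authorized_keys file. The sketch below appends it if it is not there yet. It rests on assumptions that may not hold in this setup: that the key pair is ~/.ssh/id_rsa, that the same private key is present in the account on the card, and that sshd on the host allows public key logins. The function name authorize_own_key is hypothetical.

```shell
# Append the user's own public key to authorized_keys unless it is already listed.
# The home directory can be overridden (first argument) for a dry run.
authorize_own_key() {
    local home_dir="${1:-$HOME}"
    local pub="$home_dir/.ssh/id_rsa.pub"      # assumed key name
    local auth="$home_dir/.ssh/authorized_keys"
    [ -r "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    touch "$auth"
    # -x: whole-line match, -F: fixed string, -f: take the pattern from the key file.
    grep -qxFf "$pub" "$auth" || cat "$pub" >> "$auth"
    # sshd ignores key files with loose permissions under its default StrictModes.
    chmod 700 "$home_dir/.ssh"
    chmod 600 "$auth"
}

# Usage on the host, as qeuser:
# authorize_own_key
```

If the prompt persists, running ssh -v host from the card shows which keys are offered and why they are rejected.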

Thanks,

Rolly

Rolly_N_

Hello all,

After some testing, I gave up on Ubuntu 16.04 and switched to CentOS 7 for installation of the Phis.

Thanks for your attention.

Rolly

Cavazos__Andres
Beginner

Hello,

Thank you for all the work that you did trying to get this running. I am a total noob and in way over my head, but I want to make this work as it is a challenge that is driving my curiosity. 

Is CentOS 7 a better option overall?

Are the steps that you outlined still valid when using CentOS 7?

I doubt this is possible, but I would love to talk over the phone or by email. If not, I would love to continue on here.

I am running a Mac Pro 4,1 flashed to 5,1 with dual Intel Xeon E5520s (8 cores) and I have recently bought a Xeon Phi coprocessor 71S1P.

I have a virtual machine running CentOS 7 in VirtualBox.

Any help or resources for getting this thing up and running would be greatly appreciated!

Thank you,

Andy Cavazos

andres.cavazos@me.com