I am a new Xeon Phi user, and our budget limits us to the Xeon Phi 7120P, which is less expensive. We have a few Supermicro servers with 4 PCIe x16 slots and 2 PCIe x8 slots. According to the server manual, these PCIe slots are split equally between the two CPU sockets. So I am planning to install four 7120P cards in each node, plus two Mellanox ConnectX-3 HCAs in the two PCIe x8 slots. That means each CPU socket will be connected to 1x InfiniBand HCA and 2x 7120P. Is this a supported configuration? The MPSS manual says each Xeon Phi should be connected to a separate HCA. There are 7 nodes in total in our small cluster.
The Xeon Phi cards will be used to run the Quantum Espresso package as described in this link, but the example there seems to be configured with 2 Xeon Phi per node. Can I add more Phi cards to a node? Is there any advantage in going to 4 Xeon Phi per node?
Also, is the Xeon Phi 7120P supported by MPSS 3.8 or later? Are there any examples of using Mellanox OFED with Xeon Phi?
I am reporting my progress in getting this to work on a single node with Ubuntu 16.04 Server LTS and Mellanox MLNX_OFED_LINUX-3.4-188.8.131.52-ubuntu16.04-x86_64.
The 2 ConnectX-3 HCAs are installed in slots attached to their respective CPU sockets, so when I run
sudo lspci | grep "Mellanox"
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
82:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
sudo lspci | grep "Phi"
83:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
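To confirm which CPU socket each HCA and coprocessor is attached to, the kernel exposes a numa_node file per PCI device in sysfs. Below is a minimal sketch, assuming a Linux host with sysfs mounted at /sys and PCI domain 0000. The function name pci_numa_node and its optional second argument (an alternative sysfs root) are my own and are not part of MPSS or OFED.

```shell
#!/bin/sh
# Print the NUMA node (CPU socket) a PCI device is attached to.
# $1: PCI address as printed by lspci, e.g. 83:00.0 (domain 0000 assumed)
# $2: sysfs root, defaults to /sys (hypothetical parameter, handy for testing)
pci_numa_node() {
    addr=$1
    root=${2:-/sys}
    file="$root/bus/pci/devices/0000:$addr/numa_node"
    if [ -r "$file" ]; then
        # A value of -1 means the platform did not report a node
        printf '%s numa_node=%s\n' "$addr" "$(cat "$file")"
    else
        printf '%s numa_node=unknown\n' "$addr"
        return 1
    fi
}
```

With the addresses from the lspci output above, pci_numa_node 82:00.0 and pci_numa_node 83:00.0 should report the same node if that HCA and the Phi really share a socket.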
The following step-by-step instructions are based on,
1) Install Ubuntu 16.04 LTS Server, then run sudo apt-get update, sudo apt-get upgrade, and sudo apt-get dist-upgrade to get the latest kernel
2) sudo apt-get install build-essential linux-headers-generic
3) download the MPSS 3.8.1 source tar and unpack it; the kernel module files are located at mpss-3.8.1/src/mpss-modules-3.8.1.
make
sudo make install
sudo depmod
sudo rmmod mic_host
sudo vi /etc/modprobe.d/blacklist-mic-host.conf (blacklist mic_host)
sudo modprobe mic
4) download the MPSS 3.8.1 rpm package tar and unpack it; the rpm files are located at /mpss-3.8.1,
tar xvf mpss-3.8.1-linux.tar
cd mpss-3.8.1
sudo alien --scripts *.rpm
sudo dpkg -i *.deb
sudo vi /etc/ld.so.conf.d/zz_x86_64-compat.conf (/usr/lib64)
sudo ldconfig
5) update the firmware and since my cards are 7120P C0 stepping, I have to use
sudo /usr/bin/micflash -update -device all
6) run miccheck, but first I need to link lspci into /sbin
sudo ln -s /usr/bin/lspci /sbin/lspci
sudo miccheck
sudo micinfo
7) configure the mic, but when I run
sudo micctrl --initdefaults -vv
It returns an error,
[Info] mic0: Using existing /etc/mpss/default.conf
[Info] mic0: Using existing /etc/mpss/mic0.conf
[Info] mic0: File System Base /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
[Info] mic0: MIC Family x100
[Info] mic0: MPSSVersion 3.x
[Info] mic0: Common files at /var/mpss/common
[Info] mic0: Unique files at /var/mpss/mic0
[Info] mic0: Hostname node09-mic0
[Filesys] mic0: Update /etc/hosts remove mic0
[Filesys] mic0: Update /etc/hosts remove hostmic0
[Filesys] mic0: Created /etc/network/interfaces
[Filesys] mic0: Update /var/mpss/mic0/etc/network/interfaces
[Filesys] mic0: Update /etc/hosts with 172.31.1.1 node09-mic0
[Info] mic0: Verbose mode Disabled
[Info] mic0: Linux OS image /usr/share/mpss/boot/bzImage-knightscorner System Map /usr/share/mpss/boot/bzImage-knightscorner
[Info] mic0: Boot On Start Enabled
[Info] mic0: Shutdown Timeout 300
[Info] mic0: MIC Crash Dump at /var/crash/mic size 16
[Error] mic0: Create failed for /etc/ssh/ rsa1 keys: Operation not permitted
[Info] mic0: ExtraCommandLine 'highres=off noautogroup'
[Info] mic0: RootDevice RamFS /var/mpss/mic0.image.gz
[Info] mic0: Console hvc0
[Info] mic0: PowerManagement cpufreq_on;corec6_on;pc3_on;pc6_on
[Info] mic0: Cgroup memory=disabled
[Info] mic0: [Parse] /etc/mpss/mic0.conf
[Info] mic0: [Parse] Configuration version 1.1
[Info] mic0: [Parse] /etc/mpss/default.conf
[Filesys] mic0: Update /var/mpss/mic0/etc/hosts
It appears that micctrl is not able to create the ssh key. Could that be because Ubuntu does not use su by default? So I tried
root@node09:~# ssh mic0
The authenticity of host 'mic0 (172.31.1.1)' can't be established.
ECDSA key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'mic0,172.31.1.1' (ECDSA) to the list of known hosts.
[root@node09-mic0 ~]# uname -a
Linux node09-mic0 184.108.40.206+mpss3.8.1 #1 SMP Thu Jan 12 16:10:30 EST 2017 k1om GNU/Linux
It appears to be working.
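The single [Error] line is easy to miss in the verbose micctrl output, so a small filter can help when repeating the setup on other nodes. This is only a sketch: the function name micctrl_errors is hypothetical, and it assumes error lines carry the literal [Error] tag exactly as in the log above.

```shell
#!/bin/sh
# Read micctrl -vv output on stdin and print only the lines tagged [Error].
# Returns 0 when no such line is present, 1 otherwise.
micctrl_errors() {
    # -F treats the pattern as a fixed string, so the brackets are literal
    if grep -F '[Error]'; then
        return 1
    fi
    return 0
}
```

For example: sudo micctrl --initdefaults -vv 2>&1 | micctrl_errors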
I also noticed that the file /etc/mpss/mpss.ubuntu has to be modified in order to load mpssd at startup; the 1st line has to change to
sudo cp /etc/mpss/mpss.ubuntu /etc/init.d/mpss
sudo update-rc.d mpss defaults 99 10
Reboot and lspci | grep "mic" now shows mic is loaded at boot! :)
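Since the same setup has to be repeated on the remaining nodes, the blacklist entry from step 3 can be written without opening an editor. A minimal sketch, with a hypothetical function name write_blacklist and a hypothetical optional directory argument; the file name follows the one used in step 3.

```shell
#!/bin/sh
# Write a modprobe blacklist entry without opening an editor.
# $1: module name to blacklist, e.g. mic_host
# $2: target directory, defaults to /etc/modprobe.d (hypothetical parameter)
write_blacklist() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    # File name mirrors the one from step 3: underscores become hyphens
    name=$(printf '%s' "$module" | tr '_' '-')
    printf 'blacklist %s\n' "$module" > "$dir/blacklist-$name.conf"
}
```

From a root shell, write_blacklist mic_host creates /etc/modprobe.d/blacklist-mic-host.conf containing the single line blacklist mic_host.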
qeuser@node09:~$ sudo miccheck
MicCheck 3.8.1-1
Copyright (c) 2016, Intel Corporation.
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass
Status: OK
qeuser@node09:~$ sudo micinfo
MicInfo Utility Log
Created Sat Feb 25 22:49:44 2017

    System Info
        HOST OS : Linux
        OS Version : 4.4.0-64-generic
        Driver Version : 3.8.1-1
        MPSS Version : NotAvailable
        Host Physical Memory : 257836 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version : 2.1.02.0391
        SMC Firmware Version : 1.17.6900
        SMC Boot Loader Version : 1.8.4326
        Coprocessor OS Version : 220.127.116.11+mpss3.8.1
        Device Serial Number : ADKC33400518

    Board
        Vendor ID : 0x8086
        Device ID : 0x225c
        Subsystem ID : 0x7d95
        Coprocessor Stepping ID : 2
        PCIe Width : x16
        PCIe Speed : 5 GT/s
        PCIe Max payload size : 256 bytes
        PCIe Max read req size : 512 bytes
        Coprocessor Model : 0x01
        Coprocessor Model Ext : 0x00
        Coprocessor Type : 0x00
        Coprocessor Family : 0x0b
        Coprocessor Family Ext : 0x00
        Coprocessor Stepping : C0
        Board SKU : C0PRQ-7120 P/A/X/D
        ECC Mode : Enabled
        SMC HW Revision : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage : 995000 uV
        Frequency : 1238095 kHz

    Thermal
        Fan Speed Control : N/A
        Fan RPM : N/A
        Fan PWM : N/A
        Die Temp : 51 C

    GDDR
        GDDR Vendor : Samsung
        GDDR Version : 0x6
        GDDR Density : 4096 Mb
        GDDR Size : 15872 MB
        GDDR Technology : GDDR5
        GDDR Speed : 5.500000 GT/s
        GDDR Frequency : 2750000 kHz
        GDDR Voltage : 1501000 uV
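With up to 28 cards to check, it may be handy to pull single fields such as the stepping or flash version out of the micinfo report instead of reading it by eye. A minimal sketch, assuming micinfo prints one "Name : Value" pair per line; the helper name micinfo_field is hypothetical.

```shell
#!/bin/sh
# Read micinfo output on stdin and print the value of one "Name : Value" field.
# $1: exact field name, e.g. "Flash Version" or "Coprocessor Stepping"
micinfo_field() {
    awk -v key="$1" '
        {
            pos = index($0, " : ")      # separator between name and value
            if (pos == 0) next          # section headers have no separator
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 3)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}
```

For example: sudo micinfo | micinfo_field "Coprocessor Stepping"
The name is matched exactly, so "Coprocessor Stepping" and "Coprocessor Stepping ID" are treated as different fields.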
Now it appears to be working. My next step is to configure IPoIB with MPSS.
Allow me to further explain my situation.
I was informed by Dr. Fabio, a developer of QE, that future QE on Xeon Phi will implement the native and symmetric modes. So I need to set up passwordless ssh to each 7120P and back to the host. The cluster will have a shared /home directory accessible to all nodes and all 7120P cards (max. 28 of them).
I have successfully set up passwordless ssh to the Phi but NOT back to the host with the same user account. May I know why?
qeuser@node09:~$ ssh mic0
[qeuser@node09-mic0 ~]$ ssh host
qeuser@host's password:
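One thing worth checking (an assumption on my part; nothing in this thread confirms the cause): sshd on the host only accepts a key that is listed in the user's own ~/.ssh/authorized_keys on the host, so logging in from the card with the same key pair requires the user's public key to be authorized for the user's own account. The sketch below adds it if it is missing. The function name authorize_own_key, its arguments, and the id_rsa.pub default are hypothetical.

```shell
#!/bin/sh
# Append a public key to authorized_keys in the same directory unless it is
# already listed, then apply the permissions sshd expects by default.
# $1: ssh directory, defaults to $HOME/.ssh
# $2: public key file name, defaults to id_rsa.pub (assumed key type)
authorize_own_key() {
    dir=${1:-$HOME/.ssh}
    pub="$dir/${2:-id_rsa.pub}"
    auth="$dir/authorized_keys"
    [ -r "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    touch "$auth"
    # -x: whole-line match, -F: fixed strings, -f: take patterns from the key file
    if ! grep -qxF -f "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 700 "$dir"
    chmod 600 "$auth"
}
```

Run authorize_own_key as qeuser on the host, then retry ssh host from the card. If the password prompt persists, ssh -v host on the card shows which keys are actually being offered.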
Thank you for all the work you did trying to get this running. I am a total noob and in way over my head, but I want to make this work, as it is a challenge that is driving my curiosity.
Is CentOS 7 a better option overall?
Are the steps that you outlined still valid when using CentOS 7?
I doubt this is possible, but I would love to talk over the phone or by email. If not, I would love to continue on here.
I am running a Mac Pro 4,1 flashed to 5,1 with dual Intel Xeon E5520s (8 cores in total), and I recently bought the Xeon Phi coprocessor 71S1P.
I have a virtual machine running CentOS 7 in VirtualBox.
Any help or resources for getting this thing up and running would be greatly appreciated!