Hi all,
I'm trying to install the MPSS, but I can't get past the first steps: when I try to flash-update the coprocessor, the initialization fails. Additionally, when I use micctrl to check the status of the Phi, I get the following error: "FATAL: Module mic not found."
Any suggestions?
Thanks in advance, Chris
This usually occurs when your Linux kernel is newer than the kernel for which the MIC module was compiled. To fix this, recompile the MIC kernel module by following the instructions in Section 9.1, "Recompiling the Host Driver", of the file readme-en.txt.
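One way to test this diagnosis before rebuilding is to compare the running kernel release with the release recorded in the module's vermagic string. The following is only a sketch: the function name is hypothetical, and it assumes you can locate the installed mic.ko file yourself (for example with find), since modprobe cannot find it.

```shell
# Return 0 if a module was built for the running kernel.
#   $1: running kernel release, as printed by `uname -r`
#   $2: module vermagic string, as printed by `modinfo -F vermagic <module or .ko path>`
built_for_running_kernel() {
    running=$1
    # The first space-separated field of vermagic is the kernel release.
    built_for=${2%% *}
    [ "$running" = "$built_for" ]
}

# Possible use on the host (the search path is an assumption):
#   for ko in $(find /lib/modules -name 'mic.ko*'); do
#       built_for_running_kernel "$(uname -r)" "$(modinfo -F vermagic "$ko")" \
#           || echo "$ko was built for a different kernel"
#   done
```

If the releases differ, the module will not load on the running kernel, and rebuilding the host driver as described in the readme is the next step.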
Have you started up the mpss service? If this is the very first time you have installed the MPSS and you have not rebooted since you did the install, it is possible that the mic kernel module is not yet loaded into the kernel. You can check this using the command:
lsmod | grep mic
If you don't see 'mic' (not micro) show up, try starting the mpss service:
service mpss start
Hopefully this is all that is wrong. The readme.txt file that comes with the MPSS is not very exciting reading but it is essential. For an idea of some of the things that happen behind the scenes, you can see my admin guide. I am not an admin but this has information on some of the things I ran into doing the install myself.
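Because a plain grep for "mic" also matches unrelated modules such as microcode, the check above can be made exact by comparing the first column of the lsmod output. This is a minimal sketch; the function name is hypothetical.

```shell
# Return 0 if a module named exactly $1 appears in `lsmod` output read from stdin.
# Comparing the whole first column avoids false hits such as "microcode".
module_loaded() {
    # NR > 1 skips the lsmod header line.
    awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Possible use on the host:
#   if lsmod | module_loaded mic; then
#       echo "mic module is loaded"
#   else
#       echo "mic module is not loaded; try: service mpss start"
#   fi
```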
Argh! I missed the sentence about the flash update having failed. The problem is farther back than the mic kernel module. Perhaps I should have started with some questions: What version of the operating system are you using on the host? What version of the MPSS? Is this indeed your first install? Exactly what error message did you get when you tried to update the flash?
Thanks for the reply. We're using openSUSE 12.3, and the MPSS is KNC gold 2.1-4346-16. The flash update,
$ /opt/intel/mic/bin/micflash -Update /opt/intel/mic/flash/
returns:
VERSION: Copyright 2011-2012 Intel Corporation All Rights Reserved.
VERSION: 4346-16
Intel(R) Xeon Phi(TM) Coprocessor stack initialization failed
I'm not sure exactly how openSUSE releases map to regular SUSE releases, although from looking around, I gather that openSUSE 12.3 is using a later version of Linux than SUSE 11 SP 2. But two things that are required before you can install the MPSS on any SUSE system are: 1) you must edit /etc/modprobe.d/unsupported-modules and set allow_unsupported_modules to 1; 2) you must disable SELinux before installing the MPSS. Both of these seem like good candidates for causing the problems you are seeing.
If you did not do both of these things before installing the MPSS, could I ask you to uninstall the MPSS, make these two changes to your host, then reinstall the MPSS? It is necessary to uninstall the MPSS before you try installing it again. You can find the directions for uninstalling the MPSS in the readme file that comes with the MPSS.
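Both prerequisites can be checked from a shell before reinstalling. The sketch below makes two assumptions that are not stated in this thread: that the setting is written in the file as a line of the form "allow_unsupported_modules 1", and that the getenforce utility is installed on the host. The function names are hypothetical.

```shell
# Return 0 if the modprobe config file $1 sets allow_unsupported_modules to 1.
# Assumes the setting is written as "allow_unsupported_modules 1" on its own line.
unsupported_modules_allowed() {
    grep -Eq '^[[:space:]]*allow_unsupported_modules[[:space:]]+1[[:space:]]*$' "$1"
}

# Return 0 if the SELinux mode string $1 (as printed by `getenforce`) is "Disabled".
selinux_disabled() {
    [ "$1" = "Disabled" ]
}

# Possible use on the host:
#   unsupported_modules_allowed /etc/modprobe.d/unsupported-modules \
#       || echo "allow_unsupported_modules is not set to 1"
#   selinux_disabled "$(getenforce)" || echo "SELinux is not disabled"
```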
Hi Frances, thanks for the reply. We've actually upgraded to RHEL 6.3 (we'd tried openSUSE while we were waiting for the order to go through). I've been sure to disable SELinux, but we're still having trouble with the install:
micctrl --initdefaults
returns:
No MIC cards found in the system
The MIC driver has been determined to be loaded. Use the
'lspci' utility to verify cards are installed.
Then, lspci | grep 2250 returns
06:00.0 Co-processor: Intel Corporation Device 2250 (rev ff)
so the device is there. Any suggestions?
Thanks, Chris
OK, so you have now switched from openSUSE to RHEL 6.3. The kernel module is now loading (you would have ended up needing to rebuild it if you were still using openSUSE), but the micctrl command does not recognize the card as being an Intel(r) Xeon Phi(tm) coprocessor. In the past when this has happened, the problem was that the BIOS for the host did not provide large BAR support. Can you check the system log for your host and see if you have any error messages about BAR allocation?
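A quick way to pull such messages out of the system log is a filter like the one below. It is only a sketch: the function name is hypothetical, and the patterns are limited to wordings that appear in kernel messages quoted elsewhere in this thread, so other kernel versions may phrase them differently.

```shell
# Print kernel log lines (read from stdin) that report PCI BAR allocation trouble.
# Patterns cover: "BAR n: can't assign ...", "BAR n: ... has bogus alignment",
# and "... because of BAR n ... collisions".
bar_problems() {
    grep -E "BAR [0-9]+.*(can't assign|bogus alignment|collisions)"
}

# Possible use on the host (log location varies by distribution):
#   bar_problems < /var/log/messages
#   dmesg | bar_problems
```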
Hi Frances, the following messages were at the end of /var/log/messages:
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfbf00000-0xfbffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfc000000-0xfcffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfd000000-0xfdffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfe000000-0xfebfffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfec8a000-0xfec8afff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:01: disabling [mem 0xfed10000-0xfed10fff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0a: disabling [mem 0xfed1c000-0xfed1ffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0a: disabling [mem 0xfed20000-0xfed3ffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0a: disabling [mem 0xfed40000-0xfed8ffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0c: disabling [mem 0xfec00000-0xfec00fff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0c: disabling [mem 0xfee00000-0xfee00fff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0d: disabling [mem 0xe0000000-0xefffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0e: disabling [mem 0x00000000-0x0009ffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0e: disabling [mem 0x000c0000-0x000cffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0e: disabling [mem 0x000e0000-0x000fffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0e: disabling [mem 0x00100000-0xbfffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pnp 00:0e: disabling [mem 0xfed90000-0xffffffff] because it overlaps 0000:06:00.0 BAR 0 [mem 0x00000000-0x1ffffffff 64bit pref]
May 14 15:34:56 goliath kernel: pci 0000:00:03.0: BAR 15: can't assign mem pref (size 0x200000000)
May 14 15:34:56 goliath kernel: pci 0000:00:1c.5: BAR 15: assigned [mem 0xc0000000-0xc01fffff 64bit pref]
May 14 15:34:56 goliath kernel: pci 0000:00:1c.4: BAR 15: assigned [mem 0xc0200000-0xc03fffff 64bit pref]
May 14 15:34:56 goliath kernel: pci 0000:00:1c.0: BAR 14: assigned [mem 0xc0400000-0xc05fffff]
May 14 15:34:56 goliath kernel: pci 0000:00:1c.0: BAR 15: assigned [mem 0xc0600000-0xc07fffff 64bit pref]
May 14 15:34:56 goliath kernel: pci 0000:00:1c.0: BAR 13: assigned [io 0x1000-0x1fff]
May 14 15:34:56 goliath kernel: pci 0000:06:00.0: BAR 0: [mem 0x00000000-0x1ffffffff 64bit pref] has bogus alignment
May 14 15:34:56 goliath kernel: pci 0000:06:00.0: BAR 4: assigned [mem 0xfac00000-0xfac1ffff 64bit]
May 14 15:34:56 goliath kernel: pci 0000:06:00.0: BAR 4: set to [mem 0xfac00000-0xfac1ffff 64bit] (PCI address [0xfac00000-0xfac1ffff]
So the solution, from the MPSS readme-en.txt file:
<blockquote>
In order for Intel(R) Xeon Phi(TM) coprocessors to function properly in a platform, BIOS and OS support for large (8GB+) Memory Mapped I/O Base Address Registers (MMIO BAR's) above the 4GB address limit must be enabled. By default, most platform BIOS implementations have this set to disabled, therefore it must be enabled manually in the platform BIOS setup. Contact your platform and/or BIOS vendor to determine whether changing this setting applies for the platform being used.
</blockquote>
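To see why this applies here, the size of the coprocessor's BAR 0 can be worked out from the address range printed in the kernel log. The helper below is a small illustrative sketch (the function name is hypothetical); it uses only shell arithmetic.

```shell
# Print the size in GiB of a memory range, given its first and last address
# as 0x-prefixed hexadecimal numbers.
bar_size_gib() {
    echo $(( ($2 - $1 + 1) / 1024 / 1024 / 1024 ))
}

# BAR 0 range reported for device 0000:06:00.0 in the log above.
bar_size_gib 0x00000000 0x1ffffffff   # prints 8
```

An 8 GiB region cannot fit below the 4 GB address limit, so the BIOS has to be able to map it above that limit.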
Hi, I saw a similar problem on my host, but with the error messages below. Any thoughts? The MIC on this host worked well before.
Jul 3 04:05:06 kernel: vnet: mode: dma, buffers: 62
Jul 3 04:05:06 kernel: mic 0000:2a:00.0: device not available because of BAR 0 [0x000000-0x1ffffffff] collisions
Jul 3 04:05:06 kernel: pci_enable failed board #0
Jul 3 04:05:06 kernel: mic: probe of 0000:2a:00.0 failed with error -22
Jul 3 04:05:06 kernel: mic 0000:90:00.0: device not available because of BAR 0 [0x000000-0x1ffffffff] collisions
Jul 3 04:05:06 kernel: pci_enable failed board #0
Jul 3 04:05:06 kernel: mic: probe of 0000:90:00.0 failed with error -22
Jul 3 04:05:06 kernel: mic: No MIC boards present. SCIF available in loopback mode
One update: my machine worked well originally, but failed when I tried to run some MIC tests.
I have now found that two of my Xeon nodes with MIC cards, both of which worked well before, report error messages like the following, and the MIC cannot be started. Is there anything I can do to fix the issue?
Jul 3 04:05:06 kernel: mic 0000:90:00.0: device not available because of BAR 0 [0x000000-0x1ffffffff] collisions
Jul 3 04:05:06 kernel: pci_enable failed board #0
Jul 3 04:05:06 kernel: mic: probe of 0000:90:00.0 failed with error -22
Jul 3 04:05:06 kernel: mic: No MIC boards present. SCIF available in loopback mode
I reinstalled my host node, but it still has the same issue. I think the problem is with the MIC card. What can I do?
Is this problem solved now?
I still get the same error:
mic 0000:83:00.0: device not available because of BAR 0 [0x000000-0x3ffffffff] collisions
mic: probe of 0000:83:00.0 failed with error -22
mic: No MIC boards present. SCIF available in loopback mode
The only case I know of where you get this error message is when large BAR support is not enabled in the host's BIOS. If you have been running successfully on your coprocessor cards before, then large BAR support must have been enabled at that time. Could you double-check and make sure it is still enabled? (I'm not quite sure what might cause it to become disabled after you have been running with it enabled, but it won't hurt to check again.) If the BAR looks OK, could you run the micdebug.sh script and send us the output? You can send it as a private message, using the "Send Author A Message" link on this comment.
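When comparing several hosts, it can help to reduce the kernel log to just the cards the driver rejected. The sketch below assumes the message wording shown in the logs posted in this thread ("mic: probe of <address> failed with error <code>"); the function name is hypothetical.

```shell
# Read kernel log lines on stdin and print "<PCI address> <error code>"
# for every failed probe reported by the mic driver.
failed_mic_probes() {
    sed -n 's/.*mic: probe of \([0-9a-fA-F:.]*\) failed with error \(-\{0,1\}[0-9]*\).*/\1 \2/p'
}

# Possible use on the host:
#   dmesg | failed_mic_probes
```

Error -22 is the kernel's EINVAL code, which by itself says little; the accompanying "BAR 0 ... collisions" line is the more informative one.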
Hi! I have had the same error:
# micctrl --load-modules
modprobe: WARNING: Module mic_x200_dma not found.
modprobe: ERROR: could not insert 'scif_bus': Unknown symbol in module, or unknown parameter
[Error] cannot load kernel modules
and # miccheck
Test 1: Check required drivers are loaded ... fail
The suggestion to rebuild the MPSS host driver was really helpful; see "Intel® Manycore Platform Software Stack (Intel® MPSS)", Section D.3, “Rebuilding Intel® MPSS host driver (optional)”.
Note that I had to force a reinstall of the packages (yum reinstall), because before that yum reported “no changes”.
The Readme file included with the MPSS is also very useful. Just follow it step by step.
Thank you all!
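The "Module mic_x200_dma not found" warning above means modprobe could not find a module file of that name for the running kernel. A check along these lines can confirm whether a rebuild or reinstall actually put the file in place. It is a sketch only; the function name is hypothetical, and the module directory layout is the usual Linux one, not something taken from the MPSS documentation.

```shell
# Return 0 if a module file named $2 (.ko, optionally compressed)
# exists anywhere under directory $1.
module_file_exists() {
    [ -n "$(find "$1" -name "$2.ko*" 2>/dev/null | head -n 1)" ]
}

# Possible use on the host:
#   module_file_exists "/lib/modules/$(uname -r)" mic_x200_dma \
#       || echo "mic_x200_dma is not installed for the running kernel"
```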