Since MPSS 3.2, the MICs do not boot reliably; the following error appears on the console: "Initramfs unpacking failed: junk in compressed archive"
There are two 7120 cards in the host, and the boot failure occurs quite often. If I restart the mpss service, the problem may disappear and both cards work fine. After another mpss restart, the "Initramfs unpacking failed" error suddenly appears in the logs again. Either mic0 or mic1 may fail, never both.
If I enable verbose logging, the "Initramfs unpacking failed" error message disappears, but the problem does not.
[ 82.940477] System halted.
[ 82.942345] mic_shutdown: system state 2 dbreg 0x80000002
[ 0.000000] SFI: Entering sfi_map_memory, phys = e0000, size = 131071
[ 0.000000] SFI: Entering sfi_map_memory, phys = ef180, size = 32
[ 0.000000] SFI: Entering sfi_map_table, pa = 92000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 24
[ 0.000000] SFI: sfi_map_table, th = ffffffffff4ba000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 1000
[ 0.000000] SFI: Entering sfi_map_table, pa = ef1c8
[ 0.000000] SFI: sfi_map_table, th = ffffffffff47a1c8
[ 0.000000] SFI: Entering sfi_map_table, pa = ef000
[ 0.000000] SFI: sfi_map_table, th = ffffffffff47a000
[ 0.000000] SFI: Entering sfi_map_memory, phys = ef000, size = 312
[ 0.000000] SFI: Entering sfi_map_table, pa = 92000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 24
[ 0.000000] SFI: sfi_map_table, th = ffffffffff4ba000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 1000
[ 0.000000] SFI: Entering sfi_map_table, pa = 92000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 24
[ 0.000000] SFI: sfi_map_table, th = ffffffffff4ba000
[ 0.000000] SFI: Entering sfi_map_memory, phys = 92000, size = 1000
[ 0.000000] SFI: Entering sfi_map_table, pa = ef1c8
[ 0.000000] SFI: sfi_map_table, th = ffffffffff47a1c8
[ 0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
[ 0.000000] PCI: Unassigned devices with 32bit resource registers may break!
[ 0.010000] SFI: Entering sfi_map_memory, phys = ef180, size = 48
[ 0.010000] SFI: Entering sfi_map_table, pa = 92000
[ 0.010000] SFI: Entering sfi_map_memory, phys = 92000, size = 24
[ 0.010000] SFI: sfi_map_table, th = ffffc90000000000
[ 0.010000] SFI: Entering sfi_map_memory, phys = 92000, size = 1000
[ 0.010000] SFI: Entering sfi_map_table, pa = ef1c8
[ 0.010000] SFI: sfi_map_table, th = ffff8800000ef1c8
[ 0.010000] SFI: Entering sfi_map_table, pa = ef000
[ 0.010000] SFI: sfi_map_table, th = ffff8800000ef000
[ 0.010000] SFI: Entering sfi_map_memory, phys = ef000, size = 312
[ 0.010000] SFI: Entering sfi_map_table, pa = 92000
[ 0.010000] SFI: Entering sfi_map_memory, phys = 92000, size = 24
[ 0.010000] SFI: sfi_map_table, th = ffffc90000000000
[ 0.010000] SFI: Entering sfi_map_memory, phys = 92000, size = 1000
[ 0.010000] SFI: Entering sfi_map_table, pa = ef1c8
[ 0.010000] SFI: sfi_map_table, th = ffff8800000ef1c8
[ 0.010000] SFI: Entering sfi_map_table, pa = ef000
[ 0.010000] SFI: sfi_map_table, th = ffff8800000ef000
[ 0.010000] SFI: Entering sfi_map_memory, phys = ef000, size = 312
[ 2.634597] Initramfs unpacking failed: junk in compressed archive
[ 4.061908] i8042: Can't read CTR while initializing i8042
[ 6.704162] Have you set virtblk file?
[ 13.239568] [ pm_scif_init : 348 ]:==> pm_scif_init
[ 13.239590] [ pm_scif_init : 349 ]:pm_scif insmoded
[ 13.239643] [ pm_scif_init : 377 ]: scif_bind successfull. Local port number = 1089, ep =
[ 13.240538] [ pm_recv_from_host : 182 ]:==> pm_recv_from_host
[ 13.240574] [ pm_handle_open : 88 ]:==> pm_handle_open
[ 13.240707] [ pm_recv_from_host : 182 ]:==> pm_recv_from_host
Intel MIC Platform Software Stack (Built by Poky 7.0) 3.2.1 m40-mic0 hvc0
Here is the boot log from a failed boot when verbose logging is enabled:
Unmounting local filesystems...
[ 96.880582] Preparing to shutdown kernel
[ 96.880612] md: stopping all md devices.
[ 98.013776] card: scif node 1 exiting
[ 98.018264] Deregistered interrupt handler for node 0, for IRQ = 17,handle = 0
[ 98.072678] Back from notifier call
[ 98.073544] System halted.
[ 98.074980] mic_shutdown: system state 2 dbreg 0x80000002
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.38.8+mpss3.2.1 (build@yocto-182-71) (gcc version 4.7.0 20110509 (experimental) (GCC) ) #1 SMP Wed Apr 2 08:52:20 PDT 2014
[ 0.000000] Command line: card=0 vnet=dma scif_id=1 scif_addr=0x8474fae780 vnet_addr=0x84761c0118 vcons_hdr_addr=0x8474fa5440 virtio_addr=[ 7.420826] RAS.init: module operational
[ 13.061375] Module mpssboot loaded at 0xffffffffa0003000
[ 13.069231] MPSSBOOT Time of day sycned with host
[ 13.096002] Module pm_scif loaded at 0xffffffffa0016000
[ 13.101490] [ pm_scif_init : 348 ]:==> pm_scif_init
[ 13.101516] [ pm_scif_init : 349 ]:pm_scif insmoded
[ 13.101559] [ pm_scif_init : 377 ]: scif_bind successfull. Local port number = 1089, ep =
[ 13.102659] [ pm_recv_from_host : 182 ]:==> pm_recv_from_host
[ 13.102695] [ pm_handle_open : 88 ]:==> pm_handle_open
[ 13.102792] [ pm_recv_from_host : 182 ]:==> pm_recv_from_host
[ 13.688581] Module blcr_imports loaded at 0xffffffffa0009000
[ 13.712853] Module blcr loaded at 0xffffffffa009b000
[ 13.734372] blcr: vmadump: (from bproc-"4.0.0pre8") Erik Hendriks <erik@hendriks.cx>
[ 13.734405] blcr: vmadump: Modified for blcr 0.8.5 <http://ftg.lbl.gov/checkpoint>
[ 13.734427] blcr: Berkeley Lab Checkpoint/Restart (BLCR) module version 0.8.5.
[ 13.734448] blcr: Parameter cr_io_max = 0x4000000
[ 13.734461] blcr: Supports kernel interface version 0.10.3.
[ 13.734477] blcr: Supports context file format versions 8 though 9.
[ 13.734495] blcr: http://ftg.lbl.gov/checkpoint
[ 13.809241] MPSSBOOT Boot acknowledged
Intel MIC Platform Software Stack (Built by Poky 7.0) 3.2.1 m40-mic0 hvc0
[root@m40 ~]# miccheck
MicCheck 3.2.1-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
Test 0: Check number of devices the OS sees in the system ... pass
Test 1: Check mic driver is loaded ... pass
Test 2: Check number of devices driver sees in the system ... pass
Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
Test 5 (mic0): Check ras daemon is available in device ... pass
Test 6 (mic0): Check running flash version is correct ... pass
Executing default tests for device: 1
Test 7 (mic1): Check device is in online state and its postcode is FF ... pass
Test 8 (mic1): Check ras daemon is available in device ... pass
Test 9 (mic1): Check running flash version is correct ... pass
Status: OK
[root@m40 ~]# micctrl --config
mic0:
=============================================================
Config Version: 1.1
Linux Kernel: /usr/share/mpss/boot/bzImage-knightscorner
BootOnStart: Enabled
Shutdowntimeout: 300 seconds
ExtraCommandLine: highres=off
PowerManagment: cpufreq_on;corec6_on;pc3_on;pc6_on
Root Device: Dynamic Ram Filesystem /var/mpss/mic0.image.gz from:
Base: CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
Overlay Filelist /opt/intel/mic/ofed/ /opt/intel/mic/ofed/ofed.filelist on
Overlay RPM /opt/intel/mic/filesystem on
CommonDir: Directory /var/mpss/common
Micdir: Directory /var/mpss/mic0
Network: Static bridge br0
MIC IP: 10.10.5.40
Host IP: 10.10.4.40
Net Bits: 16
NetMask: 255.255.0.0
MtuSize: 1500
Hostname: m40-mic0
MIC MAC: 4c:79:ba:4c:01:18
Host MAC: 4c:79:ba:4c:01:19
Cgroup:
Memory: Enabled
Console: hvc0
VerboseLogging: Enabled
CrashDump: /var/crash/mic 16GB
mic1:
=============================================================
Config Version: 1.1
Linux Kernel: /usr/share/mpss/boot/bzImage-knightscorner
BootOnStart: Enabled
Shutdowntimeout: 300 seconds
ExtraCommandLine: highres=off
PowerManagment: cpufreq_on;corec6_on;pc3_on;pc6_on
Root Device: Dynamic Ram Filesystem /var/mpss/mic1.image.gz from:
Base: CPIO /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
Overlay Filelist /opt/intel/mic/ofed/ /opt/intel/mic/ofed/ofed.filelist on
Overlay RPM /opt/intel/mic/filesystem on
CommonDir: Directory /var/mpss/common
Micdir: Directory /var/mpss/mic1
Network: Static bridge br0
MIC IP: 10.10.6.40
Host IP: 10.10.4.40
Net Bits: 16
NetMask: 255.255.0.0
MtuSize: 1500
Hostname: m40-mic1
MIC MAC: 4c:79:ba:4c:00:be
Host MAC: 4c:79:ba:4c:00:bf
Cgroup:
Memory: Enabled
Console: hvc0
VerboseLogging: Enabled
CrashDump: /var/crash/mic 16GB
Tommi,
We are working on this issue for you. As soon as I get an update, I'll let you know.
Regards
--
Taylor
Tommi,
The experts want to know what you have done to isolate the problem.
- Are the cards rebooted the same way each time - for example, by issuing "service mpss restart"? If not, what is the process?
- Is the problem reproducible if the card hasn't been touched? For example, once the coprocessors are both working, can you restart mpss 10 times in a row and get at least one failure?
- Is this reproducible on multiple hosts? (if not, have the cards been re-seated....?)
- Is it sensitive to multi-card installs - can it be reproduced with only 1 card installed?
Regards
--
Taylor
Taylor Kidd (Intel) wrote:
Tommi,
The experts want to know what you have done to isolate the problem.
- Are the cards rebooted the same way each time - for example, by issuing "service mpss restart"? If not, what is the process?
Yes, service mpss restart.
- Is the problem reproducible if the card hasn't been touched? For example, once the coprocessors are both working, can you restart mpss 10 times in a row and get at least one failure?
Yes, I'd say the probability that a boot fails is over 50%.
- Is this reproducible on multiple hosts? (if not, have the cards been re-seated....?)
Yes, I have 45 nodes with 2 MICs each, and it is not a host-specific issue. Another user is facing the same issue:
https://software.intel.com/en-us/forums/topic/508661#comment-1787816
- Is it sensitive to multi-card installs - can it be reproduced with only 1 card installed?
We have "only" multi-card nodes, and they are water cooled, so it is not possible to remove a card from a node.
Is it possible to start mpss with only one MIC at a time?
Or to insert some delay between MIC startups?
micctrl -r mic0
sleep 2
micctrl -r mic1
sleep 2
micctrl -b mic0
micctrl -w mic0 # waits until boot 0 is done
micctrl -b mic1
I forgot - first edit:
/etc/mpss/mic[0,1].conf
set
BootOnStart Enabled
to
BootOnStart Disabled
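With BootOnStart disabled, the serial boot sequence could be wrapped in a small function so it works for any number of cards. This is only a sketch: it uses nothing beyond the micctrl flags shown in this thread (-r, -b, -w); the function name boot_serially and the RESET_DELAY variable are made up for illustration. Unlike the hand-typed sequence, it also waits for the last card.

```shell
#!/bin/bash
# boot_serially CARD...: reset every card, then boot them one at a time.
# RESET_DELAY (hypothetical knob) is the pause in seconds after each reset.
boot_serially() {
    local delay=${RESET_DELAY:-2} mic
    for mic in "$@"; do
        micctrl -r "$mic"
        sleep "$delay"
    done
    for mic in "$@"; do
        micctrl -b "$mic"
        micctrl -w "$mic"   # wait until this card has finished booting
    done
}
```

Usage would be something like: boot_serially mic0 mic1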
Thanks for the hint. I changed the configuration so that the MICs boot serially, but it did not help :-(
I rebooted 28 nodes and here are the results:
9 nodes had both MICs up.
6 nodes had only mic0 up correctly.
9 nodes had only mic1 up correctly.
4 nodes had both MICs fail.
I reimaged all nodes, so there are no configuration differences between nodes.
Tommi,
Here are the experts' requests:
- The line “[ 13.809241] MPSSBOOT Boot acknowledged” indicates that the card did boot. Tell us the card's state using “micctrl -s”.
- Generally, when this occurs it indicates a network setup issue. I would start by using minicom to log into the card through the virtual console and looking at the network config.
- Do a “mkdir unpack; cd unpack; zcat /var/mpss/mic0.image.gz | (cpio -iv; cpio -iv)” and see whether the initrd image unpacks on the host or not.
- Send me the initrd image created by mpssd and defined by the RootDevice parameter (usually /var/mpss/mic0.image.gz).
You can send the image to me in a private message. You might have to change its name so that it makes it past the virus filters. If that still doesn't work, let me know and we can do it via email.
Regards
--
Taylor
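As a quick complement to the cpio unpack test, the compressed stream itself can be checked with gzip -t before sending the image anywhere. This is a rough sketch; the function name check_image is made up for illustration. It only verifies gzip integrity, so a passing result does not rule out problems inside the cpio archives.

```shell
#!/bin/bash
# check_image IMAGE: report whether IMAGE is a readable gzip stream.
# Returns non-zero if gzip reports an error or a warning such as trailing garbage.
check_image() {
    local image=$1
    if gzip -t "$image" 2>/dev/null; then
        echo "ok: $image"
    else
        echo "corrupt: $image"
        return 1
    fi
}
```

For example: check_image /var/mpss/mic0.image.gz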
Taylor Kidd (Intel) wrote:
Tommi,
Here are the experts' requests:
- The line “[ 13.809241] MPSSBOOT Boot acknowledged” indicates that the card did boot. Tell us the card's state using “micctrl -s”.
- Generally, when this occurs it indicates a network setup issue. I would start by using minicom to log into the card through the virtual console and looking at the network config.
- Do a “mkdir unpack; cd unpack; zcat /var/mpss/mic0.image.gz | (cpio -iv; cpio -iv)” and see whether the initrd image unpacks on the host or not.
- Send me the initrd image created by mpssd and defined by the RootDevice parameter (usually /var/mpss/mic0.image.gz).
You can send the image to me in a private message. You might have to change its name so that it makes it past the virus filters. If that still doesn't work, let me know and we can do it via email.
Hi, my post was a bit unclear. The MICs do boot up, but because of the initramfs unpacking error the MIC uses the "wrong" ssh_host_key. I use micctrl --hostkeys=/opt/intel/mic_host_keys/, which adds my cluster host keys to the overlay file system /var/mpss/mic0,1/etc/ssh/. See https://software.intel.com/en-us/forums/topic/508661#comment-1787816
I can extract /var/mpss/mic0,1/image.gz on the host without errors.
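#!/bin/bash
# Workaround: reboot each coprocessor until it answers over ssh.
for mic in mic0 mic1; do
    status=1
    while [ $status -eq 1 ]; do
        ssh "$mic" uname -r
        if [ $? -ne 0 ]; then
            # ssh failed, so shut the card down and boot it again
            micctrl --shutdown "$mic"
            micctrl -w "$mic"
            micctrl -b "$mic"
            micctrl -w "$mic"
        else
            status=0
        fi
    done
done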
Hi Tommi,
We are still working on an answer. I hope to get back to you soon.
Regards
---
Taylor
I still have zero visibility into what the error could be, so here are some more questions.
1. Did you run the command "zcat /var/mpss/mic0.image.gz | (cpio -v; cpio -iv)" on the host? Note the double cpio.
2. If so, does the output always have your host keys, or does it change?
3. How did you get visibility of the unpacking error message?
4. I notice you have an RPM overlay at /opt/intel/mic/filesystem. What is in this directory?
5. Can I get access to your mic0.image.gz file so I can analyze it for errors?
Johnnie P. wrote:
I still have zero visibility into what the error could be, so here are some more questions.
1. Did you run the command "zcat /var/mpss/mic0.image.gz | (cpio -v; cpio -iv)" on the host? Note the double cpio.
2. If so, does the output always have your host keys, or does it change?
3. How did you get visibility of the unpacking error message?
4. I notice you have an RPM overlay at /opt/intel/mic/filesystem. What is in this directory?
5. Can I get access to your mic0.image.gz file so I can analyze it for errors?
1. Yes, but the first cpio command is not valid; the second one extracts the image without errors:
[root@m37 asdasd]# zcat /var/mpss/mic0.image.gz | (cpio -v; cpio -i)
cpio: You must specify one of -oipt options.
Try `cpio --help' or `cpio --usage' for more information.
106060 blocks
[root@m37 asdasd]# echo $?
0
rpm -q cpio
cpio-2.10-11.el6_3.x86_64
2. ssh_host_keys are not inside the image file.
3. The error message is visible on the serial console if verbose logging is disabled.
4. Nothing
5. Check your message box
I got a copy of the image file, and on my Red Hat 6.2 host I cannot unpack it with cpio. In the second cpio section I see it find the etc/rc5.d directory, and then I get the error:
cpio:Substituting '.' for empty member
cpio: premature end of file.
I notice from your mic0.conf file that you have not upgraded to release 3.2. The MicDir parameter still makes use of the mic0.filelist file. To debug this further I will need to see the contents of that file. It would be better if you sent me the whole mic0 directory so I can try to use it to reproduce this.
I would also suggest upgrading to the 3.2 release. The use of the filelist file for MicDir and CommonDir has been removed. The card's file system will be created with the files having the same user and permissions as they have on the host. There have been a number of fixes to all areas of micctrl, and some of them may have an effect on this issue.
Sorry, I may have copied the ramdisk image I downloaded from the puuppa site into a directory where there was already other stuff, and confused the issue. So I need to change some of the questions.
What release do you have installed?
Can I get a copy of the files in the /var/mpss/mic0 directory so I can try the same thing here?
Also, have you checked the log files on the host to see if there are any file system errors?
And an obscure fact to think about is that the image files are remade every time the coprocessors are booted. So the only image file that counts when it comes to solving this problem is the one that was created during a failed boot attempt.
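Because the image is rebuilt on every boot, it may help to copy it aside right after a failed boot, before the next restart overwrites it. A minimal sketch, assuming the image path from the micctrl --config output earlier in this thread; the function name snapshot_image and the destination directory in the usage line are made up for illustration.

```shell
#!/bin/bash
# snapshot_image IMAGE DEST_DIR: copy IMAGE into DEST_DIR under a
# timestamped name (preserving mode and mtime) and print the new path.
snapshot_image() {
    local image=$1 dest=$2 copy
    copy="$dest/$(basename "$image").$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" && cp -p "$image" "$copy" && printf '%s\n' "$copy"
}
```

For example, after a failed boot of mic0: snapshot_image /var/mpss/mic0.image.gz /root/failed-images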
Tommi,
Thank you for identifying the issue in the release notes.
Intel Tracking ID: 4868776
Affected OS: All Linux
Description: [Tools] MPSS-3.2 generates corrupted mic0.img.gz file if /var/mpss/mic0/... contains softlinks to files not existing in that file system tree
Notes: Investigating
For documentation purposes, the status is no longer "investigating". The solution is to update to MPSS-3.2.3 with OFED-1.5.4.1 instead of OFED-3.5-2-MIC-BETA.
Regards
--
Taylor
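For anyone who cannot update right away, it may be worth checking whether the card's overlay tree contains such softlinks. The sketch below is an assumption about how one might look for them, not an Intel tool: the function name list_dangling_links is made up, and it treats the given directory as the root of the file system, so absolute link targets are looked up inside the tree.

```shell
#!/bin/bash
# list_dangling_links ROOT: print every symlink under ROOT whose immediate
# target does not exist when ROOT is treated as the file system root.
list_dangling_links() {
    local root=${1%/} link target resolved
    find "$root" -type l -print0 | while IFS= read -r -d '' link; do
        target=$(readlink "$link")
        case $target in
            /*) resolved="$root$target" ;;               # absolute: look inside the tree
            *)  resolved="$(dirname "$link")/$target" ;;  # relative: next to the link
        esac
        # Accept existing files and links to links; report everything else.
        [ -e "$resolved" ] || [ -L "$resolved" ] || printf '%s -> %s\n' "$link" "$target"
    done
}
```

For example: list_dangling_links /var/mpss/mic0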
