
Unknown header type 7f

Jacob_F_
Beginner

I'm running RHEL 7.0 and the system seems to have a problem talking to the Phi card.

This is what I see in lspci:

03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev ff) (prog-if ff)

        !!! Unknown header type 7f

        Kernel driver in use: mic

 

I've attached the micdebug log.

Frances_R_Intel
Employee

I want to thank you for including the micdebug output. It was useful in eliminating a number of possibilities.

I suspect what is going on is that your card is overheating. You might want to look at https://software.intel.com/en-us/forums/topic/532366, where they were also seeing the unknown header type message after the system had been up for a few minutes. They also had a problem with the BIOS version, but a card that returns nothing but strings of 1s has basically given up and shut down.

You can check this by unplugging the host, letting everything come back to room temperature, then powering the system back up and checking lspci as soon as the host is back up. You can use micsmc (there is a man page) to monitor the temperature.

(From the micdebug output, it looks like the card might have come up right after it was installed but didn't stay up long. It also looks like you might have run 'micctrl --initdefaults' after you installed the MPSS but it didn't complete: /etc/mpss/default.conf is there but /etc/mpss/mic0.conf is either missing or corrupted. Do you know if that is true?)
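For anyone who wants to test for the "strings of 1s" pattern without eyeballing the lspci output, here is a minimal sketch in Python. It assumes the lspci text has the same shape as the output posted in this thread. The function names are made up for illustration and are not part of MPSS or pciutils. The second helper shows why 7f appears: when the header type register reads back as ff, the top bit is the multi-function flag, so lspci reports only the low seven bits.

```python
# Markers taken from the lspci output in this thread. Every one of them is
# what a config-space read of 0xff looks like once lspci has formatted it.
ALL_ONES_MARKERS = ("(rev ff)", "(prog-if ff)", "Unknown header type 7f")


def looks_like_all_ones(lspci_text):
    """Return True if one device's lspci block shows all three markers.

    Hypothetical helper: it only inspects text, it does not touch the device.
    """
    return all(marker in lspci_text for marker in ALL_ONES_MARKERS)


def reported_header_type(raw_header_type):
    """Mask off bit 7 (the multi-function flag) of the header type byte."""
    return raw_header_type & 0x7F
```

Feed it the text for a single device, for example the output of 'lspci -vv -s 03:00.0'. A True result only says the card is not answering config reads; it does not say why.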

In any event, this may be an issue you need to take back to the supplier for your host system to make sure it is configured correctly for the coprocessor. Let us know what you find out.

 

Jacob_F_
Beginner

I do remember seeing some weirdness with micctrl --initdefaults before.

I tried uninstalling MPSS, powering down to let it cool off, then reinstalling.
When I got to the modprobe mic step I got this:

sudo modprobe mic
Message from syslogd@monster at Apr  8 11:23:33 ...
 kernel:BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:2:263]
BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:2:263]
rcu_sched self-detected stall on CPU {0} )t=60000 jiffies g=2492 c=2491 q=0)
ETC timer compensation(-1000000ppm) is much higherthan expected

Then when I ran micctrl --initdefaults:
sudo micctrl --initdefaults
[Warning] mic0: Generating compatibility network config file /opt/intel/mic/filesystem/mic0/etc/sysconfig/network/ifcfg-mic0 for IDB.
[Warning]       This may be problematic at best and will be removed in a future release, Check with the IDB release.

I've attached my latest micdebug.

Thanks Frances, let me know if there's anything else I can try, or why you think the CPUs might be getting into soft lockup.

lu_S_
Beginner


Thank you for your attention. I have the same problem with the lspci output:

84:00.0 Co-processor:  Intel Corporation Xeon Phi coprocessor 31S1 (rev ff) (prog-if ff)

            !!!Unknown header type 7f

            Kernel driver in use: mic

After the whole system had cooled to room temperature, I powered it on and the lspci output was the same. When I tried to use micsmc -t to see mic0's temperature, I got this error message:

Error: mic0: unable to determin device status: get post code: read: /sys/class/mic/mic0/post_code: No such device or address

The output of micdebug.sh is attached.

I would really appreciate your help!

JJK
New Contributor III

hi,

Just out of curiosity: can you try unloading the mic driver and then running 'lspci -vv -s 84:0' again?

 

JJK

 

Frances_R_Intel
Employee

I sat down and walked my way through all the information Lu S. sent and I still think this is an overheating problem.

In the messages log, we can see the coprocessor booting successfully during the host boot.

Jul 13 18:25:32 localhost kernel: mic0: Transition from state ready to booting
Jul 13 18:25:32 localhost kernel: mic image: /usr/share/mpss/boot/rasmm-kernel.knightscorner-ab.elf
Jul 13 18:25:32 localhost kernel: MIC 0 Booting
Jul 13 18:25:32 localhost kernel: mic0: Transition from state booting to online
Jul 13 18:25:32 localhost kernel: ELF booted succesfully

There is nothing to show what the lspci output was at that time, but the card cannot boot if the mic kernel module cannot read the header. So at that time the header must have been valid.

However, the coprocessor doesn't stay up for long.

Jul 13 18:26:50 localhost kernel: mic0: Transition from state online to resetting
Jul 13 18:26:52 localhost kernel: mic0: Resetting (Post Code 3C)
Jul 13 18:26:53 localhost kernel: mic0: Resetting (Post Code 3d)
Jul 13 18:26:54 localhost kernel: mic0: Resetting (Post Code 3d)
Jul 13 18:26:55 localhost kernel: mic0: Resetting (Post Code 3d)
Jul 13 18:26:56 localhost kernel: mic0: Resetting (Post Code 3d)
Jul 13 18:26:57 localhost kernel: mic0: Resetting (Post Code 3d)
Jul 13 18:26:58 localhost kernel: mic0: Resetting (Post Code 3E)
Jul 13 18:26:59 localhost kernel: mic0: Resetting (Post Code 3E)
Jul 13 18:27:00 localhost kernel: mic0: Resetting (Post Code 3E)
Jul 13 18:27:01 localhost kernel: mic0: Resetting (Post Code 09)
Jul 13 18:27:02 localhost kernel: mic0: Resetting (Post Code 09)
Jul 13 18:27:03 localhost kernel: mic0: Resetting (Post Code 12)
Jul 13 18:27:03 localhost kernel: mic0: Transition from state resetting to ready

This is followed by a couple of attempts to bring up the network connection to the coprocessor, which fail because the coprocessor isn't online.

Then the coprocessor reboots.

Jul 13 18:34:28 localhost kernel: mic0: Transition from state ready to booting
Jul 13 18:34:28 localhost kernel: mic image: /usr/share/mpss/boot/rasmm-kernel.knightscorner-ab.elf
Jul 13 18:34:28 localhost kernel: MIC 0 Booting
Jul 13 18:34:28 localhost kernel: mic0: Transition from state booting to online
Jul 13 18:34:28 localhost kernel: ELF booted succesfully

But this time, the coprocessor barely makes it up before it comes down again.

Jul 13 18:38:28 localhost kernel: mic0: Transition from state online to resetting
Jul 13 18:38:29 localhost kernel: Invalid Postcode : ��
Jul 13 18:38:30 localhost kernel: mic0: Resetting (Post Code ��
Jul 13 18:38:30 localhost kernel: mic0: Transition from state resetting to reset failed
Jul 13 18:38:30 localhost kernel: MIC 0 RESETFAIL postcode ��1

and apparently brings the mpss daemon down with it, since the service needs to be restarted.

So, at system boot, the header was valid and the coprocessor booted but by the time the host was up in multi-user mode and Lu S. was able to run lspci, the coprocessor had shut itself down.
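If you want to pull the same timeline out of your own messages log, the sketch below shows one way to do it in Python. It assumes the log lines look like the "Transition from state X to Y" excerpts quoted above. The function names are hypothetical. Syslog timestamps carry no year, so the year argument is a placeholder that only matters if the log crosses a year boundary.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Jul 13 18:25:32 localhost kernel: mic0: Transition from state booting to online
TRANSITION = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"(?P<card>mic\d+): Transition from state (?P<old>.+?) to (?P<new>.+)$"
)


def parse_transitions(lines, year=2000):
    """Return (timestamp, card, old_state, new_state) for each transition line.

    The year is an assumption; the log format itself does not record it.
    """
    found = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
            )
            found.append((stamp, match["card"], match["old"], match["new"]))
    return found


def online_durations(transitions):
    """Return the seconds spent in the 'online' state for each boot."""
    durations = []
    went_online = None
    for stamp, _card, old, new in transitions:
        if new == "online":
            went_online = stamp
        elif old == "online" and went_online is not None:
            durations.append((stamp - went_online).total_seconds())
            went_online = None
    return durations
```

Pass it the lines of /var/log/messages (or wherever your distribution keeps kernel messages). Short online periods that end in a reset, on a card that then reads back as all 1s, fit the pattern described above. The script shows only the timing, not the cause.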

People who have been seeing this behavior might want to contact their supplier to confirm that the card is working properly, then check out the posts in this forum where people have been talking about solutions for cooling their cards in systems which do not provide adequate cooling by default.

 
