
After mpss restart mic freezes

Bsc_Cns
Novice

Hello,

We have a Xeon Phi Coprocessor, firmware version 2.1.01.0375. Every time we restart the mpss service, we have to remove the mic packages from the host and reinstall them to bring the mic back online. Otherwise it hangs, and the only way I can see anything is to connect to the mic console using minicom. Here is what repeats over and over in the mic console:

[ 842.813622] INFO: task kworker/u:0:5 blocked for more than 120 seconds.
[ 842.813643] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Installed packages on our system (SLES 11 SP2):

intel-mic-kmod-2.1.4982-15.3.0.13.0.suse
intel-mic-gpl-2.1.4982-15.suse
intel-mic-2.1.4982-15.suse
intel-mic-flash-0375-15.suse
intel-mic-gdb-2.1.4982-15.suse
intel-mic-sysmgmt-2.1.4982-15.suse

Before the last driver upgrade, I was able to restart the mpss service without any problem. Has anyone had the same issue?

Thanks in advance.

5 Replies
William_Arasin
Beginner

Since upgrading my system to the same version of the MPSS software, I've seen a similar problem. I used to be able to reboot the mics and have them come up, but after the upgrade the only way to bring them online was to reboot the host. I'm running CentOS 6.3, not SLES 11 SP2. I changed the power management settings for the mics in /etc/sysconfig/mic/mic*.conf from "cpufreq_on;corec6_off;pc3_on;pc6_off" to "cpufreq_off;corec6_off;pc3_off;pc6_off". Since then the cards have stopped crashing when idle, and I can reboot them without rebooting the host. I don't know whether turning off pc3 will damage the unit, but I have been told that turning off cpufreq is safe. I've had a ticket open for a week or two about the mic crashing while idle after the MPSS upgrade, but I haven't heard anything new.
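
For anyone who wants to try the same change, here is a minimal Python sketch of the string rewrite described above. It only transforms the settings string itself. The function name disable_power_states is made up for illustration, and the sketch assumes nothing about the rest of the mic*.conf syntax, so you would still paste the result into the file by hand. Whether turning off every state is safe for the card is not something this thread confirms.

```python
def disable_power_states(setting: str) -> str:
    """Switch every '<name>_on' entry in a ';'-separated settings string to '<name>_off'.

    Hypothetical helper for illustration only; it is not part of MPSS.
    """
    entries = [e.strip() for e in setting.split(";") if e.strip()]
    result = []
    for entry in entries:
        # Split on the last underscore so names containing '_' stay intact.
        name, sep, state = entry.rpartition("_")
        if sep and state == "on":
            entry = f"{name}_off"
        result.append(entry)
    return ";".join(result)


before = "cpufreq_on;corec6_off;pc3_on;pc6_off"
print(disable_power_states(before))
# prints: cpufreq_off;corec6_off;pc3_off;pc6_off
```

Entries that are already off are left untouched, so running it on an already modified string changes nothing.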

Bsc_Cns
Novice

The Intel Support guys told me that it seems there is some kind of memory issue. I saw this using minicom:

[71949.257374] br0: port 2(mic0) entering forwarding state
[73317.804716] MIC 0 Resetting (Post Code 3d)
[73318.802690] MIC 0 Resetting (Post Code 3E)
[73324.790626] MIC 0 Resetting (Post Code 09)
[73325.788584] MIC 0 Resetting (Post Code 10)
[73326.786592] MIC 0 Resetting (Post Code 12)

And this is what the Intel Support guys told me:

'POST code 3d and 3E indicate a memory training issue with the KNC card.'
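
If you want to check a captured console log for the same codes, a rough Python sketch along these lines may help. It assumes the log lines look exactly like the ones pasted above. The helper name memory_training_hits is hypothetical, and the only codes it flags are 3d and 3E, since those are the only ones Intel Support named here.

```python
import re

# Matches lines such as: "[73317.804716] MIC 0 Resetting (Post Code 3d)"
POST_LINE = re.compile(r"MIC (\d+) Resetting \(Post Code ([0-9A-Fa-f]{2})\)")

# Codes that, per the support reply quoted in this thread, point at memory training.
MEMORY_TRAINING_CODES = {"3D", "3E"}


def memory_training_hits(log_text):
    """Return (mic_index, post_code) pairs for the flagged POST codes."""
    hits = []
    for line in log_text.splitlines():
        match = POST_LINE.search(line)
        if match and match.group(2).upper() in MEMORY_TRAINING_CODES:
            hits.append((int(match.group(1)), match.group(2).upper()))
    return hits


sample = """\
[71949.257374] br0: port 2(mic0) entering forwarding state
[73317.804716] MIC 0 Resetting (Post Code 3d)
[73318.802690] MIC 0 Resetting (Post Code 3E)
[73324.790626] MIC 0 Resetting (Post Code 09)
"""
print(memory_training_hits(sample))
# prints: [(0, '3D'), (0, '3E')]
```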

Pierre_Lagier
Beginner

Hi,

Similar problem here. After a full power-off with the cable unplugged, we saw the old MICs come back to the ready state, and we did a complete reinstall of the previous KNC software version. So far all is OK. However, the latest MICs (5110P) are working like a dream with the latest KNC software. The question is how far we can go with different KNC software versions across a cluster that mixes two kinds of MICs. We'll see soon: the next big MPI run is tomorrow! Any clues from Intel friends are more than welcome.

Pierre.

kankamuso
Beginner

Bsc Cns wrote:

The Intel Support guys told me that it seems there is some kind of memory issue. I saw this using minicom:

[71949.257374] br0: port 2(mic0) entering forwarding state
[73317.804716] MIC 0 Resetting (Post Code 3d)
[73318.802690] MIC 0 Resetting (Post Code 3E)
[73324.790626] MIC 0 Resetting (Post Code 09)
[73325.788584] MIC 0 Resetting (Post Code 10)
[73326.786592] MIC 0 Resetting (Post Code 12)

And this is what the Intel Support guys told me:

'POST code 3d and 3E indicate a memory training issue with the KNC card.'

Is this a hardware problem or a software one? I am seeing the same error codes here and don't know whether I should return the card already...

Thanks,

jose

Bsc_Cns
Novice

I don't know either. After replacing the mic card, the problem persists. It may have something to do with the driver version, because it did not happen with older versions: I could restart the mpss service without a problem.

Let's see what the support guys say about it.
