
CentOS crashed after modprobe mic

victor_l_1
Beginner

I am trying to install the driver to set up my Intel Phi 7120P on CentOS 6.4. I followed the getting started guide on a fresh CentOS 6.4 installation. Installing the RPMs went fine until I entered the command modprobe mic. The OS froze and I couldn't do anything except power off the PC. When I restarted the PC, an error came up and I wasn't able to boot into CentOS. Here are the errors I got:

cpu 2: machine check exception: 5 bank 20: be200000000c110a
rip !inexact! 10:<ffffffff81050da2> {nr_iowait_cpu+0x22/0x30}
tsc 2690e807de addr fb31ab00 misc e8fc391600802086
processor 0:306e4 time 1421253495 socket 0 apic 4
some cpus didn't answer in synchronization
machine check: processor context corrupt

kernel panic - not syncing: fatal machine check on current cpu
pid: 0, comm: swapper tainted: g M ---------------- 2.6.32-358.el6.x86_64 #1
call trace:
<#MC> [<ffffffff8150cfc8>] ? panic+0xa7/0x16f
[<ffffffff81025f2f>] ? mce_panic+0x20f/0x230
[<ffffffff810271b3>] ? do_machine_check+0x723/0xa70
[<ffffffff81050da2>] ? nr_iowait_cpu+0x22/0x30
[<ffffffff815104fc>] ? machine_check+0x1c/0x30
[<ffffffff81050da2>] ? nr_iowait_cpu+0x22/0x30
<<EOE>> <IRQ> [<ffffffff810a8287>] ? update_ts_time_stats+0x67/0xa0
[<ffffffff8109dd08>] ? sched_clock_cpu+0xb8/0x110
[<ffffffff810a82ed>] ? tick_nohz_stop_idle+0x2d/0x50
[<ffffffff810a83eb>] ? tick_check_idle+0xdb/0xe0
[<ffffffff81076edc>] ? irq_enter+0x6c/0x80
[<ffffffff81516d53>] ? smp_apic_timer_interrupt+0x43/0x9b
[<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
[<ffffffff812d3791>] ? intel_idle+0xb8/0x110
[<ffffffff8109dd08>] ? sched_clock_cpu+0xb8/0x110
[<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
[<ffffffff81506b1c>] ? start_secondary+0x2ac/0x2ef
panic occurred, switching back to text console
Rebooting in 30 seconds..

Frances_R_Intel
Employee

I've never heard of a case where loading the mic kernel module corrupted the kernel. That doesn't mean it can't happen but I've never heard of it. When you say you followed the getting started guide, do you mean the directions in the readme.txt file? Those would be the right directions. And did you recompile the kernel module, following the directions in section 2.1 of the readme.txt file? That needs to be done if the kernel you are using does not exactly match one of the precompiled versions in mpss<version>/modules. 
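A quick way to see whether a precompiled module exists for the running kernel is to compare the kernel release string (what `uname -r` prints) against the file names in the modules directory. The sketch below assumes the precompiled packages embed the kernel release in their file names; that naming convention and the function name `find_matching_modules` are assumptions for illustration, not something taken from the readme.

```python
import os
import platform


def find_matching_modules(modules_dir, kernel_release=None):
    """Return the entries in modules_dir whose names contain the kernel release.

    Assumption: precompiled module packages carry the kernel release
    (for example 2.6.32-358.el6.x86_64) somewhere in their file name.
    """
    if kernel_release is None:
        # platform.release() returns the same string as `uname -r` on Linux.
        kernel_release = platform.release()
    return sorted(
        name for name in os.listdir(modules_dir) if kernel_release in name
    )


# Example call (pass the path to your own unpacked MPSS modules directory):
#   matches = find_matching_modules("/path/to/mpss/modules")
#   An empty list suggests the module needs to be rebuilt for this kernel.
```

If the list comes back empty, that would point toward rebuilding the module as described in section 2.1 of the readme.txt file.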

victor_l_1
Beginner

Yes, I was following the getting started guide, which directs me to the readme.txt file of MPSS. The version of MPSS I used was mpss-3.4.2-linux. I have also tried compiling the kernel module following the directions in section 2.1, but it led to the same result (crash after modprobe mic, then CentOS can't boot). However, if I take the card out, I am able to boot CentOS.
 

victor_l_1
Beginner

Any update on the problem? Could that be a defective product?

Frances_R_Intel
Employee

Defective part? Possibly. It might be useful to check with your supplier to see if they can test or replace the coprocessor, especially if the system you are using is one that has been sold as compatible with the coprocessor card.

As far as getting more information, I haven't found anyone yet who can guide me in reading the error but will expand my search - even if no one on the coprocessor team can interpret the error for me, someone on the Intel® Xeon® processor team should be able to. Right now I am kind of leaning toward a memory mapping problem.
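As a starting point for reading the error: in the Linux machine check panic output, the 64-bit hexadecimal value printed after the bank number is the bank's status register (IA32_MCi_STATUS). The sketch below decodes only the architectural flag bits and splits out the two 16-bit error code fields, following the layout in the Intel SDM machine-check chapter. It does not interpret the error codes themselves, and the function name `decode_mci_status` is made up for this example.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, Vol. 3B,
# Machine-Check Architecture).
FLAG_BITS = {
    "VAL": 63,    # register holds valid error information
    "OVER": 62,   # a previous error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # the bank's MISC register is valid
    "ADDRV": 58,  # the bank's ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Status value from the "bank 20" line of the panic output in this thread.
decoded = decode_mci_status(0xBE200000000C110A)
for name, is_set in decoded["flags"].items():
    print(f"{name:5s} = {int(is_set)}")
print(f"MCA error code:      0x{decoded['mca_error_code']:04x}")
print(f"Model-specific code: 0x{decoded['model_specific_code']:04x}")
```

For this value the UC and PCC bits are both set, which agrees with the "processor context corrupt" line in the panic output. The flags alone do not say which device triggered the error.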

In the meantime, you said, if I understood correctly, that when you rebooted the machine with the card in after having tried to load the mic kernel module once, it didn't get as far as having CentOS completely booted. Did you have the mic kernel module set to load on boot? Or are you saying that you went through the following sequence:

  1. insert coprocessor, boot host, load mic kernel module, panic
  2. leave coprocessor in, boot host with no module load attempt, boot failed (blank screen or error message?)
  3. pull coprocessor, boot host, everything ok

The information you provided is everything that is in the kernel log, right? You loaded the kernel module with modprobe, so you won't have an mpssd log, but there may be something useful in the messages log. Do you have a messages log that covers booting the host before installing the MPSS, attempting to load the mic kernel module, and attempting to boot with the coprocessor installed when the boot failed? That might provide some clues to what is going on.
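One way to pull the relevant entries out of a long messages log is to filter it for lines that mention the mic driver or machine checks. The sketch below is only an illustration: the keyword pattern is a guess at what might be relevant, the helper names are made up, and `/var/log/messages` is the usual syslog location on CentOS 6 (reading it normally requires root).

```python
import re

# Whole-word match so that words like "dynamic" or "microcode" are skipped.
# The keyword choice is an assumption; widen it as needed.
PATTERN = re.compile(r"\b(mic\d*|mce|machine check)\b", re.IGNORECASE)


def filter_log_lines(lines, pattern=PATTERN):
    """Return the lines that match the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def filter_log_file(path="/var/log/messages"):
    """Filter a syslog-style file; undecodable bytes are replaced."""
    with open(path, errors="replace") as handle:
        return filter_log_lines(handle)


# Example call (run as root on the host):
#   for line in filter_log_file():
#       print(line)
```

Since the failed boots panic early, they may leave nothing in the log at all; the entries from the boot before the first modprobe attempt are the ones most likely to be present.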

 

victor_l_1
Beginner

I went through the sequence that you mentioned:

1. insert coprocessor, boot host, load mic kernel module, screen froze after I entered the modprobe command.

2. leave coprocessor in, boot host normally, boot failed with the error I showed in my first comment.

3. pull coprocessor, boot host, everything ok.

I don't know if those errors are from the kernel log. They are just the error messages shown at the failed boot.

Where can I find the mpssd log that you mentioned?

 

victor_l_1
Beginner

I finally figured out what the problem was: I didn't have both the 8-pin and 6-pin power cables connected. With that fixed, I was able to get past modprobe mic, but I ran into another problem: I wasn't able to upload the flash image to the card. At first I got the error micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1). My Phi is C0 stepping. After a lot of Google searching and diagnostics, I ended up finding the issue described in the Flash Issues & Remedies PDF. I noticed that I also have issue 3 on my Phi: the SMC's blue LED is static. It seems that there is a hardware issue. Should the SMC's blue LED at least flash when the host PC powers up? Mine never flashes at all. If that is the case, I would feel more comfortable returning it to Intel instead of digging further into it.

Frances_R_Intel
Employee

In practical terms, for the machines we run in our group's lab, we just live with the message about the SMC buffer size being exceeded. (I can hear the developers for the firmware on the coprocessor keeling over from heart attacks as I say that.) The cards we have installed are just confused about the buffer size. If the cause is the SMC being hung, that is a problem. If you try power cycling the host to resolve that, make sure you actually pull the plug, then plug it back in.

In either case, you can talk to your supplier about getting a replacement card. They can tell you how to go about that. One advantage of returning your current card is that it can be double checked to make sure the problem is truly in the card. 
