
MICs appear to crash

Orion_P_
New Contributor I
663 Views

I'm trying to set up a new system:

SuperMicro 5018GR-T

2 Intel Xeon Phis:

                Coprocessor Stepping     : B1
                Board SKU                : B1PRQ-31S1P

MPSS 3.5 and Scientific Linux 7.1

 

# micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic1: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic0: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader update started
mic0: SMC boot-loader update started
mic1: SMC boot-loader update done
mic1: Transitioning to ready state
mic0: SMC boot-loader update done
mic0: Transitioning to ready state
mic1: Flash update started
mic1: Flash update done
mic1: SMC update started
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
mic1: SMC update done
mic1: Transitioning to ready state
mic0: SMC update done
mic0: Transitioning to ready state

Please restart host for flash changes to take effect

MPSS starts up fine, but at some point I lose a MIC:

/var/log/messages:

May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1445 node 1
May 21 15:15:44 smmic1 kernel: Warning: Core image elf header not found
May 21 15:15:44 smmic1 kernel: Kdump: vmcore not initialized
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1457 node 1 crash dump failed status -22
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF         IO--------------   3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 0000000027be7000 0000000000001000
May 21 15:15:44 smmic1 kernel: 0000001027be7000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8160f294>] ? __do_page_fault+0x204/0x520
May 21 15:15:44 smmic1 kernel: mic0: Transition from state online to lost
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1472 stopping node 1 to recover lost node!
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757832 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0
    addr 0x00001027be7000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775807  caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF       W IO--------------   3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 000000002257e000 0000000000001000
May 21 15:15:44 smmic1 kernel: 000000102257e000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757833 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0
    addr 0x0000102257e000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775806  caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF       W IO--------------   3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 0000000022578000 0000000000001000
May 21 15:15:44 smmic1 kernel: 0000001022578000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757834 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0
    addr 0x00001022578000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775805  caller ffffffffa02d2f38 Lost Node??
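
For anyone reading the trace above: the `error -12` and `status -22` values look like negated Linux errno codes, which is the usual kernel convention. A minimal Python sketch for decoding them follows. It assumes it is run on a Linux host, since Python's errno table reflects the platform it runs on, and `describe_kernel_status` is a hypothetical helper name, not part of MPSS.

```python
import errno
import os


def describe_kernel_status(status: int) -> str:
    """Map a negative kernel return value (e.g. -12) to its errno name and message."""
    code = abs(status)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{status}: {name} ({os.strerror(code)})"


# The two values that appear in the log excerpt above.
for status in (-12, -22):
    print(describe_kernel_status(status))
```

On Linux, 12 is ENOMEM and 22 is EINVAL, so the `mic_map` failures would read as allocation or mapping failures and the crash dump failure as an invalid argument. That only names the symptom; it does not identify the root cause.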

 

/var/log/mpssd:

Thu May 21 15:11:46 2015: MPSS Daemon start
Thu May 21 15:11:47 2015: mic1: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic0: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic1: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic1: Generate /var/mpss/mic1.image.gz
Thu May 21 15:11:47 2015: mic0: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic0: Generate /var/mpss/mic0.image.gz
Thu May 21 15:11:50 2015: mic0: State ready -> booting
Thu May 21 15:11:50 2015: mic0: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic0.image.gz
Thu May 21 15:11:52 2015: mic1: State ready -> booting
Thu May 21 15:11:52 2015: mic1: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic1.image.gz
Thu May 21 15:12:15 2015: mic1: Monitor connection established
Thu May 21 15:12:16 2015: mic0: Monitor connection established
Thu May 21 15:12:16 2015: mic1: State booting -> online
Thu May 21 15:12:17 2015: mic0: State booting -> online
Thu May 21 15:15:44 2015: mic0: State online -> lost
Thu May 21 15:15:44 2015: mic0: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic0 failed: No such file or directory
Thu May 21 15:16:14 2015: mic1: State online -> lost
Thu May 21 15:16:14 2015: mic1: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic1 failed: No such file or directory
Thu May 21 15:17:29 2015: mic0: State lost -> resetting
Thu May 21 15:17:29 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: State resetting -> reset failed
Thu May 21 15:17:33 2015: mic0: [SaveCrashDump] Failed to reset card.  Aborting reboot
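
The per-card state changes are easier to follow when pulled out of the rest of the mpssd output. The sketch below is one way to do that in Python. It assumes the line layout is exactly as in the excerpt above (`<timestamp>: micN: State <old> -> <new>`); the regular expression and the `state_transitions` helper are illustrative and not part of MPSS.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Thu May 21 15:15:44 2015: mic0: State online -> lost
STATE_RE = re.compile(
    r"^(?P<ts>.+?): (?P<card>mic\d+): State (?P<old>.+?) -> (?P<new>.+)$"
)


def state_transitions(lines):
    """Return {card: [(timestamp, old_state, new_state), ...]} in log order."""
    history = defaultdict(list)
    for line in lines:
        match = STATE_RE.match(line.strip())
        if match:
            history[match["card"]].append(
                (match["ts"], match["old"], match["new"])
            )
    return dict(history)


# A few lines copied from the log above.
SAMPLE = [
    "Thu May 21 15:12:17 2015: mic0: State booting -> online",
    "Thu May 21 15:15:44 2015: mic0: State online -> lost",
    "Thu May 21 15:15:44 2015: mic0: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic0 failed: No such file or directory",
    "Thu May 21 15:16:14 2015: mic1: State online -> lost",
    "Thu May 21 15:17:31 2015: mic0: State resetting -> reset failed",
]

for card, events in state_transitions(SAMPLE).items():
    for ts, old, new in events:
        print(f"{card}: {old} -> {new} at {ts}")
```

To run it against the real file, pass an open handle, for example `state_transitions(open("/var/log/mpssd"))`.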

 

5 Replies
Orion_P_
New Contributor I

Turns out that this was a BIOS issue.  Updated BIOS and IPMI and now the fans are running better.   One card still gets to 90C.  Not sure what else can be done though.  How bad is it to run that hot?

JJK
New Contributor III

I've encountered similar problems with a Supermicro server: the Phi would reach 90 C when idle, and as soon as you put any kind of load on it, it would reach 98 C and shut down.

My solution was to turn the fans on permanently (using ipmitool; the setting is persistent). The card now stays at about 47 C.

 

 

Frances_R_Intel
Employee

Orion,

When you say the card gets up to 90C, do you mean that it sometimes gets that hot when it is running a large job, or that it gets up to 90C when it is idle? If the latter, you really should look at improving the cooling.

You said that you updated the IPMI in addition to the BIOS. Does this imply that you are monitoring the card temps and using that as input to control the fans?

Orion_P_
New Contributor I

No, it idles in the 60s.  But it does get to 90C when running full out.  The new IPMI software does seem to do a better job of speeding up the fans on the cards when they heat up.  However, turning the fans on "full" is not an option - I can then hear the machine in my office down the hall from the server room.

Frances_R_Intel
Employee

If it peaks at 90C then, while not ideal, I believe it is acceptable. Looking at the data sheet (https://www-ssl.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-datasheet.html), the fan speed should increase at around 82C, ideally holding the coprocessor near that temperature while it is busy.

I don't know what else can be done, given that running the fans full out all the time is unacceptable. If you are feeling ambitious, you can try swapping the cards between the two slots to make sure that the high temperature follows the slot, not the card. If it does, you might want to talk to Supermicro to see if there is anything that can be done to direct more air through the card in that particular slot or lower the temperature of the air entering the card.
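
The swap test above comes down to a small comparison, sketched here in Python. The slot labels, card labels, and temperature readings are made-up placeholders and `diagnose_swap` is a hypothetical helper; substitute whatever peak temperatures you actually record under the same load before and after swapping the cards.

```python
def hottest(readings):
    """Return (slot, card) for the highest peak temperature in one run."""
    slot = max(readings, key=lambda s: readings[s][1])
    return slot, readings[slot][0]


def diagnose_swap(before, after):
    """Say whether the hot spot stayed with the slot or moved with the card.

    Each argument maps slot name -> (card label, peak temperature in C).
    """
    slot_before, card_before = hottest(before)
    slot_after, card_after = hottest(after)
    same_slot = slot_before == slot_after
    same_card = card_before == card_after
    if same_slot and not same_card:
        return "slot"   # airflow to that slot is the likely culprit
    if same_card and not same_slot:
        return "card"   # the card itself runs hot wherever it sits
    return "inconclusive"


# Placeholder readings: the same slot is hottest with either card in it.
before = {"slot1": ("cardA", 90), "slot2": ("cardB", 78)}
after = {"slot1": ("cardB", 89), "slot2": ("cardA", 79)}
print(diagnose_swap(before, after))
```

A "slot" result is the case where it makes sense to ask Supermicro about airflow to that slot; a "card" result points at the card itself.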
