I'm trying to set up a new system:
SuperMicro 5018GR-T
2 Intel Xeon Phis:
Coprocessor Stepping : B1
Board SKU : B1PRQ-31S1P
MPSS 3.5 and Scientific Linux 7.1
# micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic1: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic0: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader update started
mic0: SMC boot-loader update started
mic1: SMC boot-loader update done
mic1: Transitioning to ready state
mic0: SMC boot-loader update done
mic0: Transitioning to ready state
mic1: Flash update started
mic1: Flash update done
mic1: SMC update started
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
mic1: SMC update done
mic1: Transitioning to ready state
mic0: SMC update done
mic0: Transitioning to ready state
Please restart host for flash changes to take effect
MPSS starts up fine, but at some point I lose a mic:
/var/log/messages:
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1445 node 1
May 21 15:15:44 smmic1 kernel: Warning: Core image elf header not found
May 21 15:15:44 smmic1 kernel: Kdump: vmcore not initialized
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1457 node 1 crash dump failed status -22
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000
May 21 15:15:44 smmic1 kernel: 00000000482cfbb1
May 21 15:15:44 smmic1 kernel: ffff8810053b7b30
May 21 15:15:44 smmic1 kernel: ffffffff81603f36
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68
May 21 15:15:44 smmic1 kernel: ffffffff8106e28b
May 21 15:15:44 smmic1 kernel: 0000000027be7000
May 21 15:15:44 smmic1 kernel: 0000000000001000
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: 0000001027be7000
May 21 15:15:44 smmic1 kernel: ffff881028255000
May 21 15:15:44 smmic1 kernel: 0000000000000000
May 21 15:15:44 smmic1 kernel: ffff8810053b7b78
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8160f294>] ? __do_page_fault+0x204/0x520
May 21 15:15:44 smmic1 kernel: mic0: Transition from state online to lost
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1472 stopping node 1 to recover lost node!
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757832 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x00001027be7000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775807 caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF W IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 000000002257e000 0000000000001000
May 21 15:15:44 smmic1 kernel: 000000102257e000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757833 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x0000102257e000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775806 caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF W IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 0000000022578000 0000000000001000
May 21 15:15:44 smmic1 kernel: 0000001022578000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757834 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x00001022578000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775805 caller ffffffffa02d2f38 Lost Node??
/var/log/mpssd:
Thu May 21 15:11:46 2015: MPSS Daemon start
Thu May 21 15:11:47 2015: mic1: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic0: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic1: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic1: Generate /var/mpss/mic1.image.gz
Thu May 21 15:11:47 2015: mic0: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic0: Generate /var/mpss/mic0.image.gz
Thu May 21 15:11:50 2015: mic0: State ready -> booting
Thu May 21 15:11:50 2015: mic0: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic0.image.gz
Thu May 21 15:11:52 2015: mic1: State ready -> booting
Thu May 21 15:11:52 2015: mic1: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic1.image.gz
Thu May 21 15:12:15 2015: mic1: Monitor connection established
Thu May 21 15:12:16 2015: mic0: Monitor connection established
Thu May 21 15:12:16 2015: mic1: State booting -> online
Thu May 21 15:12:17 2015: mic0: State booting -> online
Thu May 21 15:15:44 2015: mic0: State online -> lost
Thu May 21 15:15:44 2015: mic0: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic0 failed: No such file or directory
Thu May 21 15:16:14 2015: mic1: State online -> lost
Thu May 21 15:16:14 2015: mic1: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic1 failed: No such file or directory
Thu May 21 15:17:29 2015: mic0: State lost -> resetting
Thu May 21 15:17:29 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: State resetting -> reset failed
Thu May 21 15:17:33 2015: mic0: [SaveCrashDump] Failed to reset card. Aborting reboot
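For anyone wanting to see how long each card stayed up before it was lost, here is a minimal Python sketch that pulls the "State X -> Y" lines out of an mpssd log and reports the seconds between "online" and the next transition away from it. It assumes the line layout shown in the log above and an English/C locale for the day and month names; the function names and the SAMPLE list are only illustrative and are not part of MPSS.

```python
import re
from datetime import datetime

# Matches lines such as: "Thu May 21 15:15:44 2015: mic0: State online -> lost"
STATE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): "
    r"(?P<card>mic\d+): State (?P<old>.+?) -> (?P<new>.+)$"
)


def parse_transitions(lines):
    """Return (timestamp, card, old_state, new_state) for every state line."""
    transitions = []
    for line in lines:
        m = STATE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            transitions.append((ts, m.group("card"), m.group("old"), m.group("new")))
    return transitions


def seconds_online(transitions):
    """Per card: seconds between entering 'online' and leaving it."""
    went_online = {}
    result = {}
    for ts, card, old, new in transitions:
        if new == "online":
            went_online[card] = ts
        elif old == "online" and card in went_online:
            result[card] = (ts - went_online.pop(card)).total_seconds()
    return result


# A few lines copied from the mpssd log in this post.
SAMPLE = [
    "Thu May 21 15:11:46 2015: MPSS Daemon start",
    "Thu May 21 15:12:16 2015: mic1: State booting -> online",
    "Thu May 21 15:12:17 2015: mic0: State booting -> online",
    "Thu May 21 15:15:44 2015: mic0: State online -> lost",
    "Thu May 21 15:15:44 2015: mic0: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic0 failed: No such file or directory",
    "Thu May 21 15:16:14 2015: mic1: State online -> lost",
]

if __name__ == "__main__":
    print(seconds_online(parse_transitions(SAMPLE)))
```

On the sample lines this gives 207 seconds for mic0 and 238 seconds for mic1, i.e. both cards dropped out roughly three and a half to four minutes after coming online. To run it against the real file, pass `open("/var/log/mpssd")` instead of SAMPLE.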
Turns out that this was a BIOS issue. Updated BIOS and IPMI and now the fans are running better. One card still gets to 90C. Not sure what else can be done though. How bad is it to run that hot?
I've encountered similar problems with a Supermicro server: the Phi would reach 90 C when idle, and as soon as you put any kind of load on it, it would reach 98 C and shut down.
My solution was to turn the fans on permanently (using ipmitool; the setting is persistent). The card now stays at ~47 C.
Orion,
When you say the card gets up to 90C, do you mean that it sometimes gets that hot when it is running a large job, or that it gets up to 90C when it is idle? If the latter, you really should look at improving the cooling.
You said that you updated the IPMI in addition to the BIOS. Does this imply that you are monitoring the card temps and using that as input to control the fans?
No, it idles in the 60s. But it does get to 90C when running full out. The new IPMI software does seem to do a better job of speeding up the fans on the cards when they heat up. However, turning the fans on "full" is not an option: I can hear the machine in my office down the hall from the server room when they run like that.
If it peaks at 90C then, while not ideal, I believe it is acceptable. Looking at the data sheet (https://www-ssl.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-datasheet.html), the fan speed should increase at around 82C, ideally holding the coprocessor near that temperature while it is busy.
I don't know what else can be done, given that running the fans full out all the time is unacceptable. If you are feeling ambitious, you can try swapping the cards between the two slots to make sure that the high temperature follows the slot, not the card. If it does, you might want to talk to Supermicro to see if there is anything that can be done to direct more air through the card in that particular slot or lower the temperature of the air entering the card.
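To make the swap test concrete, here is a small Python sketch of the comparison. It assumes you note the peak temperature of each card under the same load before and after swapping them. The slot names, card labels, temperatures, and the 3 C minimum spread are all hypothetical values chosen for illustration, not measurements from this thread.

```python
def hottest(readings):
    """readings maps slot name -> (card label, peak temperature in C)."""
    slot = max(readings, key=lambda s: readings[s][1])
    card, temp = readings[slot]
    return slot, card, temp


def spread(readings):
    """Difference between the hottest and coolest reading."""
    temps = [temp for _, temp in readings.values()]
    return max(temps) - min(temps)


def heat_follows(before, after, min_spread_c=3.0):
    """Return 'slot', 'card', or 'inconclusive' for a two-card swap test."""
    # If the cards are within a few degrees of each other, the test says little.
    if spread(before) < min_spread_c or spread(after) < min_spread_c:
        return "inconclusive"
    slot_b, card_b, _ = hottest(before)
    slot_a, card_a, _ = hottest(after)
    if slot_a == slot_b and card_a != card_b:
        return "slot"   # same slot is hot with a different card in it
    if card_a == card_b and slot_a != slot_b:
        return "card"   # same card is hot after moving to the other slot
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical readings: card "A" starts in slot1, card "B" in slot2.
    before = {"slot1": ("A", 90.0), "slot2": ("B", 78.0)}
    after = {"slot1": ("B", 89.0), "slot2": ("A", 79.0)}
    print(heat_follows(before, after))
```

With these made-up numbers slot1 is the hot one both times, so the result points at the slot's airflow rather than at a particular card, which is the case where talking to Supermicro makes sense.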