Hi,
I am using the Intel MPI Benchmarks to evaluate my Xeon Phi cards (5110P). In the PingPong test, all of my cards work well except one. That card fails and automatically reboots when the PingPong message size reaches 1G, which crashes the benchmark. Using micctrl -s to check the card, I observed that it first became "lost", then "rebooting", and finally "online" again.
Any ideas?
Thanks!
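To pin down exactly when the card drops out relative to the benchmark, the state changes reported by micctrl -s can be logged with timestamps. This is a minimal sketch, assuming each status line looks like `mic0: online (...)`; that layout, the helper names, and the one-second polling interval are illustrative assumptions, not something confirmed in this thread.

```python
import subprocess
import time


def parse_states(output):
    """Map card name to its state word, e.g. {'mic0': 'online'}.

    Assumes each status line looks like 'mic0: online (...)'. This layout
    is an assumption; adjust it to what your MPSS version prints.
    """
    states = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        words = rest.split()
        if sep and name.strip().startswith("mic") and words:
            states[name.strip()] = words[0]
    return states


def changed(previous, current):
    """Return {card: (old_state, new_state)} for cards whose state differs."""
    return {card: (previous.get(card), state)
            for card, state in current.items()
            if previous.get(card) != state}


def watch(interval_s=1.0):
    """Print a timestamped line whenever a card changes state (until Ctrl-C)."""
    previous = {}
    while True:
        result = subprocess.run(["micctrl", "-s"],
                                capture_output=True, text=True)
        current = parse_states(result.stdout)
        for card, (old, new) in changed(previous, current).items():
            print(f"{time.strftime('%H:%M:%S')} {card}: {old} -> {new}")
        previous = current
        time.sleep(interval_s)
```

Calling `watch()` on the host in a second terminal while the PingPong test runs would give a timeline of the "lost", "rebooting", and "online" transitions to compare against the message size at which the benchmark fails.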
Hi JS,
Which MPSS version, Intel compiler version, and Intel MPI library are you using? Also, could you attach the output of the micinfo command for the coprocessor that has the problem? Thank you.
In reply to loc-nguyen (Intel):
Thanks for replying. I am using MPSS 3.1.2, icc 14.0.0, and impi 4.1.1.036.
The micinfo output is attached. There are two MICs per node; the problematic card is mic1.
The only differences I observe (other than the serial number) are the core voltages (mic0: 1013000 uV, mic1: 949000 uV) and the temperatures (mic0: 61C, mic1: 51C).
You may have a thermal condition.
What happens to the temperatures of each card as you run the test?
What happens if you run the PingPong test starting at 1G after the temperatures settle?
Jim Dempsey
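A comparison like this can be made repeatable with a small script that diffs the saved micinfo text of two cards. The sketch below assumes fields appear as `name : value` lines under section headings that contain no colon; that layout is an assumption, and the sample text is illustrative only (the numbers are the ones quoted in this thread, the section and field names are placeholders).

```python
def parse_fields(text):
    """Collect 'name : value' lines into {(section, name): value}.

    A line without a colon is treated as a section heading. This layout
    is an assumption about saved micinfo output; adjust it as needed.
    """
    fields = {}
    section = ""
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(":")
        if not sep:
            section = line.strip()
        else:
            fields[(section, name.strip())] = value.strip()
    return fields


def diff_fields(text_a, text_b):
    """Return {(section, name): (value_a, value_b)} for fields that differ."""
    a, b = parse_fields(text_a), parse_fields(text_b)
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}


# Illustrative text only: the numbers are the ones quoted in this thread,
# the section and field names are placeholders.
MIC0 = """Cores
    Voltage : 1013000 uV
    Placeholder Field : same
Thermal
    Die Temp : 61 C
"""
MIC1 = """Cores
    Voltage : 949000 uV
    Placeholder Field : same
Thermal
    Die Temp : 51 C
"""

for (section, name), (v0, v1) in diff_fields(MIC0, MIC1).items():
    print(f"{section} / {name}: mic0={v0}  mic1={v1}")
```

On real output the serial number and similar per-card identifiers would also show up as differences, so those entries would need to be ignored by eye or filtered out.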
In reply to jimdempseyatthecove:
Thanks for the reply. The temperatures of the cards increase slightly as I run the test. The settled temperatures while running the PingPong test are roughly mic0: 65C, mic1: 54C. I observed similar temperatures in other nodes (the MIC cards there behave normally in the PingPong test), so I am not sure this is a thermal condition.
And when you ran the test starting at 1G, what happened?
Did it crash right away, or some time after running?
Can you run PingPong between the host and each MIC individually? Does the problem show up there as well?
I haven't run the status reports, but table 6-23 of the Xeon Phi data sheet shows various status bits. One of the MIC utilities should have a log of errors and/or status, assuming mic1 can report its status before resetting. Of particular interest is P12V_UVOV.
You may have a power supply issue.
Try swapping the power cables between cards. If the problem shifts to the other card, suspect insufficient capacity on the cables connected to the failing card. What video card do you have? Are the video card and the failing MIC on the same set of power cables?
Jim Dempsey
In reply to jimdempseyatthecove:
It fails after the 1G test has run for a while. The PingPong test between the host and the problematic MIC fails in a similar fashion.
I swapped the power cords, but the problem stayed with mic1. It looks like the memory usage of mic1 cannot go beyond 2000MB according to micsmc-gui. The power cable is used exclusively by the MIC.
The next step is to power off the system, remove MIC1, and reinstall it. On my system box and motherboard (dual 5120Ps installed in an ASUS P9X79-WS), I noticed that the seating of the lower-mounted MIC was problematic. You may be seeing a similar issue. After reseating, verify that the PCIe latch/lock fully closes, and examine the seating of the MIC in the PCIe slot on the non-latch side too.
I had an additional issue with the motherboard being mounted slightly too deep in the case. This caused the ear bracket with the screw slot to hold the card slightly short of being fully seated in the slot. To fix this, I added a washer under the standoffs on the bottom of the board. I have a tower, so the "bottom" is the side away from the CPU.
Note that if some of the PCIe contacts are not making contact (but most are), you may have addressing errors while everything else seems OK.
Jim Dempsey
In reply to jimdempseyatthecove:
Thanks for the continued help! I reseated the cards and even swapped them. The problematic card (now detected as mic0) still failed, so I no longer think it is a PCIe seating issue.
micras is the name of the service; micrasd is the name of the daemon started by the service. So the command:
service micras start
starts the daemon logging messages to /var/log/micras.log on the host. When started like this, the daemon starts in maintenance mode. To start the daemon without maintenance mode, you will need to start it by hand. I'm afraid I don't know much about maintenance mode. There is additional information on the RAS system in section 3.3 of the Intel® Xeon Phi™ Coprocessor System Software Developers Guide https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide. It's a very long section, I'm afraid.
The lower core voltage on the card that is having problems is kind of suspicious. It would have been really nice if reseating the card had upped the voltage to the same level as the other cards. If looking at the micras log doesn't tell you anything (and even if it does), it may be time to submit the problem to support.
In reply to Frances Roth (Intel):
Thanks for your reply. Reseating MICs didn't bring the voltage of the problematic MIC to a higher level.
jimdempseyatthecove wrote:
I am sorry we haven't resolved your issue. Have you run the micras service? I haven't used it myself, so you will have to read the manual. BTW, the Intel Xeon Phi software configuration user's guide lists both micras and micrasd; I do not know if this is a typo, two utilities, or a name change. Combined with micras/micrasd is crashmgr.
Running micras, or micras plus crashmgr, while running the PingPong test that fails may yield some insight into what is happening inside the failing MIC.
Additional note:
micras (micrasd) has a maintenance mode. The user guide has skimpy documentation on what this does and how to use it; it says only that this forces the card into maintenance test and repair mode. Before you run it, I suggest you get what reports you can and search the intel.com site for additional information on micras/micrasd and the maintenance option.
Jim Dempsey
Thanks for the suggestion! I will take a shot at micras.
As Frances previously mentioned, this may be something that ultimately needs to be reported to the OEM who provided you the coprocessors and system housing them.
Since you are able to reproduce the problem with the problematic card, I suggest collecting the following data:
- As mentioned previously, the micras service may be able to detect potential issues, which would be logged in /var/log/micras.log.
- Before you start your PingPong test, I also recommend capturing the kernel log buffer for the problematic coprocessor, as described in this article: https://software.intel.com/en-us/blogs/2013/06/05/collecting-debug-data-when-running-with-intelr-xeon-phitm-coprocessor
To reiterate those steps (assuming the problematic coprocessor is mic1), first disable the watchdog:
echo 0 > /sys/class/mic/scif/watchdog_enabled
Then use the following steps to capture the micro-OS kernel log buffer.
Mount debugfs on the host:
mount -t debugfs none /sys/kernel/debug
Dump the buffer (shows the contents of the buffer up until now):
cat /sys/kernel/debug/mic_debug/mic1/log_buf > <some file of your choice>
Follow the buffer (collects any recent and new data as things run, and also prints it to STDOUT):
sudo tail -f /sys/kernel/debug/mic_debug/mic1/log_buf | tee -a <some file of your choice>
Also, collecting the 'micdebug.sh' output would be useful.
Once you have all these, please feel free to attach them to this forum thread and we'll look at them for clues on what's happening
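When following the log buffer while the benchmark runs, it can help to prefix each line with the host time so that log entries can be lined up against the moment the PingPong test fails. This is a minimal sketch of such a filter; the function name and the timestamp format are illustrative choices, not part of any Intel tool.

```python
import time


def stamp_lines(lines, clock=time.time):
    """Yield each input line prefixed with the host time in seconds.

    `lines` can be any iterable of strings, such as sys.stdin.
    `clock` is injectable so the function can be checked without real time.
    """
    for line in lines:
        yield f"{clock():.3f} {line.rstrip()}"
```

To use it as a filter, save it in a script (say a hypothetical stamp.py) that ends with a loop printing each result of `stamp_lines(sys.stdin)` with `flush=True`, and place that script between the `tail -f` and `tee -a` commands shown above.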
