Problems with new Xeon Phi

kankamuso
Beginner
676 views

Dear all,

I just received my brand new computer with an Intel Xeon Phi. I followed the instructions in the MPSS readme and boot files, but the card behaves erratically. Sometimes it enters the "reset failed" state, sometimes not. It takes a long time (if it completes at all) to reset using:

[jrbcast@localhost ~]$ sudo micctrl -r

Also, micinfo reports almost nothing while the card is in the READY state, even with the MPSS service started, as shown:

[jrbcast@localhost ~]$ /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Sun Mar 3 18:49:54 2013


System Info
Host OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : NotAvailable
MPSS Version : 2.1.4982-15
Host Physical Memory : 32829 MB
CPU Family : GenuineIntel Family 6 Model 45 Stepping 7
CPU Speed : 2001.000
Threads per Core : 2


*************************** The information below is not complete **************************
****** Please start the MPSS service and run MicInfo again to view the entire output *******


Device No: 0, Device Name: Intel(R) Xeon Phi(TM) Coprocessor

Version
Flash Version : NotAvailable
UOS Version : NotAvailable
Device Serial Number : NotAvailable

Board
Vendor ID : 8086
Device ID : 2250
SubSystem ID : 2500
Coprocessor Stepping ID : 3
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : NotAvailable
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable

Core
Voltage : NotAvailable
Frequency : NotAvailable

Thermal
Fan Speed Control : NotAvailable
SMC Firmware Version : NotAvailable
FSC Strap : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable

GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable

I only get the full output when the card is in the ONLINE state, after the MPSS service is started:

[jrbcast@localhost ~]$ sudo service mpss start
Starting MPSS Stack: [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
[jrbcast@localhost ~]$ /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Sun Mar 3 18:50:47 2013


System Info
Host OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : 4982-15
MPSS Version : 2.1.4982-15
Host Physical Memory : 32829 MB
CPU Family : GenuineIntel Family 6 Model 45 Stepping 7
CPU Speed : 1200.000
Threads per Core : 2


Device No: 0, Device Name: Intel(R) Xeon Phi(TM) Coprocessor

Version
Flash Version : 2.1.05.0375
UOS Version : 2.6.38.8-g32944d0
Device Serial Number : ADKC25003311

Board
Vendor ID : 8086
Device ID : 2250
SubSystem ID : 2500
Coprocessor Stepping ID : 3
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Core
Total No of Active Cores : 60
Voltage : 1032000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
SMC Firmware Version : 1.7.4172
FSC Strap : 14 MHz
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 81 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1000000 uV
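To compare the two micinfo runs above without reading every line, the unreadable fields can be pulled out programmatically. The following is a minimal sketch, not part of MPSS: the function name `unreadable_fields` is hypothetical, and it assumes the `name : value` layout and the two marker strings exactly as they appear in the output quoted in this post.

```python
# Values that signal micinfo could not read a field, copied from the output above.
UNREADABLE = {"NotAvailable", "Insufficient Privileges"}

def unreadable_fields(micinfo_text):
    """Return (field, value) pairs whose value signals missing data."""
    found = []
    for line in micinfo_text.splitlines():
        # Split on the first " : " so that values containing colons stay intact.
        name, sep, value = line.partition(" : ")
        if sep and value.strip() in UNREADABLE:
            found.append((name.strip(), value.strip()))
    return found

# A few lines taken from the two runs quoted above.
sample = """\
Driver Version : NotAvailable
MPSS Version : 2.1.4982-15
PCIe Width : Insufficient Privileges
Die Temp : 81 C
"""

for name, value in unreadable_fields(sample):
    print(f"{name}: {value}")
```

In both runs above the PCIe fields report Insufficient Privileges, and micinfo was invoked without sudo both times. Whether running it as root fills them in is not confirmed in this thread.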

Also, the micsmc command only shows that the cards are disconnected! I attach a file with the dmesg output and the full command sequence I followed to write this post.

In another thread I read that these POST codes (obtained from dmesg and listed below) mean something is wrong with the memory. Should I RMA the card?

MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
lo: Disabled Privacy Extensions
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
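A quick way to see how long the card sits on each POST code is to tally the reset messages from a saved dmesg dump. This is only a sketch: the function name `count_post_codes` is hypothetical, and the regular expression assumes the message format exactly as quoted above.

```python
import re
from collections import Counter

# Matches the driver messages quoted above, e.g. "MIC 0 Resetting (Post Code 3d)".
POST_LINE = re.compile(r"MIC (\d+) Resetting \(Post Code ([0-9A-Fa-f]{2})\)")

def count_post_codes(dmesg_text):
    """Count reset messages per (card index, upper-cased POST code)."""
    counts = Counter()
    for match in POST_LINE.finditer(dmesg_text):
        card = int(match.group(1))
        code = match.group(2).upper()
        counts[(card, code)] += 1
    return counts

# The excerpt from this post: five lines at 3d, one unrelated line, six lines at 3E.
excerpt = (
    "MIC 0 Resetting (Post Code 3d)\n" * 5
    + "lo: Disabled Privacy Extensions\n"
    + "MIC 0 Resetting (Post Code 3E)\n" * 6
)

for (card, code), n in sorted(count_post_codes(excerpt).items()):
    print(f"mic{card} post code {code}: {n} messages")
```

The tally only shows how often each code was logged. It says nothing about what a code means, so it cannot by itself confirm a memory fault.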

Thanks in advance,

Jose 

2 Replies
Kevin_D_Intel
Employee
676 views

Hi Jose, I replied earlier to your post in the related thread (http://software.intel.com/en-us/forums/topic/366599), but it is more appropriate to continue the discussion here.

I looked through this post and the attachment and see nothing to suggest the card was malfunctioning, at least at the time the posted information was collected. The engineer I asked indicated that 3E is a normal state the card enters during reset, and that the time spent in that state varies between cards; however, from your description it sounds plausible that there is a HW issue with the card.

If you purchased your system through an OEM, contact them to determine whether a replacement card can or should be provided.

I realize that is not much help, but I hope it helps some.

kankamuso
Beginner
676 views

Thanks Kevin,

I forgot to mention that I am running CentOS 6.3 with the original kernel, no updates. Nevertheless, after waiting more than 300 seconds for a restart and still not being able to query the card through micinfo, I am pretty sure something is not working properly. I have already contacted my supplier. Let's see how this develops.
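For anyone who wants to measure the reset time rather than watch the clock, here is a rough polling sketch with a timeout. Everything in it is an assumption made for illustration: the helper names `wait_for_state` and `read_sysfs_state` are hypothetical, the sysfs path is where I would expect the driver to expose the card state, and the exact state strings may differ between MPSS versions.

```python
import time

def wait_for_state(read_state, wanted, timeout_s=300.0, poll_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll read_state() until it returns a state in `wanted` or timeout_s elapses.

    Returns (state, elapsed_seconds), where state is the last value read.
    """
    start = clock()
    while True:
        state = read_state()
        elapsed = clock() - start
        if state in wanted or elapsed >= timeout_s:
            return state, elapsed
        sleep(poll_s)

def read_sysfs_state(path="/sys/class/mic/mic0/state"):
    # Assumed location of the card state; adjust to wherever your MPSS version exposes it.
    with open(path) as f:
        return f.read().strip()

# Example call on a host with the card installed (state names are assumptions):
#   state, elapsed = wait_for_state(read_sysfs_state, {"ready", "online"})
#   print(f"last state {state!r} after {elapsed:.0f} s")
```

The clock and sleep functions are passed in as parameters so the loop can be exercised without real hardware or real waiting.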

Best,

Jose
