SMBus indicating busy and the R/W transactions are failing on Intel x86-64 based Broadwell-DE board running Wind River Linux 9.

pchik3 · ‎02-04-2020

Summary : SMBus indicating busy and the R/W transactions are failing

Board : Intel Broadwell-DE.

Host OS and version :

# uname -a

4.8.28-rt10-WR9.0.0.20_ovp #1 SMP Thu Jul 18 10:08:49 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

dmesg snippet :

2019-10-13T07:34:52.276369+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: Transaction timeout

2019-10-13T07:34:52.278340+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: Failed terminating the transaction

2019-10-13T07:34:52.278346+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: Transaction failed

2019-10-13T07:34:52.278346+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

2019-10-13T07:34:52.278347+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

2019-10-13T07:34:52.279329+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

2019-10-13T07:34:52.279337+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

2019-10-13T07:34:52.279900+00:00 _RIO_Optical_P2B_EDVT-node i2cd[7513]: I2C recovery initiated. Trigger due to non SFP device Fan board CPLD.

2019-10-13T07:34:52.280325+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

2019-10-13T07:34:52.280329+00:00 _RIO_Optical_P2B_EDVT-node kernel: i801_smbus 0000:00:1f.3: SMBus is busy, can't use it!

-> The issue is seen sporadically after multiple warm resets (reboots). Because the SMBus reports busy, the customer cannot perform any further I2C (SMBus) operations. To investigate further, we need the Intel I2C driver team to comment on the queries below.

Questions from Wind River to Intel :

1) HOST_BUSY: bit 0 of HST_STS - Host Status Register (SMBus - D31:F3)

According to the datasheet, "No SMB registers should be accessed while this bit is set, except the BLOCK DATA BYTE Register".

What happens if another register is accessed while this bit is set?

Is there any way to clear this bit and bring the SMBus controller back to a working state?
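To make raw HST_STS readings easier to discuss, here is a minimal Python sketch that names the bits set in a status byte. The positions of HOST_BUSY (bit 0) and FAILED (bit 4) come from the questions in this post; the other bit names are taken from the Linux i2c-i801 driver and are an assumption that should be checked against the datasheet for this PCH. The function name `decode_hst_sts` is made up for illustration.

```python
# Bit names for the 8-bit HST_STS register.
# HOST_BUSY and FAILED positions are the ones quoted in this thread;
# the rest follow the Linux i2c-i801 driver (assumption, verify in the datasheet).
HST_STS_BITS = {
    0x01: "HOST_BUSY",
    0x02: "INTR",
    0x04: "DEV_ERR",
    0x08: "BUS_ERR",
    0x10: "FAILED",
    0x20: "SMBALERT_STS",
    0x40: "INUSE_STS",
    0x80: "BYTE_DONE",
}


def decode_hst_sts(value):
    """Return the names of the bits set in an HST_STS value, lowest bit first."""
    return [name for mask, name in sorted(HST_STS_BITS.items()) if value & mask]


print(decode_hst_sts(0x11))  # ['HOST_BUSY', 'FAILED']
```

A reading of 0x11 would therefore mean the controller reports both HOST_BUSY and FAILED at the same time.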

2) KILL: bit 1 of HST_CNT - Host Control Register (SMBus - D31:F3)

If the HOST_BUSY bit is set in HST_STS, can it be cleared by setting and then unsetting the KILL bit in HST_CNT?

What is the minimum time required between setting and unsetting the KILL bit?
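The set-then-unset KILL sequence asked about here can be sketched as follows. This is only a toy model in Python: `FakeSmbusHost` is a hypothetical in-memory stand-in, and the way it reacts to KILL (clear HOST_BUSY, set FAILED) is an assumption based on the datasheet text quoted in question 4, not verified hardware behavior. The 1 ms default hold time roughly mirrors the pause that, to my understanding, the Linux i2c-i801 driver uses before clearing KILL; it is not a datasheet minimum.

```python
import time

HST_STS_HOST_BUSY = 0x01  # bit 0 of HST_STS
HST_STS_FAILED = 0x10     # bit 4 of HST_STS
HST_CNT_KILL = 0x02       # bit 1 of HST_CNT


class FakeSmbusHost:
    """Toy in-memory model of HST_STS/HST_CNT; NOT real hardware behavior."""

    def __init__(self, stuck=False):
        self.hst_sts = HST_STS_HOST_BUSY  # start with a transaction in flight
        self.hst_cnt = 0
        self.stuck = stuck  # hypothetical flag: True models a controller that ignores KILL

    def write_hst_cnt(self, value):
        self.hst_cnt = value & 0xFF
        if (value & HST_CNT_KILL) and not self.stuck:
            # Assumed reaction to KILL: transaction aborted, FAILED reported.
            self.hst_sts = (self.hst_sts & ~HST_STS_HOST_BUSY) | HST_STS_FAILED


def kill_transaction(host, hold_seconds=0.001):
    """Set KILL, hold it, clear it, then report whether the abort appears to have worked."""
    host.write_hst_cnt(host.hst_cnt | HST_CNT_KILL)
    time.sleep(hold_seconds)  # hold time is an assumption, not a datasheet value
    host.write_hst_cnt(host.hst_cnt & ~HST_CNT_KILL)
    status = host.hst_sts
    return not (status & HST_STS_HOST_BUSY) and bool(status & HST_STS_FAILED)


print(kill_transaction(FakeSmbusHost()))            # True in this toy model
print(kill_transaction(FakeSmbusHost(stuck=True)))  # False: HOST_BUSY never clears
```

The `stuck=True` case corresponds to the situation in the dmesg snippet, where the abort attempt fails and every later transfer is refused because HOST_BUSY is still set.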

3) Soft SMBus Reset (SSRESET): bit 3 of HOSTC - Host Configuration Register (SMBus - D31:F3)

According to the datasheet, this bit can be used to reset the SMBus state machine. Can it also be used to clear the HOST_BUSY bit?
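For anyone who wants to inspect this bit from user space, a rough Python sketch follows. It assumes the host configuration register sits at offset 0x40 of the SMBus function's PCI configuration space (worth verifying in the datasheet) and uses bit 3 for SSRESET as stated above. The helper names `read_hostc` and `ssreset_is_set` are made up for illustration. On Linux the configuration space is exposed through sysfs, and reads beyond the first 64 bytes normally require root.

```python
HOSTC_OFFSET = 0x40   # assumed PCI config offset of the host configuration register
HOSTC_SSRESET = 0x08  # bit 3, as quoted in question 3


def read_hostc(config_path):
    """Read the single host configuration byte from a PCI config-space file."""
    with open(config_path, "rb") as f:
        f.seek(HOSTC_OFFSET)
        data = f.read(1)
    if len(data) != 1:
        raise OSError("could not read offset 0x40; reads past byte 63 usually need root")
    return data[0]


def ssreset_is_set(hostc):
    """Return True if the SSRESET bit is set in the given register value."""
    return bool(hostc & HOSTC_SSRESET)
```

On the board in this report, the path would presumably be `/sys/bus/pci/devices/0000:00:1f.3/config`, based on the device address shown in the dmesg lines. This sketch only reads the register; whether writing SSRESET actually clears HOST_BUSY is exactly the open question.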

4) FAILED: bit 4 of HST_STS - Host Status Register (SMBus - D31:F3)

The datasheet says "This bit is set in response to the KILL bit being set to terminate the host transaction and it is cleared by software writing 1 to it".

However, in my tests the FAILED bit is cleared after the KILL bit is set and before KILL is unset, without any software write to clear it.

Could you confirm this behavior?
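For reference, the "cleared by software writing 1 to it" behavior the datasheet describes can be modeled as below. This Python sketch only illustrates write-1-to-clear semantics; the helper `write_one_to_clear` is hypothetical, and treating every bit as write-1-to-clear (the default mask of 0xFF) is a simplifying assumption, not a statement about which HST_STS bits are actually clearable.

```python
HST_STS_FAILED = 0x10  # bit 4 of HST_STS


def write_one_to_clear(current, written, w1c_mask=0xFF):
    """Model a write-1-to-clear register.

    Each 1 in `written` (limited to `w1c_mask`) clears that bit;
    bits written as 0 are left unchanged.
    """
    return current & ~(written & w1c_mask) & 0xFF


status = 0x12  # FAILED (0x10) and bit 1 (0x02) both set
status = write_one_to_clear(status, HST_STS_FAILED)
print(hex(status))  # 0x2: FAILED cleared, bit 1 untouched
```

Under this model FAILED would stay set until software writes 1 to it, which is what differs from the behavior observed in the test described above.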

5) Is there a way to hardware reset SMBus controller?

CarlosAM_INTEL · ‎02-04-2020

Hello, @pchik3:

Thank you for contacting Intel Embedded Community.

Could you please clarify whether this situation happens on your own design or on a third-party design?

If it is a third-party device, could you please provide the name of the manufacturer, its model, the part number, and where its documentation can be found?

In addition, could you please let us know how many units of the affected project have been manufactured? How many are affected? Could you please give the failure rate? Also, could you please list the sources that you used to design it and whether it has been verified by Intel? Could you please describe the procedure that you followed to identify this issue?

Could you please give pictures of the top side markings of the affected processors?

We are waiting for your reply to these questions.

Best regards,

@Mæcenas_INTEL.

pchik3 · ‎04-02-2020

Thanks for getting back to us.

This issue was encountered on the customer's own board. Below are our answers to your questions.

Could you please clarify whether this situation happens on your own design or on a third-party design?

>> This is happening in our Juniper design.

If it is a third-party device, could you please provide the name of the manufacturer, its model, the part number, and where its documentation can be found?

>> The issue occurs on the Juniper RIO/RIO-X platform, a Broadwell-DE based design.

In addition, could you please let us know how many units of the affected project have been manufactured? How many are affected? Could you please give the failure rate? Also, could you please list the sources that you used to design it and whether it has been verified by Intel? Could you please describe the procedure that you followed to identify this issue?

>> The issue has been seen in multiple instances, locations, and units. We do not think it is specific to a particular unit or hardware scenario. The B-DE platform schematic and layout were reviewed by Intel.

Could you please give pictures of the top side markings of the affected processors?

>> We cannot share the pictures. This is a Broadwell-DE processor.

CarlosAM_INTEL · ‎04-02-2020

Hello, @pchik3:

Thanks for your reply.

In order to help, we have sent an email to you.

Best regards,

@Mæcenas_INTEL.

pchik3 · ‎04-03-2020

Thanks for the clarification. I will check with the IPS team and create a new ticket.