ALyak
Beginner
1,824 Views

82599EB: DRHD & DMAR faults, followed by Detected Tx Unit Hang

Hello everybody,

we're running the Ubuntu Natty 2.6.38-13.53 kernel with the ixgbe 3.2.9-k2 and ixgbevf 1.0.19-k0 drivers. We use 82599EB dual-port NICs. Each port spawns 10 VFs, which are then attached to virtual machines under KVM.

Frequently we experience network failures, which start like this:

Dec 22 14:41:07 ccmaster kernel: [190048.835136] DRHD: handling fault status reg 2
Dec 22 14:41:07 ccmaster kernel: [190048.864523] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
Dec 22 14:41:07 ccmaster kernel: [190048.864525] DMAR:[fault reason 06] PTE Read access is not set
Dec 22 14:41:07 ccmaster kernel: [190049.014923] DRHD: handling fault status reg 102
Dec 22 14:41:07 ccmaster kernel: [190049.044511] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
Dec 22 14:41:07 ccmaster kernel: [190049.044513] DMAR:[fault reason 06] PTE Read access is not set
Dec 22 14:41:08 ccmaster kernel: [190050.355215] DRHD: handling fault status reg 202
Dec 22 14:41:08 ccmaster kernel: [190050.385040] DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000
Dec 22 14:41:08 ccmaster kernel: [190050.385041] DMAR:[fault reason 06] PTE Read access is not set
Dec 22 14:41:09 ccmaster kernel: [190051.007798] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:09 ccmaster kernel: [190051.043515] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:09 ccmaster kernel: [190051.471541] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:09 ccmaster kernel: [190051.510908] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:10 ccmaster kernel: [190051.885971] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:10 ccmaster kernel: [190051.923664] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:10 ccmaster kernel: [190051.925334] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:10 ccmaster kernel: [190051.964411] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:10 ccmaster kernel: [190052.195640] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
Dec 22 14:41:10 ccmaster kernel: [190052.235159] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
Dec 22 14:41:11 ccmaster kernel: [190053.001909] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:11 ccmaster kernel: [190053.040401] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:12 ccmaster kernel: [190053.882821] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:12 ccmaster kernel: [190053.920700] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:12 ccmaster kernel: [190053.922305] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:12 ccmaster kernel: [190053.960197] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:13 ccmaster kernel: [190054.612941] ixgbe 0000:03:00.1: eth103: Detected Tx Unit Hang
Dec 22 14:41:13 ccmaster kernel: [190054.612943] Tx Queue <0>
Dec 22 14:41:13 ccmaster kernel: [190054.612944] TDH, TDT <100>, <122>
Dec 22 14:41:13 ccmaster kernel: [190054.612944] next_to_use <122>
Dec 22 14:41:13 ccmaster kernel: [190054.612945] next_to_clean <102>
Dec 22 14:41:13 ccmaster kernel: [190054.612946] tx_buffer_info[next_to_clean]
Dec 22 14:41:13 ccmaster kernel: [190054.612946] time_stamp <10121fa2d>
Dec 22 14:41:13 ccmaster kernel: [190054.612947] jiffies <10121fc55>
Dec 22 14:41:13 ccmaster kernel: [190054.838626] ixgbe 0000:03:00.1: eth103: tx hang 1 detected on queue 0, resetting adapter
Dec 22 14:41:13 ccmaster kernel: [190054.838782] ixgbe 0000:03:00.1: eth103: Reset adapter
Dec 22 14:41:13 ccmaster kernel: [190054.866337] ixgbe 0000:03:00.1: eth103: RXDCTL.ENABLE on Rx queue 20 not cleared within the polling period
Dec 22 14:41:13 ccmaster kernel: [190055.083995] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:13 ccmaster kernel: [190055.232255] ixgbe 0000:03:00.1: master disable timed out
Dec 22 14:41:15 ccmaster kernel: [190057.418550] ixgbe 0000:03:00.1: eth103: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 22 14:41:15 ccmaster kernel: [190057.420402] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:15 ccmaster kernel: [190057.420405] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:15 ccmaster kernel: [190057.451889] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:15 ccmaster kernel: [190057.491834] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:15 ccmaster kernel: [190057.538455] ixgbe 0000:03:00.1: eth103: NIC Link is Down
Dec 22 14:41:16 ccmaster kernel: [190058.001181] DRHD: handling fault status reg 302
Dec 22 14:41:16 ccmaster kernel: [190058.029084] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000
Dec 22 14:41:16 ccmaster kernel: [190058.029086] DMAR:[fault reason 06] PTE Read access is not set
Dec 22 14:41:16 ccmaster kernel: [190058.338892] DRHD: handling fault status reg 402
Dec 22 14:41:16 ccmaster kernel: [190058.367063] DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000
Dec 22 14:41:16 ccmaster kernel: [190058.367064] DMAR:[fault reason 06] PTE Read access is not set
Dec 22 14:41:16 ccmaster kernel: [190058.508637] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:17 ccmaster kernel: [190058.874750] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:17 ccmaster kernel: [190058.874797] ixgbe 0000:03:00.1: eth103: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Dec 22 14:41:17 ccmaster kernel: [190058.876606] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:17 ccmaster kernel: [190058.876609] br103: port 1(eth103) entering forwarding state
Dec 22 14:41:17 ccmaster kernel: [190058.912721] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:17 ccmaster kernel: [190058.914695] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:17 ccmaster kernel: [190058.952633] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 6
Dec 22 14:41:17 ccmaster kernel: [190058.981264] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:17 ccmaster kernel: [190059.021207] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2
Dec 22 14:41:17 ccmaster kernel: [190059.184576] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
Dec 22 14:41:17 ccmaster kernel: [190059.224515] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 4
Dec 22 14:41:17 ccmaster kernel: [190059.449368] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:17 ccmaster kernel: [190059.488744] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 7
Dec 22 14:41:19 ccmaster kernel: [190060.872227] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 3
Dec 22 14:41:19 ccmaster kernel: [190060.911541] ixgbe 0000:03:00.1: eth103: ...
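
A log like this is easier to reason about once it is tallied. The following is a minimal Python sketch, not a supported tool, that counts DMAR faults per requesting device and fault address, and VF reset messages per VF. The regular expressions are assumptions based only on the message formats visible in the excerpt above, and `summarize` and `SAMPLE` are names made up for this illustration.

```python
import re
from collections import Counter

# Patterns are modelled on the message formats seen in the excerpt above
# (assumption: other kernel versions may word these lines differently).
DMAR_FAULT = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] "
    r"fault addr ([0-9a-f]+)"
)
VF_RESET = re.compile(r"VF Reset msg received from vf (\d+)")


def summarize(lines):
    """Count DMAR faults per (device, address) and VF resets per VF index."""
    faults = Counter()
    resets = Counter()
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            faults[(match.group(2), match.group(3))] += 1
            continue
        match = VF_RESET.search(line)
        if match:
            resets[int(match.group(1))] += 1
    return faults, resets


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "Dec 22 14:41:07 ccmaster kernel: [190048.864523] DMAR:[DMA Read] Request device [03:11.7] fault addr 79634000",
    "Dec 22 14:41:08 ccmaster kernel: [190050.385040] DMAR:[DMA Read] Request device [03:11.7] fault addr 77a92000",
    "Dec 22 14:41:09 ccmaster kernel: [190051.007798] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2",
    "Dec 22 14:41:09 ccmaster kernel: [190051.043515] ixgbe 0000:03:00.1: eth103: VF Reset msg received from vf 2",
]

if __name__ == "__main__":
    faults, resets = summarize(SAMPLE)
    for (device, addr), count in faults.most_common():
        print(f"DMAR fault: device {device} addr {addr} x{count}")
    for vf, count in sorted(resets.items()):
        print(f"VF {vf}: {count} reset message(s)")
```

To run it against a real log, pass an open file object to `summarize` instead of `SAMPLE`. In the excerpt above every DMAR fault comes from the same requester, [03:11.7], and only two fault addresses appear.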
6 Replies
Mark_H_Intel
Employee
107 Views

Hi Alex,

Was this issue ever resolved? Was it related to the slot, like your post at http://communities.intel.com/message/146581#146581?

Mark H

ALyak
Beginner
107 Views

Hi Mark,

thank you for noticing my question.

We did not find an exact root cause for the issue.

A couple of things we did:

- downgraded to kernel 2.6.38-8.48 (which has the same ixgbe/ixgbevf drivers)

- used the following sysctl settings for bridges:

net.bridge.bridge-nf-call-iptables=0

net.bridge.bridge-nf-call-ip6tables=0

net.bridge.bridge-nf-call-arptables=0
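
For anyone who wants to confirm that these values are actually in effect on a host, here is a small Python sketch. It relies on the standard mapping of a sysctl key to a file under /proc/sys (dots become slashes). The helper names `proc_path` and `check` are invented for this example, and the bridge-nf files exist only while the bridge netfilter code is loaded, so a missing file is reported as `None` rather than treated as an error.

```python
from pathlib import Path

# The three settings listed above, with the values used in this thread.
BRIDGE_NF_SETTINGS = {
    "net.bridge.bridge-nf-call-iptables": 0,
    "net.bridge.bridge-nf-call-ip6tables": 0,
    "net.bridge.bridge-nf-call-arptables": 0,
}


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key to its file under /proc/sys (dots become slashes)."""
    return Path(root) / key.replace(".", "/")


def check(settings, root="/proc/sys"):
    """Return {key: (wanted, actual)} for every key whose live value differs.

    `actual` is None when the file cannot be read, for example because
    the bridge netfilter code is not loaded.
    """
    mismatches = {}
    for key, wanted in settings.items():
        try:
            actual = proc_path(key, root).read_text().strip()
        except OSError:
            actual = None
        if actual != str(wanted):
            mismatches[key] = (str(wanted), actual)
    return mismatches


if __name__ == "__main__":
    for key, (wanted, actual) in check(BRIDGE_NF_SETTINGS).items():
        print(f"{key}: wanted {wanted}, found {actual}")
```

An empty result from `check` means all three values match. To keep the settings across reboots, the same three `key=value` lines would normally go into /etc/sysctl.conf.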

Currently we're not experiencing this issue. Do you have any clue what issues those DRHD/DMAR faults and "Tx Hang" messages point to?

Thanks,

Alex.

Mark_H_Intel
Employee
107 Views

Hi Alex,

Most Linux questions are outside my area of expertise, so I will do some checking with our developers to see what I can find out. Since your connections are stable, you might not want to spend time experimenting with updated drivers, KVM, or an updated kernel. Newer versions of any of these might make a difference. Of course, I do not have any specific information that making any of those changes will help.

I know that suggesting driver and component updates is the default answer from technical support guys, but that is what I am. I cannot help myself. I always recommend updating to the latest versions UNLESS everything is stable and performing as desired. Sometimes leaving things alone is a good choice.

I will let you know what I find out from our developers. Have a great day.

Mark H

ALyak
Beginner
107 Views

Mark, thanks for your reply.

I am basically looking for a clue as to what these error messages might indicate. As a dev myself, I know that whenever a component prints an error message, something went wrong, and the component knows which operation did not succeed (though maybe not why). So if you can dig out a clue as to what error these messages indicate, it would help a lot.
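
As a starting point, the numbers in the "DRHD: handling fault status reg" lines can be unpacked. The Python sketch below assumes that the kernel prints the VT-d Fault Status Register in hexadecimal, and uses the register layout as I understand it from the VT-d specification: bit 0 is Primary Fault Overflow, bit 1 is Primary Pending Fault, and bits 15:8 hold the Fault Record Index. The function names are made up for illustration, and this is an interpretation, not an official decoder.

```python
# Bit positions per the VT-d Fault Status Register layout (assumption:
# the value in the "fault status reg" log line is this register, in hex).
PRIMARY_FAULT_OVERFLOW = 1 << 0
PRIMARY_PENDING_FAULT = 1 << 1


def decode_fault_status(reg_hex):
    """Split a logged fault status value into the fields used here."""
    value = int(reg_hex, 16)
    return {
        "primary_fault_overflow": bool(value & PRIMARY_FAULT_OVERFLOW),
        "primary_pending_fault": bool(value & PRIMARY_PENDING_FAULT),
        "fault_record_index": (value >> 8) & 0xFF,
    }


def decode_bdf(bdf):
    """Split a 'bus:device.function' string such as '03:11.7' into integers."""
    bus, rest = bdf.split(":")
    device, function = rest.split(".")
    return int(bus, 16), int(device, 16), int(function, 16)


if __name__ == "__main__":
    # Values taken from the log at the top of this thread.
    for reg in ("2", "102", "202", "302", "402"):
        print(reg, decode_fault_status(reg))
    print(decode_bdf("03:11.7"))
```

Read this way, the values 2, 102, 202, 302 and 402 would each report one pending primary fault, with the fault record index advancing from 0 to 4. The accompanying "PTE Read access is not set" line says that device [03:11.7] attempted a DMA read from an address whose IOMMU page table entry does not grant read permission.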

Thanks again!

Patrick_K_Intel1
Employee
107 Views

Hi Alex,

I did some digging around with the experts and discovered that this was found more than two years ago. It would appear to involve bugs in the BIOS, kernel, and driver.

Here are the Bugzilla reports that should help with understanding the problem:

https://bugzilla.redhat.com/show_bug.cgi?id=541397

https://bugzilla.redhat.com/show_bug.cgi?id=538163

https://bugzilla.redhat.com/show_bug.cgi?id=568153 (ixgbe related)

Regards,

Patrick

ALyak
Beginner
107 Views

Hello Patrick,

thanks for looking at this as well.

Those bug reports, however, seem to concern older kernel versions than the one we're using (2.6.38-8), although the ixgbe driver shipped by Ubuntu with this kernel (3.2.9-k2) is somewhat dated.

Currently, we don't see this issue anymore, perhaps due to disabling the bridge-nf settings.

Thanks!

Alex.
