Intel® ARC™ Graphics

VIDEO_TDR_FAILURE BSOD with A770 LE when ASPM L1 is enabled

SomethingForNothing

Hi, I'm encountering random BSOD crashes reporting VIDEO_TDR_FAILURE in low-load situations on Windows 11. The BSOD itself shows a particular kind of graphics corruption, as if the whole image were divided into many small rectangles. Roughly every second rectangle is black or flickering. The rectangles in between appear normal, so at least the larger text parts of the BSOD are readable. Then the system reboots.

No specific action triggers the error directly; it occurs randomly after 30 to 200 minutes of running idle or under low load (mail client, web browser in background, etc.).

 

I tracked the issue down to enabling ASPM L1, which I use to reduce the A770's idle power consumption.

The ASPM setting in BIOS is set to Auto which includes L1

The PCI Express Power saving in Windows is set to Maximum

As a result, A770's idle power consumption drops from about 40 W to about 15 W as shown by Arc Control.

However, the system also shows the unstable behavior described above.

 

Workaround: When I change PCI Express Power saving in Windows to Medium, idle power consumption of the A770 goes back to 40 W but the system is stable.
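
For reference, the same Windows setting can also be switched from a script instead of the Power Options dialog. The following is a minimal sketch, assuming the `powercfg` aliases `SUB_PCIEXPRESS` and `ASPM` are present on the system (they can be checked with `powercfg /aliases`) and that the index values are 0 = Off, 1 = Moderate (called Medium in this thread), 2 = Maximum. The helper names `build_aspm_commands` and `apply_aspm_level` are made up for illustration.

```python
import subprocess

# Assumed index values of the "Link State Power Management" setting:
# 0 = Off, 1 = Moderate power savings, 2 = Maximum power savings.
ASPM_LEVELS = {"off": 0, "moderate": 1, "maximum": 2}


def build_aspm_commands(level: str) -> list[list[str]]:
    """Return the powercfg calls that set PCIe link state power management
    for the active power plan (AC and DC) and then re-apply the plan."""
    index = str(ASPM_LEVELS[level])
    return [
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", index],
        ["powercfg", "/setdcvalueindex", "SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", index],
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


def apply_aspm_level(level: str) -> None:
    """Run the commands. Only meaningful on Windows, from an elevated prompt."""
    for cmd in build_aspm_commands(level):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands for the workaround instead of running them.
    for cmd in build_aspm_commands("moderate"):
        print(" ".join(cmd))
```

Calling `apply_aspm_level("moderate")` would correspond to the workaround above, and `apply_aspm_level("maximum")` to the setting that triggers the crashes.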

 

The issue persisted over different Arc driver versions. As a final step I reinstalled 5333 after using DDU which had no effect.

The issue persisted over different BIOS versions as well, both 7C84v1E where ASPM settings were introduced for my mainboard and 7C84v1F which is the most recent BIOS.

 

Specifications:

Intel A770 LE 16 GB

Arc Driver 31.0.101.5333

AMD Ryzen 5800X

MSI MAG X570 Tomahawk WiFi

32 GB RAM

Windows 11 Pro 23H2 Build 22631.3296

Corsair TX650M PSU

그래요ITech

Hello SomethingForNothing.

The VIDEO_TDR_FAILURE BSOD is often related to issues with the graphics card or its drivers. TDR stands for Timeout Detection and Recovery, and this error occurs when the graphics card stops responding within the allowed time and the attempt to reset the display driver and recover fails. The graphical corruption you're experiencing, with the screen divided into rectangles, suggests that the graphics card is not recovering from a timeout as expected.

Your troubleshooting steps have correctly identified that enabling ASPM L1 to reduce power consumption is linked to the instability. ASPM (Active State Power Management) is a PCI Express link power management mechanism that can sometimes cause instability when it is not properly supported by all components or drivers.

The workaround you've found, setting PCI Express Power saving to Medium, which stabilizes the system at the cost of increased power consumption, indicates that the power-saving features are likely causing the instability. It's a common trade-off between power efficiency and system stability.

Given that the issue persists across different Arc driver versions and BIOS updates, it seems like a compatibility issue between the power-saving features and your specific hardware configuration. Here are some additional steps you can consider:

 

1. Check for BIOS Updates: Ensure that your BIOS is up-to-date with the latest version from the motherboard manufacturer's website.
2. Check for Windows Updates: Make sure that Windows 11 is updated with the latest patches and updates.
3. Monitor Temperatures: Use software to monitor your GPU and CPU temperatures to ensure they are within normal ranges.
4. Stress Test: Run a stress test on your GPU to see if the issue can be replicated under controlled conditions.
5. Contact Support: Reach out to MSI support for further assistance, as they may have more information on compatibility issues with the A770 and your motherboard.

 

If none of these steps resolve the issue, you may need to consider whether the power savings are worth the instability, or if using the system at a higher power consumption is the more reliable solution.

 

Cheers,

Max

SomethingForNothing

Hi Max, thanks for your reply. Regarding the additional steps you mentioned I'm pretty confident I can rule out most other potential sources of errors, so long story short, it does appear like there currently is a compatibility issue between the A770, the X570 Tomahawk WiFi and ASPM L1. My hope is that it's something fixable by a (card or board) BIOS update.

1. BIOS is up to date, however as long as the issue persists I'll also try newer releases when/if MSI publishes one.

2. Windows is always updated with everything offered by Windows Update. Same goes for the A770 for which I hope a newer driver/firmware might fix the issue just like the lower idle power consumption was introduced with some (I cannot remember which) version of the Arc driver.

3. As for temperatures, when the crash happens it's under idle conditions with the CPU at 62°C and the A770 as well as its VRAM at 50°C. Under load I never had any crash so load temperatures are most likely unrelated to the issue but even while gaming all readings are well below 80°C which are in fact very good values for a Ryzen 5800X CPU. Also the A770 has its software limiter set to 86°C so that shouldn't be an issue either.

4. Also just for the sake of completeness as my crashes only occur in idle/low load situations I never had any issues running 3DMark or any 3D games with about 98% reported GPU activity.

5. I did in fact consider contacting MSI since both the A770 and the mainboard are involved. However, since the crashes always mention the graphics card and none of the other PCI Express devices, I reached out to Intel first. As you mentioned, the error indicates a failure of the graphics card or driver to respond (my gut feeling is that the card sometimes has trouble waking up when returning from L1, but I have no way to debug this). It might as well be a problem with the ASPM implementation of the board or BIOS, but my hope is that if it's a problem with the A770, it can be confirmed or, even better, fixed. If instead the result is that there is no issue with the A770's ASPM implementation or the driver, I'll be in a much better position when contacting MSI, as I can tell them that Intel has already investigated the issue and suggests that the board's implementation needs to be checked. Otherwise I'm quite certain that MSI will send me right back here to check whether Intel is aware of any compatibility issues with the A770 in certain environments.

 

With that said, I'm currently operating with the workaround and high idle power consumption. This keeps the system stable but I do hope for an actual solution in the future which allows me to enable ASPM L1 again.

LilithTwilight

Hi SomethingForNothing,

I see exactly the same phenomenon when I set PCI Express power saving in Windows to Maximum. But it seems to get better with every version of the Arc driver.

In older versions I had the blue screen once a week, and it was flickering as you described. In newer versions, 5739 at the moment, I get the blue screen every 2-3 weeks: the screen goes black and after a few seconds I get a "stable" blue screen with no flickering, so it is improving.

So hopefully the devs will fix this completely soon.

I have a Gigabyte mainboard, for the record.

 

 

그래요ITech
Thank you for your detailed response, SomethingForNothing.
It appears that you've thoroughly investigated the potential sources of the issue and have narrowed it down to a compatibility concern between the A770, the X570 Tomahawk WiFi, and ASPM L1.
  1. It's wise to keep your BIOS up to date, and trying newer releases from MSI when available could be beneficial in resolving the issue.

  2. Keeping Windows and the A770 updated is essential, and a newer driver or firmware update may indeed address the problem, similar to previous updates that improved idle power consumption.

  3. Your temperature observations are insightful, indicating that load temperatures are likely unrelated to the issue. The fact that crashes only occur during idle or low-load situations suggests that temperature is not a contributing factor.

  4. Your experience with 3DMark and 3D games further supports the notion that the issue is specific to idle or low-load conditions.

  5. Considering reaching out to MSI is a prudent step, especially if the issue persists after investigating with Intel. If the problem is determined to be with the A770, MSI may be able to provide further assistance or updates to address compatibility issues.

In the meantime, operating with the workaround to maintain system stability is a practical approach, but I understand your desire for a permanent solution. Hopefully, a future update or resolution will allow you to enable ASPM L1 without issues.

 

Cheers,

Max

SomethingForNothing

Hi LilithTwilight, sorry to hear that ASPM L1 isn't working for you either. I'm at 5382 currently and still getting the BSODs once the system goes idle or is operated under low load. It doesn't happen immediately, and I can see in Arc Control that the A770 alternates between around 15 W and 40 W power consumption for a while. So sometimes the wakeup does work, but at some point it triggers the BSOD. It's hard to quantify, but subjectively I didn't notice any change across the different Arc driver or BIOS versions I used. I do share your hope though that it can be fixed completely. By the way, my mainboard is currently on AGESA 1.2.0.B.

그래요ITech
Hi. Apologies, it's been quite a hectic week for me.
Are you still experiencing random BSODs and crashes on your system?
Also, have you tried to contact MSI about this concern?

Cheers,
Max
SomethingForNothing

Hi Max, my experience regarding stability with ASPM L1 hasn't changed. Meanwhile I was able to test with both an upgraded A770 graphics driver (5382) and an upgraded BIOS (7C84v1G which comes with the newer AGESA 1.2.0.C).

The only (subtle) difference I noticed is that upon crashing the card wasn't able to display the (visually corrupted) BSOD as it used to but instead showed a black screen until the reboot happened.

Meanwhile Arc driver 5444 is installed on my system but I haven't tested setting ASPM L1 with version 5444 because the driver's changelog didn't mention anything related to this issue. However, with PCIE energy savings set to medium the system is still stable.

I have not yet contacted MSI because I'd like to be able to provide them with (at least preliminary) results from my investigation with Intel.

SomethingForNothing

Hi, meanwhile I'm on Arc driver 5448 and thought I'd give the maximum PCIE energy savings setting a try again. Surprisingly (since no changes in the context of power management are mentioned) the system appears more stable so far. So far I've had a full day of stable operation which I had not achieved with the previous versions. BIOS is unchanged at version 7C84v1G, Windows ran its usual updates. So I cannot tell where the change happened but from my observations it's quite likely that something somewhere was improved. So I can confirm LilithTwilight's observations that things are improving on my setup as well.

I'll keep ASPM L1 enabled and will observe system stability over the next days. What I can say so far is that according to Arc telemetry the system again goes to the lower power states, with a reported power consumption between 13 W and 26 W in idle/low load situations. Sometimes the card seems to get "stuck" at 41 W as a lower limit. It never goes below this value even when all (obvious) applications are closed and just the plain desktop remains. I have not yet found out what triggers this. It could be the situation which would have resulted in the previously observed BSOD, but that's not much more than a wild guess. After either a reboot or just logging the current user off and on again, the card is able to draw less than 41 W again.
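
The "stuck at 41 W" pattern described above could be spotted in noted-down power readings with a small script. This is only a sketch, assuming the readings are available as a plain list of watt values; the 30 W threshold separating the low state (13 W to 26 W) from the high state (around 41 W) and both function names are assumptions for illustration, not anything provided by Arc Control.

```python
def summarize_idle_power(samples_w, low_state_max_w=30.0):
    """Count how many samples fall into the low-power state and how many do not."""
    low = sum(1 for w in samples_w if w <= low_state_max_w)
    return {
        "low": low,
        "high": len(samples_w) - low,
        "reached_low_state": low > 0,
    }


def longest_high_run(samples_w, low_state_max_w=30.0):
    """Length of the longest run of consecutive samples above the threshold,
    i.e. how long the card stayed out of the low-power state in one stretch."""
    longest = current = 0
    for w in samples_w:
        current = current + 1 if w > low_state_max_w else 0
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    # Made-up example readings in watts, one per sampling interval.
    readings = [15, 14, 41, 42, 41, 20]
    print(summarize_idle_power(readings))
    print(longest_high_run(readings))
```

A long high run during an otherwise idle desktop session would match the "stuck" behavior, while a mix of low and high samples would match normal switching between power states.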

TLDR: With all software at their current versions as of now and ASPM L1 enabled the system appears more stable so far. I haven't observed a crash in a full day but it's too early to say it's perfectly stable now.

I'm quite happy with how stability appears to be improving behind the scenes. However, I'd appreciate a little more information about whether something is being worked on or changed, so that I could test more systematically and guess less. Still, progress appears promising so far.

SomethingForNothing

Too early to celebrate. With the settings posted above, the system just crashed to a black screen (while using Firefox) and automatically rebooted. Before the screen went black, I briefly observed a small rectangular area flickering in the area of the mouse cursor. Then the screen went black and after a few seconds the system rebooted. For the time being, I'm reverting to the workaround settings of PCIE energy savings set to just medium.

LilithTwilight

I had 5522, which seemed stable over weeks with no BSOD or reboot. Now I have updated to 5768 and the BSODs and reboots are back with "Max PCI Express Power saving" selected -.-

 

It seems the error was fixed (or much improved) and now it is back and worse again, so @intel, what did you change from 5522 to 5768 to bring the error back? Please change it back ^_-

SomethingForNothing

Interesting observations... Especially considering that none of the changelogs mentioned anything about changes in power savings. I must say I've given up a little on testing power savings. I had 5592 running stable (with just medium power savings) and did updates mostly to have support for current games. Beginning with 5593 I got a different bluescreen (with exactly the same settings as the formerly stable configuration) about HYPERVISOR_ERROR, which luckily got resolved with 5762 and the included firmware. So for me (minus power settings) 5762 is currently stable.

I tried setting maximum power savings briefly, but GPU power consumption didn't go below 42 W anymore (with 5762) after the system had been running for a while. This means that for me the setting doesn't give any actual power savings (it used to do so with older drivers), so I reverted to the known stable workaround of medium power savings.

LilithTwilight

Thank you for sharing your observations.

 

At the moment I have also switched back to medium PCIe power savings, to check whether the driver runs more stably with it. If the GPU takes more power, OK; I have 4 monitors connected, so I think its power consumption will not go down that much anyway. What bothers me more is that with medium power savings my NVMe drives also get hotter than with maximum, because they also benefit from the setting... but if the GPU gets unstable I can't use it.

So I will see if this is more stable now... even if this gets a little bit frustrating...

 

Why has nobody from Intel ever answered in this thread or commented at all? It seems to me this problem is of no importance to Intel.

SomethingForNothing

I'm rather lucky and need it only for the graphics card. It would be nice to have it go down from 40 W to 20 W in low load situations, but it's not a big deal for me. But I really feel your frustration.

 

So far 5768 after a clean install has been stable for me even with maximum power savings. But then, I only have one screen connected, so my setup is probably easier for the card to handle without crashing. I have maximum power savings enabled now and repeatedly observed that after longer system uptime, the card did not go from 40 W (which it also consumes in medium mode) to the 20 W level (which seems to be the actual power saving mode it can enter when PCIe power savings is set to maximum).

 

Reading back, I had a similar impression with 5448 earlier. It appears as if the card would indeed crash upon waking up in specific situations and the newer drivers prevent the GPU's sleep mode in such "known troublesome" situations. This would result in fewer crashes but also in the card more frequently not going to idle in low load situations. Just a wild guess, but it looks that way.

 

I think Max was from Intel, at least he had the tech tag after his name. Unfortunately he didn't reply anymore when I suspected that the error appears to be something in Intel's domain rather than MSI's (motherboard manufacturer).

 

Maybe 5768 works for you, too? If not, do you have any chance to test with just one screen? For me it's rather stable so far. If you manage to show that it crashes with 4 screens but not with 1 screen, you'd have a very easy to reproduce situation which might be worth opening another ticket.

LilithTwilight

At the moment I already have 5768 and am trying it with medium settings to see if it runs stable that way. If it no longer crashes, that would also be a hint that something is wrong with the driver. Also, 5522 ran stable for weeks without a crash AND with maximum power savings, so something in the driver has changed back to a state where it is not stable anymore. I'll try to report back whether or not it runs stable with these settings. Eventually, if a new driver comes out, I will test that too.

 

I don't think Max was from Intel. The "Tech" means nothing, and he has no official logo and no moderator status, so to me he's just another normal user who tries to help but is not from Intel.

 

One monitor is not an option for me at the moment. I also need the PC for work, and it's not usable for me with only one monitor.

SomethingForNothing

After about 2 weeks of crash-free operation, I got the VIDEO_TDR_FAILURE again with 5768. For me it's the most stable version so far, but I'll go back to medium again. I do fear our problem is a rare case now; otherwise I'd have expected someone with a similar problem to at least find this still active thread and add that they were having issues, too.

 

Regarding your monitor, I was mainly thinking of a temporary debugging setup just to create two scenarios where ideally the single monitor setup would be stable while the multi monitor setup would still crash without anything else changed. This might be specific enough for another bug report to actually reach the tech department.

SomethingForNothing

That's nice, so at least web search will have some documentation if others have the same problem. I got the impression most people don't notice because hardly anyone sets the Intel recommended setting for better power saving on a desktop. At least in my setup, I would have never noticed any problems (apart from application crashes in the earlier versions when the driver was rather unstable. But luckily these times seem long ago).

So far, 5768 and maximum power savings haven't crashed on me. I'll keep my fingers crossed that it'll stay that way and that this version will do it for you, too.
