Closed Bug 1877726 Opened 5 months ago Closed 4 months ago

Rendering stops, but content reacts to input

Categories

(Core :: Graphics, defect, P1)

Unspecified
Windows 11
defect

Tracking

()

RESOLVED FIXED
124 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox122 --- unaffected
firefox123 blocking fixed
firefox124 --- fixed

People

(Reporter: chutten, Assigned: bas.schouten)

References

(Depends on 1 open bug, Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression, regressionwindow-wanted)

Attachments

(5 files)

Crash report: https://crash-stats.mozilla.org/report/index/87550adc-5c60-424b-ab59-b70de0240131

On my Windows Nightly I opened a private browser to perform a youtube search for Andrew W.K. and then clicked on the first result to listen to Party Hard. Though the window remained painting the youtube SERP, and the browser looked like it locked up, the audio from the video played through my headphones.

Symptoms:

  • Painting stops. Resizing the window means expanded sections (tabstrip, content, all) are white.
  • Web Content remains responsive. (e.g. a Youtube video was playing and kept playing. Pressing K paused/unpaused it. I could open a new tab to about:crashparent which is how I got the crash report)

(( I've been experiencing "lockups" on my Linux Nightly for the past little while and assumed it was something stupid to do with my borked GTK or nvidia or whatever. But then it happened on my Windows Nightly, so maybe it's the same problem and maybe it's not. Now that I know it's still responsive when it appears deadlocked, I'll try capturing profiles next time ))

Not sure what's going on, but now I have the profiler installed on that profile so I can try capturing that if it happens again.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: General → Widget: Gtk
Product: Firefox → Core
Component: Widget: Gtk → Graphics

Sounds similar to bug 1877515

See Also: → 1877515

At least superficially, it doesn't look like it is hanging in the same place as bug 1877515, where in that profile we see a block down in WebRender inside a D3D device flush.

OTOH, there are no renderer threads in the crash report, which is highly suspicious as well.

Can you maybe try to get a profile with the Graphics preset while the situation is happening, so we can compare with the one we have for bug 1877515?

Blocks: gfx-triage
Severity: -- → S2
Flags: needinfo?(chutten)

Glenn, or Andrew, any off-hand guesses what might be causing this? Is this a WR thing? A media thing?

Flags: needinfo?(gwatson)
Flags: needinfo?(aosmond)

Wonder if there is any tangential relation to bug 1876380, and some issue related to stack size...

See Also: → 1876380

I'll try my best to get a profile if it happens again, but it resists intentional reproduction. I will note that unlike bug 1877515 things didn't improve after 30s (it was in its state for several minutes, at least long enough to play Party Hard more than once) so either the gpu process wasn't restarted or it didn't help.

Flags: needinfo?(chutten)

Nothing obvious I can think of, let's see if a profile can be captured that provides any clues.

Flags: needinfo?(gwatson)
See Also: → 1878101

I can repro this (although not reliably) on my Intel Tigerlake laptop on Windows as well. Opening the web developer tools (Ctrl+Shift+I) on several pages seems to contribute, particularly YouTube and certain other heavy pages like the Element Matrix client and Google Docs, but I haven't found an exact trigger. Since the browser is still interactive (even if I can't see what is going on), I tried hitting Ctrl+N, and that restarted the GPU process (in software rendering, however, which is a fallback mode) and made the browser usable again. I'm going to keep an about:memory window open so that I can press the "Measure and save" button next time this happens and see what is accumulating vast amounts of memory while it is frozen like this (probably just the animated Dark Space theme by nicothin and the YouTube video). Memory was growing at about 24MB/s according to Task Manager, and it ended up running the laptop out of disk space for the swap file.

Dangit, it happened again, but this time on my Beta 123.0b4 (no profiler extension. I only added it to Nightly). Once again on my Windows laptop. Once again when opening a youtube video in a tab in a private browsing window (this time the link was https://youtu.be/L4u7Zy_b868?si=qPxnJISoAZ8w2dRt&t=482 ), but this time the video couldn't be heard playing and pressing K to play it didn't work. Curious.

Without the profiler, I had no way to capture interesting information, so I used Task Manager to identify and End Task the gpu process. Got a frame of white, then rendering came back, and as soon as it rendered the private tab, the youtube video began playing. And now everything's working just peachily, with the gpu process having about 300MB lower allocations than when I killed it (566 down from 870ish)

Now to enable the profiler and to wait to see if it happens again. Twice in under a week, odds are decent I'll get another hit before too long.

I ran into this bug as well last night. I'm on Windows 11, current Nightly. No profile unfortunately, but I can recount a blow by blow:

  • I was watching a YouTube video (call this Tab A)
  • the browser appeared to freeze visually, but the audio continued playing
  • I clicked on another tab already open in the tab strip (call this Tab B) and was able to see Tab B, but the audio from Tab A was still playing
  • the browser became totally frozen visually, showing Tab B
  • when I clicked on Tab B, the clicks were actually interacting with the video in Tab A. So I could pause/resume the Tab A video while the browser was visually frozen on Tab B
  • at that point, all I could do was restart my machine, after which things were back to normal

I have seen this 3-4 times in the last week or so. Symptoms are similar to comment #10 and comment #9. I tried to record a profile, but during the "capture" phase, the profiler got stuck. I ultimately had to "end task" Firefox.

Heya @padenot, @alwu:
Does this sound like overproduction?

Flags: needinfo?(padenot)
Flags: needinfo?(alwu)

@sotaro: Could a wait call be getting stuck here?

Flags: needinfo?(sotaro.ikeda.g)

[Tracking Requested - why for this release]: I'm worried that this is a regression in 123 and it seems bad when it happens.

Crash Signature: [ @ CrashChannel::OpenContentStream ]

Removing the signature, as it is an artifact of the method :chutten used to get some stacks.

What we want here is an ETW trace of this happening, since we cannot use the Firefox Profiler.

:chutten, Mayank, do you mind setting your machine up with https://github.com/google/UIforETW/releases, and making an initial capture of something random, because the tool will set itself up on first use, then making a capture when this happens, and sharing it with a mozilla developer? I don't know who yet, I can help routing this to the right people I guess.

A small tutorial is at https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/ ("How do you use it?"), but I'm around if need be.

Crash Signature: [ @ CrashChannel::OpenContentStream ] → (removed)
Flags: needinfo?(padenot)

The bug is marked as tracked for firefox123 (beta). We have limited time to fix this, the soft freeze is in 6 days. However, the bug still isn't assigned.

:bhood, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

If this only happens on Windows, it might be related to hardware decoding crashing the whole GPU process due to a driver or graphics card problem. If it happens on Linux as well, then it's less likely to be caused by that, because on Linux decoding doesn't run in the same process as rendering.

:chutten, :scunnane, when the tab is frozen, would you still be able to open the Web Developer Tools (Ctrl + Shift + I) on the frozen tab? If so, could you try to use this add-on. After installing that add-on, you can see a Media-Webrtc tab in the devtools. Clicking the icon before "Media Info" expands all the information about the media pipeline, and clicking "save page as" then saves the whole page. Then we can at least know the status of the media pipeline at the moment the tab is frozen.

Flags: needinfo?(alwu)
Flags: needinfo?(scunnane)
Flags: needinfo?(chutten)

:alwu, I've installed the dev tools media panel, but I'm not able to intentionally reproduce this bug at the moment. However I have encountered the bug about 5 times now over the past week and a half, so I imagine I'll be able to grab the media pipeline info for the frozen tab soon enough.

(In reply to Paul Adenot (:padenot) from comment #15)

What we want here is an ETW trace of this happening, since we cannot use the Firefox Profiler.

I've downloaded, unzipped, run (with sdk download), and given a go at recording an ETW trace. The resulting trace opened in WPA and appears to be working, so I'll set that to run and then wait for reproduction. (Though I admit I haven't had luck running into this problem in the past week : | ).

(In reply to Alastor Wu [:alwu] from comment #17)

If this only happens on Windows, it might be related to hardware decoding crashing the whole GPU process due to a driver or graphics card problem. If it happens on Linux as well, then it's less likely to be caused by that, because on Linux decoding doesn't run in the same process as rendering.

The GPU process was alive and allocating memory (according to Task Manager) when this last happened to me, if it helps.

:chutten, :scunnane, when the tab is frozen, would you still be able to open the Web Developer Tools (Ctrl + Shift + I) on the frozen tab? If so, could you try to use this add-on. After installing that add-on, you can see a Media-Webrtc tab in the devtools. Clicking the icon before "Media Info" expands all the information about the media pipeline, and clicking "save page as" then saves the whole page. Then we can at least know the status of the media pipeline at the moment the tab is frozen.

It's not just the tab that freezes, it's the whole shebang, so I'm not sure whether devtools will paint for me. But I've installed the addon in case it works out.

Flags: needinfo?(chutten)

Here's chutten's profile: https://share.firefox.dev/4bvQwFM. The affected GPU process should be 10656. Weirdly, it seems to be drawing like normal.

Another thing to try when this happens is dxcap -forcetdr, to see if that gets things going again.

I'm curious what happens to you guys who are experiencing this if you do a Win+Ctrl+Shift+B to restart the GPU driver? If it stops the behavior, it may point to something with the driver or interaction with the driver. Everyone having this issue has fully updated GPU drivers?

Also, if it happens again please kill the gpu process with https://github.com/b0bh00d/crash-firefox so that we can get a crash report for the GPU process.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

Here's chutten's profile: https://share.firefox.dev/4bvQwFM. The affected GPU process should be 10656. Weirdly, it seems to be drawing like normal.

After talking with chutten, it seems like Firefox was not hung in the normal drawing part of the profile. My guess is that the hang happens around the 3m36s mark of the profile. There's not a thread that seems obviously hung after that point, i.e. mozilla::wr::RenderThread::HandleWrNotifierEvents still successfully gets called on the Renderer thread but doesn't spend any time that shows up in the profile in HandleFrameOneDocInner.

Mayank, what GPU are you seeing this happen on?

Flags: needinfo?(mayankleoboy1)
Attached file about:support

(In reply to Jeff Muizelaar [:jrmuizel] from comment #25)

Mayank, what GPU are you seeing this happen on?

Unfortunately, since I posted comment #11, I haven't reproduced this bug.

Flags: needinfo?(mayankleoboy1)

When you did see it what GPU were you using?

Flags: needinfo?(mayankleoboy1)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)

When you did see it what GPU were you using?

I have been using the same GPU and drivers. Nothing has changed in my hardware/software configuration since I posted comment #11. I reproduced the freeze 2-4 times in a span of 2 days, reported it here, and then never saw it again.
I have an AMD 5800U APU. (Please see the attached about:support in comment #26.)

Flags: needinfo?(mayankleoboy1)

(In reply to Mayank Bansal from comment #29)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)

When you did see it what GPU were you using?

I have been using the same GPU and drivers. Nothing has changed in my hardware/software configuration since I posted comment #11. I reproduced the freeze 2-4 times in a span of 2 days, reported it here, and then never saw it again.
I have an AMD 5800U APU. (Please see the attached about:support in comment #26.)

Mayank,

You're also on Win 11 and you're running Adrenalin v24.1.1 drivers from 1/23/24? Patch Tuesday is in 2 days. Wondering if an OS update will help.

[EDIT] Nevermind, you are on 31.0.21910.5 (24.1.1).

Ah, I didn't see that you had posted your about:support.

QA Whiteboard: [qa-regression-triage]
See Also: → 1879816

A smaller STR appears to be to open the PBM window and, in the tab, load a YouTube search engine result page (I chose Andrew W.K. for nostalgia reasons), then click into and back out of videos repeatedly. I'll look into adding the tools per the above comments and try again.

Chris, any chance you can attach a video or gif of your STRs above?

Flags: needinfo?(bhood)
Flags: needinfo?(scunnane)

I haven't run into this bug over the past few days, but I wanted to post the raw data from my about:support page in case that's helpful and also note that for me, this has always occurred in a regular window, not a PBM one.

I also tried chutten's STR from comment 32, but unfortunately wasn't able to repro the bug.

chutten was able to capture a profile from during the hang: https://share.firefox.dev/42AaT0s

To clarify my STR a smidge, I:

  1. Opened a private browsing window (though this has been reproduced on non-PBM)
  2. Performed a search on youtube via a quicksearch for "andrew w.k."
  3. Clicked on any of the top three videos
  4. If the load completed and the video began to play visibly, click the back button and retry step 3

I haven't determined the confluence of circumstances that results in a reproduction, alas. Clicking on the same video over and over? Clicking on different ones? Exhausting some internal resource? Still unclear

I'll see if I can reproduce on Nightly.

This may have been caused by bug 1873085

Regressed by: 1873085

Set release status flags based on info from the regressing bug 1873085

:teleter pointed me toward bug 1873056 - :jrmuizel, can you tell if these 2 bugs are related?

Flags: needinfo?(jmuizelaar)

They might be somewhat related.

Flags: needinfo?(jmuizelaar)

Bug 1664063 is another candidate for the regressor

Regressed by: 1664063

Our theory is that we're not handling errors from AcquireSync properly and getting stuck with mHandlingDeviceReset=true

So, the root cause here almost certainly lies in https://bugzilla.mozilla.org/show_bug.cgi?id=1664063.

Before this patch, SyncObjectD3D11Host::Synchronize would return true if any AcquireSync error other than hr == WAIT_TIMEOUT occurred, see the function before the change: https://hg.mozilla.org/mozilla-central/annotate/ce53d4215ca91122e12f63ca196acefc22a8d98c/gfx/layers/d3d11/TextureD3D11.cpp#l1472

Now after that patch, the failure reason is ignored. And we either return false or crash whenever AcquireSync doesn't succeed, see: https://searchfox.org/mozilla-central/source/gfx/layers/d3d11/TextureD3D11.cpp#1552

This means that on devices where some idiopathic error like E_INVALIDARG is returned, we used to continue working as normal, whereas we now decide to call for a device reset here: https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorANGLE.cpp#428

This proceeds to set mHandlingDeviceReset to true. However, we now hit another bug that has probably been dormant for a long time and was exposed here. When we try to reset the devices here: https://searchfox.org/mozilla-central/source/gfx/thebes/DeviceManagerDx.cpp#1053 we check whether any of the devices have been removed, find that they are all fine, and then proceed not to reset anything.

mHandlingDeviceReset now remains true. And we stop rendering anything because of the check here: https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RenderThread.cpp#587

The browser now stops rendering without any subsequent failsafes being triggered.
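The sequence above can be sketched as a small state model. This is a simplified illustration, not the Gecko code: `ModelDevice`, `ModelRenderThread`, `SynchronizeAfterPatch`, `MaybeResetDevices` and `RenderOneFrame` are hypothetical names, and the assumption that a successful reset is what clears the flag is mine.

```cpp
#include <cstdint>

// Simplified, hypothetical model of the stuck state described above.
// None of these types are the real Gecko classes.
using HResult = std::uint32_t;
constexpr HResult kOk = 0x00000000;          // S_OK
constexpr HResult kInvalidArg = 0x80070057;  // E_INVALIDARG

struct ModelDevice {
  bool removed = false;  // stands in for a "has the device been removed?" query
};

struct ModelRenderThread {
  bool handlingDeviceReset = false;  // models mHandlingDeviceReset
  int framesRendered = 0;
};

// Post-patch behaviour as described in this bug: any failed acquire is a failure.
inline bool SynchronizeAfterPatch(HResult acquireResult) {
  return acquireResult == kOk;
}

// Models the device manager step: it only resets when a device reports removal.
// Assumption: a completed reset is what clears the flag again.
inline bool MaybeResetDevices(ModelDevice& device, ModelRenderThread& rt) {
  if (!device.removed) {
    return false;  // devices look fine, so nothing is reset and nothing is cleared
  }
  device.removed = false;
  rt.handlingDeviceReset = false;
  return true;
}

// One frame attempt on the (modelled) render thread.
inline void RenderOneFrame(ModelRenderThread& rt, ModelDevice& device,
                           HResult acquireResult) {
  if (rt.handlingDeviceReset) {
    return;  // rendering is skipped while a reset is believed to be in flight
  }
  if (!SynchronizeAfterPatch(acquireResult)) {
    rt.handlingDeviceReset = true;
    MaybeResetDevices(device, rt);
    return;
  }
  ++rt.framesRendered;
}
```

In this model a single `RenderOneFrame(rt, device, kInvalidArg)` call with a healthy device leaves `handlingDeviceReset` set, and every later call returns early even when the acquire would succeed, which matches the "painting stops, content stays responsive" symptom.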

We can't ship with this regression according to the gfx team, so setting this bug as release blocker and P1.

Priority: -- → P1

As far as I can tell there are a couple of reasonably safe options to bandaid this:

  • Make the call to SyncObjectD3D11Host::Synchronize pass false to the aFallible parameter. This would cause the GPU process to crash in the case of any failure not triggered by device removal and fall back to the usual recovery route.

Pros: Tearing down the GPU process is well tested, no lingering idiopathic errors which are ignored, compact patch
Cons: Introduces a new, untested behavior, will cause a small increase in jank from GPU process recreation, may move more people off of acceleration.

  • Attempt to restore old behavior to SyncObjectD3D11Host::Synchronize by inserting an if (hr != WAIT_TIMEOUT) { return true; }

Pros: Should restore closest to old behavior, compact patch
Cons: Ignores an error (that was ignored before)

  • Set a flag whenever setting mHandlingDeviceReset to true that ensures any subsequent call to DeviceManagerDx::MaybeResetAndReacquireDevices -always- resets the devices regardless of the result of HasDeviceResetLocked

Pros: Ensures fresh devices with minimal disruption to the user, potentially addresses more issues that could arise from this
Cons: More code changes required, more indirect, subtle behavioral changes
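To compare the three options side by side, here is a rough decision-table sketch. It assumes the acquire has returned `hr` and that no device reports removal; `Mitigation`, `Outcome` and `HandleAcquireResult` are hypothetical names for illustration, and the timeout branch of the second option is an assumption that the existing failure path stays unchanged.

```cpp
#include <cstdint>

using HResult = std::uint32_t;
constexpr HResult kOk = 0x00000000;           // S_OK
constexpr HResult kWaitTimeout = 0x00000102;  // WAIT_TIMEOUT
constexpr HResult kInvalidArg = 0x80070057;   // E_INVALIDARG

// The three band-aid options listed above (hypothetical labels).
enum class Mitigation { CrashGpuProcess, IgnoreNonTimeout, ForceDeviceReset };

// What the user-visible result of one failed or successful acquire would be.
enum class Outcome {
  FrameRendered,      // rendering continues as normal
  GpuProcessRestart,  // GPU process is torn down and recreated
  DevicesReset,       // devices are recreated in place
  ResetRequested      // existing failure path: reset requested, may not happen
};

// Assumes no device reports removal when the acquire result arrives.
inline Outcome HandleAcquireResult(Mitigation mitigation, HResult hr) {
  if (hr == kOk) {
    return Outcome::FrameRendered;
  }
  switch (mitigation) {
    case Mitigation::CrashGpuProcess:
      // Option 1: the call is made infallible, so any failure crashes the
      // GPU process and the usual recovery route takes over.
      return Outcome::GpuProcessRestart;
    case Mitigation::IgnoreNonTimeout:
      // Option 2: only a timeout counts as a failure, as before the
      // regressing patch; other errors are ignored.
      return hr == kWaitTimeout ? Outcome::ResetRequested
                                : Outcome::FrameRendered;
    case Mitigation::ForceDeviceReset:
      // Option 3: a pending-reset flag forces the devices to be recreated
      // even though none of them reports removal.
      return Outcome::DevicesReset;
  }
  return Outcome::ResetRequested;  // not reached for valid enum values
}
```

For the E_INVALIDARG case from the failure logs, the three options give a GPU process restart, a rendered frame, and an in-place device reset respectively; none of them leaves the reset flag set with nothing to clear it.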

Assignee: nobody → bas
Status: NEW → ASSIGNED

Not sure if this will help with fixing, but I seem to get this bug more if I have a video playing in one Firefox Nightly window and a separate Firefox window snapped beside it.

Other times it has broken my workflow when I have had a few too many tabs open with content playing in the background.

Could this issue be caused by WEBRENDER or ACCELERATED_CANVAS2D?

Here are also the failure logs from about:support -

Failure Log
(#0) GP+[GFX1-]: SyncObjectD3D11Host::Synchronize ReleaseSync failed 0x80070057
(#20) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#21) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#22) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#23) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#24) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#25) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#26) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#27) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#28) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#29) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#30) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#31) CP+[GFX1-]: CompositorBridgeChild receives IPC close with reason=AbnormalShutdown
(#32) CP+[GFX1-]: Attempt to render into a Canvas2d after shutdown.
(#33) CP+[GFX1-]: Attempt to render into a Canvas2d after shutdown.
(#34) CP+[GFX1-]: Attempt to render into a Canvas2d after shutdown.

No longer regressed by: 1873085

This reverts commit e25a5f344af32bdd689500bae7b4f24f205ba9f0.

We believe bug 1664063 was causing us to hit some broken device reset
handling code.

Flags: needinfo?(sotaro.ikeda.g)
See Also: → 1879953
Pushed by jmuizelaar@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6167c735a157
Revert "Bug 1664063 - Ensure we consistently handle all errors from IDXGIKeyedMutex::AcquireSync.".

(In reply to Chris H-C :chutten from comment #37)

Created attachment 9379774 [details]
chutten's about:support

To clarify my STR a smidge, I:

  1. Opened a private browsing window (though this has been reproduced on non-PBM)
  2. Performed a search on youtube via a quicksearch for "andrew w.k."
  3. Clicked on any of the top three videos
  4. If the load completed and the video began to play visibly, click the back button and retry step 3

I haven't determined the confluence of circumstances that results in a reproduction, alas. Clicking on the same video over and over? Clicking on different ones? Exhausting some internal resource? Still unclear

I'll see if I can reproduce on Nightly.

Just playing back a YouTube video did not trigger the SyncObjectD3D11Host::Synchronize() call. Steps 2) to 4) of the STR did trigger the Synchronize() call. Step 1) might not be necessary for the problem.

Depends on: 1880011

Bug 1880011 could reduce the frequency of SyncObjectD3D11Host::Synchronize() calls, though it does not affect the STR of comment 37.

The SyncObjectD3D11Host::Synchronize() call in the STR of comment 37 could be removed in the default setting, since by default the hardware video decoder uses the compositor device as its D3D11Device in the GPU process. In this case, mSyncObject->Synchronize() in RenderCompositorANGLE::BeginFrame() is not necessary.
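The condition described here boils down to a device-identity check. A minimal sketch, assuming the decision can be made from the two device pointers alone; `ModelDevice` and `NeedsCrossDeviceSync` are hypothetical names, not functions from the Gecko tree:

```cpp
// Hypothetical stand-in for a D3D11 device handle.
struct ModelDevice {
  int id;
};

// Cross-device synchronization is only needed when a producer (here, the
// video decoder) writes textures on a device other than the compositor's.
inline bool NeedsCrossDeviceSync(const ModelDevice* decoderDevice,
                                 const ModelDevice* compositorDevice) {
  return decoderDevice != nullptr && decoderDevice != compositorDevice;
}
```

With the default setup described in the comment, both pointers refer to the same device, so the check returns false and the per-frame sync could be skipped.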

Created Bug 1880016 for it.

Depends on: 1880016
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch

Here's a crash report from the GPU process when this was happening: https://crash-stats.mozilla.org/report/index/d6dd9052-84b5-41c7-8c12-370600240212

The critical error contains the following:

|[G0][GFX1-]: SyncObjectD3D11Host::Synchronize ReleaseSync failed 0x80070057 (t=16743.3) 
|[G1][GFX1-]: SyncObjectD3D11Host::Synchronize AcquireSync failed 0x887a0001 (t=16743.4) 
|[G2][GFX1-]: GFX: D3D11 failure on the D3D11 sync lock. (t=16743.4) 
|[G3][GFX1-]: GFX: RenderThread detected a device reset in SyncObject (t=16743.4) 
Duplicate of this bug: 1877515

We were unable to reproduce the issue on our systems, therefore we are unable to confirm the fix on the latest builds. We have tried on several systems (AMD and Intel), both on Windows 10 and 11, natively installed and on VMs but had no luck in reproducing the issue even once. Chris, if possible, could you verify that the issue is fixed for you in Firefox 123.0 and latest Nightly? Thanks!

Flags: needinfo?(chutten)

I did try latest Nightly and couldn't reproduce it, but it's been elusive so that's not conclusive.

Flags: needinfo?(chutten)
See Also: → 1878555
Flags: needinfo?(aosmond)