Closed Bug 1331548 Opened 5 years ago Closed 5 years ago

Poor recovery from killing the GPU process

Categories

(Core :: Graphics: Layers, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla54
Tracking Status
firefox53 --- verified
firefox54 --- verified

People

(Reporter: caspy77, Assigned: dvander)

Details

(Whiteboard: [gfx-noted])

Attachments

(4 files)

I read a post about the new GPU process, so I updated my Nightly install and tested in a fresh profile.

I tried playing an HD MP4 Youtube video and killing the process from about:config and the tab went black while the sound continued. Later I tried and it went white (with sound continuing) and iirc only switching tabs fixed the issue. Since then I've had the same behavior except that moving the mouse over any part of the page content fixed the white.

I had one instance where things paused a bit and continued.

I've attached a video showing the typical behavior.
Oops, almost forgot graphics section from about:config

===========
Graphics
Features
Compositing	Direct3D 11
Asynchronous Pan/Zoom	wheel input enabled; touch input enabled
WebGL Renderer	Google Inc. -- ANGLE (Intel(R) HD Graphics 520 Direct3D11 vs_5_0 ps_5_0)
WebGL2 Renderer	Google Inc. -- ANGLE (Intel(R) HD Graphics 520 Direct3D11 vs_5_0 ps_5_0)
Hardware H264 Decoding	Yes; Using D3D11 API
Audio Backend	wasapi
Direct2D	true
DirectWrite	true (10.0.14393.351)
GPU #1
Active	Yes
Description	Intel(R) HD Graphics 520
Vendor ID	0x8086
Device ID	0x1916
Driver Version	20.19.15.4483
Driver Date	7-1-2016
Drivers	igdumdim64 igd10iumd64 igd10iumd64 igd12umd64 igdumdim32 igd10iumd32 igd10iumd32 igd12umd32
Subsys ID	073d1028
RAM	Unknown
Diagnostics
AzureCanvasAccelerated	0
AzureCanvasBackend	direct2d 1.1
AzureContentBackend	direct2d 1.1
AzureFallbackCanvasBackend	cairo
Decision Log
D3D9_COMPOSITING	
disabled by default: Disabled by default
Crash Guard Disabled Features
D3D9 Video Decoder
Apparently disabling the GPU process decoder doesn't help, so this sounds like a generic GPU process issue.

Want to look into this dvander, or do you want me to?
Flags: needinfo?(dvander)
Sure, I can look into this. Caspy7, can you attach the video? Also, are you able to reproduce this?
Flags: needinfo?(dvander) → needinfo?(caspy77)
Attached video Kill GPU process Demo
Sorry, apparently the file was too big and if it threw an error I missed it.
This should do it.
Flags: needinfo?(caspy77)
And yes, I can reproduce easily.

As I said before, it can be a bit inconsistent, but most of the time it fails in some way, and usually in the manner shown in the video.
Priority: -- → P1
Whiteboard: [gfx-noted]
This appears to be a timing issue. Sometimes the first paint occurs before media sends a new video frame, so we end up sending an OpAttachAsyncCompositable with a stale CompositableHandle. This causes the transaction to abort, leaving the compositor side of the layer tree in a half-completed state. Eventually everything starts drawing right but there can be any number of botched frames.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
No longer depends on: 1331761
I think it will always be possible via some race between media and layout to get a bad async compositable in the pipeline. So we might as well just make it a soft error on the compositor side, versus have permanently corrupt layer state.
Attachment #8832369 - Flags: review?(matt.woodrow)
One of the races between media and layout is easily fixed. If media decodes a frame before the main thread receives a ReinitCompositor message, then it will attempt to create a new ImageClient on a disconnected ImageBridgeChild. This fails, and EnsureImageClient() then permanently thinks the container is synchronous, and never accepts any more images.

This patch makes EnsureImageClient resume taking async images as soon as ImageBridgeChild is back online. It also stops us from sending stale CompositableHandles.

(Addendum: As I wrote this, I realized we could tear the 64-bit value on 32-bit systems, since it resets on the video thread and is read on the main thread. It's hard to imagine this mattering though? And it would still be fundamentally racy with a mutex.)
Attachment #8832373 - Flags: review?(matt.woodrow)
There is one more bug here: sometimes we just never repaint. It looks like we can fail to revoke a paint and the refresh driver gets stuck. It is very hard to reproduce but it definitely happens. Refreshing the page or resizing the window unsticks the refresh driver, and sometimes it just resumes after a timeout.
Attachment #8832369 - Flags: review?(matt.woodrow) → review+
Attachment #8832373 - Flags: review?(matt.woodrow) → review+
We weren't guaranteed to call LayerManager::Destroy when shutting down content ClientLayerManagers. If we were waiting for a transaction when the GPU process reset, the refresh driver would become stuck forever waiting.

Shutting down the old layer manager ensures we revoke the transaction.
Attachment #8832682 - Flags: review?(matt.woodrow)
Attachment #8832682 - Flags: review?(matt.woodrow) → review+
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0a0c6707b1c6
Don't let stale async compositables break main-thread paint transactions. (bug 1331548 part 1, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/10b33dedede2
Keep trying async video until PImageBridge is alive again. (bug 1331548 part 2, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/29c5eb7dcd7d
Make sure the last transaction is revoked when resetting content layer managers. (bug 1331548 part 3, r=mattwoodrow)
https://hg.mozilla.org/mozilla-central/rev/0a0c6707b1c6
https://hg.mozilla.org/mozilla-central/rev/10b33dedede2
https://hg.mozilla.org/mozilla-central/rev/29c5eb7dcd7d
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
No longer depends on: 1159532
Comment on attachment 8832369 [details] [diff] [review]
part 1, ignore bad async compositables

Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: broken painting when GPU process dies while playing video
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: No.
[Why is the change risky/not risky?]: previously we would abort painting if we got an invalid command, which leaves the rest of the page unpainted. This simply ignores the broken command, which is harmless.
[String changes made/needed]:
Attachment #8832369 - Flags: approval-mozilla-beta?
Comment on attachment 8832373 [details] [diff] [review]
part 2, resume video

Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: video stops playing if the GPU process dies
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: Very unlikely
[Why is the change risky/not risky?]: Previously, if video could not decode due to the GPU process dying, it would enter a permanently broken state. This ensures the decoder will keep retrying until Firefox is painting properly again.
[String changes made/needed]:
Attachment #8832373 - Flags: approval-mozilla-beta?
Comment on attachment 8832682 [details] [diff] [review]
part 3, fix transactions

Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: Firefox might stop painting if the GPU process dies
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]: 
[Is the change risky?]: No
[Why is the change risky/not risky?]: The only way to enter this state is if the GPU process crashes. When that happens, we are just clearing the old painting state sooner than we did previously, to ensure that we paint again.
[String changes made/needed]:
Attachment #8832682 - Flags: approval-mozilla-beta?
Comment on attachment 8832369 [details] [diff] [review]
part 1, ignore bad async compositables

Cleanup for behavior after the process crashes. It is late in 53 for uplift, and we don't have a lot of time to evaluate the results, but let's give it a try.
Attachment #8832369 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8832373 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8832682 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I reproduced this issue using Fx 53.0b7 on Windows 10 x64.
This issue is fixed. I verified using Fx 54.0a2 (build ID: 20170404004003) and Fx 53.0b9 on Windows 10 x64.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.