Closed
Bug 1331548
Opened 8 years ago
Closed 8 years ago
Poor recovery from killing the GPU process
Categories
(Core :: Graphics: Layers, defect, P1)
Core
Graphics: Layers
Tracking
()
VERIFIED
FIXED
mozilla54
People
(Reporter: caspy77, Assigned: dvander)
Details
(Whiteboard: [gfx-noted])
Attachments
(4 files)
1.59 MB,
video/mp4
|
Details | |
1.57 KB,
patch
|
mattwoodrow
:
review+
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
6.76 KB,
patch
|
mattwoodrow
:
review+
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
948 bytes,
patch
|
mattwoodrow
:
review+
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
I read a post about the new GPU process, so I updated my Nightly install and tested in a fresh profile.
I tried playing an HD MP4 Youtube video and killing the process from about:config and the tab went black while the sound continued. Later I tried and it went white (with sound continuing) and iirc only switching tabs fixed the issue. Since then I've had the same behavior except that moving the mouse over any part of the page content fixed the white.
I had one instance where things paused a bit and continued.
I've attached a video showing the typical behavior.
Oops, almost forgot graphics section from about:config
===========
Graphics
Features
Compositing Direct3D 11
Asynchronous Pan/Zoom wheel input enabled; touch input enabled
WebGL Renderer Google Inc. -- ANGLE (Intel(R) HD Graphics 520 Direct3D11 vs_5_0 ps_5_0)
WebGL2 Renderer Google Inc. -- ANGLE (Intel(R) HD Graphics 520 Direct3D11 vs_5_0 ps_5_0)
Hardware H264 Decoding Yes; Using D3D11 API
Audio Backend wasapi
Direct2D true
DirectWrite true (10.0.14393.351)
GPU #1
Active Yes
Description Intel(R) HD Graphics 520
Vendor ID 0x8086
Device ID 0x1916
Driver Version 20.19.15.4483
Driver Date 7-1-2016
Drivers igdumdim64 igd10iumd64 igd10iumd64 igd12umd64 igdumdim32 igd10iumd32 igd10iumd32 igd12umd32
Subsys ID 073d1028
RAM Unknown
Diagnostics
AzureCanvasAccelerated 0
AzureCanvasBackend direct2d 1.1
AzureContentBackend direct2d 1.1
AzureFallbackCanvasBackend cairo
Decision Log
D3D9_COMPOSITING
disabled by default: Disabled by default
Crash Guard Disabled Features
D3D9 Video Decoder
Comment 2•8 years ago
|
||
Apparently disabling the GPU process decoder doesn't help, so this sounds like a generic GPU process issue.
Want to look into this dvander, or do you want me to?
Flags: needinfo?(dvander)
Assignee | ||
Comment 3•8 years ago
|
||
Sure, I can look into this. Caspy7, can you attach the video? Also, are you able to reproduce this?
Flags: needinfo?(dvander) → needinfo?(caspy77)
Sorry, apparently the file was too big and if it threw an error I missed it.
This should do it.
Flags: needinfo?(caspy77)
And yes, I can reproduce easily.
As I said before, it can be a bit inconsistent, but most of the time it fails in some way, and usually in the manner shown in the video.
Updated•8 years ago
|
Priority: -- → P1
Whiteboard: [gfx-noted]
Assignee | ||
Comment 6•8 years ago
|
||
This appears to be a timing issue. Sometimes the first paint occurs before media sends a new video frame, so we end up sending an OpAttachAsyncCompositable with a stale CompositableHandle. This causes the transaction to abort, leaving the compositor side of the layer tree in a half-completed state. Eventually everything starts drawing right but there can be any number of botched frames.
Assignee | ||
Comment 7•8 years ago
|
||
I think it will always be possible via some race between media and layout to get a bad async compositable in the pipeline. So we might as well just make it a soft error on the compositor side, versus have permanently corrupt layer state.
Attachment #8832369 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 8•8 years ago
|
||
One of the races between media and layout is easily fixed. If media decodes a frame before the main thread receives a ReinitCompositor message, then it will attempt to create a new ImageClient on a disconnected ImageBridgeChild. This fails, and EnsureImageClient() then permanently thinks the container is synchronous, and never accepts any more images.
This patch makes EnsureImageClient resume taking async images as soon as ImageBridgeChild is back online. It also stops us from sending stale CompositableHandles.
(Addendum: As I wrote this, I realized we could tear the 64-bit value on 32-bit systems, since it resets on the video thread and is read on the main thread. It's hard to imagine this mattering though? And it would still be fundamentally racy with a mutex.)
Attachment #8832373 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 9•8 years ago
|
||
There is one more bug here: sometimes we just never repaint. It looks like we can fail to revoke a paint and the refresh driver gets stuck. It is very hard to reproduce but it definitely happens. Refreshing the page or resizing the window unsticks the refresh driver, and sometimes it just resumes after a timeout.
Updated•8 years ago
|
Attachment #8832369 -
Flags: review?(matt.woodrow) → review+
Updated•8 years ago
|
Attachment #8832373 -
Flags: review?(matt.woodrow) → review+
Assignee | ||
Comment 10•8 years ago
|
||
We weren't guaranteed to call LayerManager::Destroy when shutting down content ClientLayerManagers. If we were waiting for a transaction when the GPU process reset, the refresh driver would become stuck forever waiting.
Shutting down the old layer manager ensures we revoke the transaction.
Attachment #8832682 -
Flags: review?(matt.woodrow)
Updated•8 years ago
|
Attachment #8832682 -
Flags: review?(matt.woodrow) → review+
Comment 11•8 years ago
|
||
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0a0c6707b1c6
Don't let stale async compositables break main-thread paint transactions. (bug 1331548 part 1, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/10b33dedede2
Keep trying async video until PImageBridge is alive again. (bug 1331548 part 2, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/29c5eb7dcd7d
Make sure the last transaction is revoked when resetting content layer managers. (bug 1331548 part 3, r=mattwoodrow)
Comment 12•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0a0c6707b1c6
https://hg.mozilla.org/mozilla-central/rev/10b33dedede2
https://hg.mozilla.org/mozilla-central/rev/29c5eb7dcd7d
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
No longer depends on: 1159532
Assignee | ||
Comment 13•8 years ago
|
||
Comment on attachment 8832369 [details] [diff] [review]
part 1, ignore bad async compositables
Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: broken painting when GPU process dies while playing video
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: No.
[Why is the change risky/not risky?]: previously we would abort painting if we got an invalid command, which leaves the rest of the page unpainted. This simply ignores the broken command, which is harmless.
[String changes made/needed]:
Attachment #8832369 -
Flags: approval-mozilla-beta?
Assignee | ||
Comment 14•8 years ago
|
||
Comment on attachment 8832373 [details] [diff] [review]
part 2, resume video
Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: video stops playing if the GPU process dies
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: Very unlikely
[Why is the change risky/not risky?]: Previously, if video could not decode due to the GPU process dying, it would enter a permanently broken state. This ensures the decoder will keep retrying until Firefox is painting properly again.
[String changes made/needed]:
Attachment #8832373 -
Flags: approval-mozilla-beta?
Assignee | ||
Comment 15•8 years ago
|
||
Comment on attachment 8832682 [details] [diff] [review]
part 3, fix transactions
Approval Request Comment
[Feature/Bug causing the regression]: GPU Process
[User impact if declined]: Firefox might stop painting if the GPU process dies
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]:
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: The only way to enter this state is if the GPU process crashes. When that happens, we are just clearing the old painting state sooner than we did previously, to ensure that we paint again.
[String changes made/needed]:
Attachment #8832682 -
Flags: approval-mozilla-beta?
Updated•8 years ago
|
status-firefox53:
--- → affected
Comment 16•8 years ago
|
||
Comment on attachment 8832369 [details] [diff] [review]
part 1, ignore bad async compositables
Cleanup for behavior after the process crashes. It is late in 53 for uplift, and we don't have a lot of time to evaluate the results, but let's give it a try.
Attachment #8832369 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Updated•8 years ago
|
Attachment #8832373 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Updated•8 years ago
|
Attachment #8832682 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 17•8 years ago
|
||
bugherder uplift |
Comment 18•8 years ago
|
||
I reproduced this issue using Fx 53.0b7 on Windows 10 x64.
This issue is fixed. I verified using Fx 54.0a2 (build ID: 20170404004003) and Fx 53.0b9 on Windows 10 x64.
You need to log in
before you can comment on or make changes to this bug.
Description
•