GPU process shutdown hang during testing on Windows / Debug builds
Categories
(Core :: Graphics, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox92 | --- | fixed |
People
(Reporter: gerard-majax, Assigned: gerard-majax)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
https://treeherder.mozilla.org/logviewer?job_id=343827996&repo=try&lineNumber=2192
From what I could debug a few weeks ago when reproducing while working on bug 1651133, it was some CompositorThreadHolder
reference that was being taken more than it was released.
Assignee | ||
Comment 1•4 years ago
|
||
Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb
But from bug 1651133 it's certain there is something going on.
Assignee | ||
Comment 2•4 years ago
|
||
Assignee | ||
Comment 3•4 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #1)
Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb
But from bug 1651133 it's certain there is something going on.
It seems I re-repro with MOZ_CHAOSMODE=0
:)
Assignee | ||
Comment 4•4 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #3)
(In reply to Alexandre LISSY :gerard-majax from comment #1)
Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb
But from bug 1651133 it's certain there is something going on.
It seems I re-repro with
MOZ_CHAOSMODE=0
:)
More precisely, MOZ_CHAOSMODE=0x4
: https://searchfox.org/mozilla-central/rev/352b525ab841278cd9b3098343f655ef85933544/mfbt/ChaosMode.h#25
Assignee | ||
Comment 5•4 years ago
|
||
When it passes and when it hangs:
- We call
VideoBridgeChild::Shutdown()
at https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/dom/media/ipc/RemoteDecoderManagerParent.cpp#97 on bothGPU
andRDD
processes - We call
sVideoBridge->Close()
at https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/gfx/layers/ipc/VideoBridgeChild.cpp#46 on bothGPU
andRDD
processes
When it passes:
- We hit
VideoBridgeParent::ActorDestroy()
twice onGPU
process,Compositor
thread
When it hangs:
- We hit
VideoBridgeParent::ActorDestroy()
only once onGPU
process,Compositor
thread - During
VideoBridgeChild::Shutdown()
call tosVideoBridge->Close()
we get warnings from https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/ipc/glue/NodeController.cpp#293-299 aboutGOODBYE_MESSAGE
andEVENT_MESSAGE
fromRDD
process - We don't hit a second time
VideoBridgeParent::ActorDestroy()
Assignee | ||
Comment 6•4 years ago
|
||
VideoBridgeParent::ActorDealloc()
is in charge of handing back mCompositorThreadHolder
reference: https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/gfx/layers/ipc/VideoBridgeParent.cpp#93
According to further debug, when it hangs, I can see that CompositorThreadHolder::Stop()
is called twice in a row, specifically before VideoBridgeChild::Shutdown()
call for RDD
process is done. Hence, CompositorThreadHolder::~CompositorThreadHolder()
destructor is only called once upon hang case, while it is called twice on successful execution.
Assignee | ||
Comment 7•4 years ago
|
||
Oh, and connecting with Visual Studio to the GPU process shows we are blocked in: https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/gfx/layers/ipc/CompositorThread.cpp#141-144, which is why tracking CompositorThreadHolder::~CompositorThreadHolder()
above is important.
Comment 8•4 years ago
|
||
So here is a patch to force to release the compositor thread eariler when we are going to shutdown video bridge from RDD. Although I didn't see any crash, I also don't see any crash on try if I push without this patch. So I'm not sure if that really helps or not, would you mind to try that on your side? Thank you.
Assignee | ||
Comment 9•4 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #8)
So here is a patch to force to release the compositor thread eariler when we are going to shutdown video bridge from RDD. Although I didn't see any crash, I also don't see any crash on try if I push without this patch. So I'm not sure if that really helps or not, would you mind to try that on your side? Thank you.
It does help, but it introduces some leak: https://treeherder.mozilla.org/logviewer?job_id=346439465&repo=try&lineNumber=2293
On an intermittent basis, I am seeing at least locally instances of assertion failure: https://searchfox.org/mozilla-central/rev/072710086ddfe25aa2962c8399fefb2304e8193b/xpcom/threads/ThreadEventTarget.cpp#59
Comment 10•4 years ago
•
|
||
After some debugging, I found that the way that using the pref to prevent GPU process from creating video bridge is actually incorrect, because we would still need to share the decoded video data from RDD to GPU surface for rendering (I hit one assertion on the try, and the rendering pipeline would still need to request for the texture in GPU process, which would be done by VideoBridgeParent)
However, I also found that simply turning off the pref is already enough for solving the hang/leaking issue on the try server so I think that might be the easiest way for this issue, because that only happens on those test cases, rather than the real scenario. Although I don't really understand why simply turning off the pref can achieve that, because we would still create the video bridge in the GPU process. My best guess is doing that affects other graphic parts inside GPU process which magically works around this issue?!
Assignee | ||
Comment 11•4 years ago
|
||
Updated•4 years ago
|
Assignee | ||
Comment 12•4 years ago
|
||
Depends on D121045
Assignee | ||
Comment 13•4 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #10)
After some debugging, I found that the way that using the pref to prevent GPU process from creating video bridge is actually incorrect, because we would still need to share the decoded video data from RDD to GPU surface for rendering (I hit one assertion on the try, and the rendering pipeline would still need to request for the texture in GPU process, which would be done by VideoBridgeParent)
However, I also found that simply turning off the pref is already enough for solving the hang/leaking issue on the try server so I think that might be the easiest way for this issue, because that only happens on those test cases, rather than the real scenario. Although I don't really understand why simply turning off the pref can achieve that, because we would still create the video bridge in the GPU process. My best guess is doing that affects other graphic parts inside GPU process which magically works around this issue?!
No, unfortunately, the pref is not enough, as I verified locally already. Your try link is likely missing removal of the test manifest skip, as shown in https://treeherder.mozilla.org/logviewer?job_id=346552669&repo=try&lineNumber=1785, the browser_sandbox_test.js
test is not run, it's skipped.
Updated•4 years ago
|
Comment 14•4 years ago
|
||
Updated•4 years ago
|
Comment 15•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/7f327f66ea52
https://hg.mozilla.org/mozilla-central/rev/822082abb694
Description
•