Closed Bug 1718210 Opened 4 years ago Closed 4 years ago

GPU process shutdown hang during testing on Windows / Debug builds

Categories

(Core :: Graphics, defect)

defect

Tracking

()

RESOLVED FIXED
92 Branch
Tracking Status
firefox92 --- fixed

People

(Reporter: gerard-majax, Assigned: gerard-majax)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

https://treeherder.mozilla.org/logviewer?job_id=343827996&repo=try&lineNumber=2192

From what I could debug a few weeks ago when reproducing while working on bug 1651133, it was some CompositorThreadHolder reference that was being taken more than it was released.

Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb

But from bug 1651133 it's certain there is something going on.

(In reply to Alexandre LISSY :gerard-majax from comment #1)

Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb

But from bug 1651133 it's certain there is something going on.

It seems I re-repro with MOZ_CHAOSMODE=0 :)

(In reply to Alexandre LISSY :gerard-majax from comment #3)

(In reply to Alexandre LISSY :gerard-majax from comment #1)

Checking current status on our CI, since I can't repro locally on my VM: https://treeherder.mozilla.org/jobs?repo=try&revision=8a00c49357b7044c2eb53bd49c64f53bbd744ccb

But from bug 1651133 it's certain there is something going on.

It seems I re-repro with MOZ_CHAOSMODE=0 :)

More precisely, MOZ_CHAOSMODE=0x4: https://searchfox.org/mozilla-central/rev/352b525ab841278cd9b3098343f655ef85933544/mfbt/ChaosMode.h#25

When it passes and when it hangs:

When it passes:

  • We hit VideoBridgeParent::ActorDestroy() twice on GPU process, Compositor thread

When it hangs:

VideoBridgeParent::ActorDealloc() is in charge of handing back mCompositorThreadHolder reference: https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/gfx/layers/ipc/VideoBridgeParent.cpp#93

According to further debug, when it hangs, I can see that CompositorThreadHolder::Stop() is called twice in a row, specifically before VideoBridgeChild::Shutdown() call for RDD process is done. Hence, CompositorThreadHolder::~CompositorThreadHolder() destructor is only called once upon hang case, while it is called twice on successful execution.

Oh, and connecting with Visual Studio to the GPU process shows we are blocked in: https://searchfox.org/mozilla-central/rev/59e797b66f5ce8a27ede0e7677688931be7aed20/gfx/layers/ipc/CompositorThread.cpp#141-144, which is why tracking CompositorThreadHolder::~CompositorThreadHolder() above is important.

So here is a patch to force to release the compositor thread eariler when we are going to shutdown video bridge from RDD. Although I didn't see any crash, I also don't see any crash on try if I push without this patch. So I'm not sure if that really helps or not, would you mind to try that on your side? Thank you.

Flags: needinfo?(lissyx+mozillians)

(In reply to Alastor Wu [:alwu] from comment #8)

So here is a patch to force to release the compositor thread eariler when we are going to shutdown video bridge from RDD. Although I didn't see any crash, I also don't see any crash on try if I push without this patch. So I'm not sure if that really helps or not, would you mind to try that on your side? Thank you.

It does help, but it introduces some leak: https://treeherder.mozilla.org/logviewer?job_id=346439465&repo=try&lineNumber=2293

On an intermittent basis, I am seeing at least locally instances of assertion failure: https://searchfox.org/mozilla-central/rev/072710086ddfe25aa2962c8399fefb2304e8193b/xpcom/threads/ThreadEventTarget.cpp#59

Flags: needinfo?(lissyx+mozillians)

After some debugging, I found that the way that using the pref to prevent GPU process from creating video bridge is actually incorrect, because we would still need to share the decoded video data from RDD to GPU surface for rendering (I hit one assertion on the try, and the rendering pipeline would still need to request for the texture in GPU process, which would be done by VideoBridgeParent)

However, I also found that simply turning off the pref is already enough for solving the hang/leaking issue on the try server so I think that might be the easiest way for this issue, because that only happens on those test cases, rather than the real scenario. Although I don't really understand why simply turning off the pref can achieve that, because we would still create the video bridge in the GPU process. My best guess is doing that affects other graphic parts inside GPU process which magically works around this issue?!

Assignee: nobody → lissyx+mozillians
Status: NEW → ASSIGNED

Depends on D121045

(In reply to Alastor Wu [:alwu] from comment #10)

After some debugging, I found that the way that using the pref to prevent GPU process from creating video bridge is actually incorrect, because we would still need to share the decoded video data from RDD to GPU surface for rendering (I hit one assertion on the try, and the rendering pipeline would still need to request for the texture in GPU process, which would be done by VideoBridgeParent)

However, I also found that simply turning off the pref is already enough for solving the hang/leaking issue on the try server so I think that might be the easiest way for this issue, because that only happens on those test cases, rather than the real scenario. Although I don't really understand why simply turning off the pref can achieve that, because we would still create the video bridge in the GPU process. My best guess is doing that affects other graphic parts inside GPU process which magically works around this issue?!

No, unfortunately, the pref is not enough, as I verified locally already. Your try link is likely missing removal of the test manifest skip, as shown in https://treeherder.mozilla.org/logviewer?job_id=346552669&repo=try&lineNumber=1785, the browser_sandbox_test.js test is not run, it's skipped.

Attachment #9233441 - Attachment description: Bug 1718210 - force to release the compositor thread earlier in order to prevent a destruction racing between GPUParent and VideoBridgeParent. r?alwu → Bug 1718210 - Release the compositor thread earlier r?mattwoodrow
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7f327f66ea52 Release the compositor thread earlier r=mattwoodrow https://hg.mozilla.org/integration/autoland/rev/822082abb694 Enable SandboxTest on Windows/Debug r=gcp
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch
Blocks: 1723098
See Also: → 1730374
Regressions: 1782219
Blocks: 1840557
See Also: → 1875095
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: