Closed Bug 947227 Opened 6 years ago Closed 6 years ago

Hang in mozilla::gfx::SurfaceStream_TripleBuffer_Async::WaitForCompositor()

Categories

(Core :: Canvas: WebGL, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla30
blocking-b2g 1.3+
Tracking Status
firefox28 --- wontfix
firefox29 + wontfix
firefox30 --- wontfix
firefox31 --- fixed
b2g-v1.3 --- wontfix
b2g-v1.3T --- wontfix
b2g-v1.4 --- wontfix
b2g-v2.0 --- fixed

People

(Reporter: BenWa, Assigned: mattwoodrow)

References

Details

Attachments

(4 files)

Attached file hang samples
See attachment for where the threads are during the hang. The main thread is waiting for the compositor but the compositor is waiting for events. So we're waiting for mStaging. Either the compositor doesn't signal mMonitor or the compositor never clears it.
Flagging as tracking. Hangs are bad and I triggered this a few times while doing a lot of webgl.
I just got this again with the webgl app in the background.
Duplicate of this bug: 944823
Bug 944823 has stacks of all threads. Happens on linux with GL layers, too.
OS: Mac OS X → All
Hardware: x86 → All
The SurfaceStream has a frame that needs compositing in mStaging, but the compositor is not working on that. It looks like maybe the layer transaction didn't happen, for whatever reason.
Is this happening with all WebGL apps or just some? Trying to better understand how big of an impact this will have on our users.
Flags: needinfo?(snorp)
(In reply to Benjamin Kerensa [:bkerensa] from comment #6)
> Is this happening with all WebGL apps or just some? Trying to better
> understand how big of an impact this will have on our users.

It doesn't look specific at all. I just got it during webgl mochitest in a try run.
Flags: needinfo?(snorp)
We will track this since the impact on WebGL apps is high.
I just hit this twice in the last two days.

I think it occurs only when WebGL is running in a background tab.

I guess the canvas is still producing frames, but the compositing isn't reading them since it's not displaying that tab. Or something along those lines.

I guess ideally we'd make sure this never happens, but maybe we should add a time limit for waiting on the compositor to prevent hanging.
For me it mostly happens when I had weblg in a background tab and tried to switch back to the tab. I received a hang before the first frame of the tab was shown on the screen.

This happened a lot while developing a small webgl app, say every 10-15 mins as I would switch tab to look up documentation.
Attached patch Don't hangSplinter Review
What do you think?
Attachment #8377276 - Flags: review?(jgilbert)
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Review of attachment 8377276 [details] [diff] [review]:
-----------------------------------------------------------------

It looks like this should work. Its caller seems to handle it fine. But, doesn't this code get executed on the main thread? Won't we be blocking that?
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Maybe this is ok for now, but we should really get on bug 854421 for a proper fix.
Attachment #8377276 - Flags: feedback+
(In reply to Jeff Gilbert [:jgilbert] from comment #12)
> Comment on attachment 8377276 [details] [diff] [review]
> Don't hang
> 
> Review of attachment 8377276 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It looks like this should work. Its caller seems to handle it fine. But,
> doesn't this code get executed on the main thread? Won't we be blocking that?

Yes, this happens on the main thread. But this bug is about where we get in a bad state and block the main thread *forever*. This patch just caps the hang to 100ms and makes us throw a frame away instead (which seems like a huge improvement in experience).

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #13)
> Comment on attachment 8377276 [details] [diff] [review]
> Don't hang
> 
> Maybe this is ok for now, but we should really get on bug 854421 for a
> proper fix.

We should, but It doesn't really affect this bug. This is about us hanging forever, not just waiting for the compositor to catch up.
(In reply to Matt Woodrow (:mattwoodrow) from comment #14)
> (In reply to Jeff Gilbert [:jgilbert] from comment #12)
> > Comment on attachment 8377276 [details] [diff] [review]
> > Don't hang
> > 
> > Review of attachment 8377276 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > It looks like this should work. Its caller seems to handle it fine. But,
> > doesn't this code get executed on the main thread? Won't we be blocking that?
> 
> Yes, this happens on the main thread. But this bug is about where we get in
> a bad state and block the main thread *forever*. This patch just caps the
> hang to 100ms and makes us throw a frame away instead (which seems like a
> huge improvement in experience).
> 
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #13)
> > Comment on attachment 8377276 [details] [diff] [review]
> > Don't hang
> > 
> > Maybe this is ok for now, but we should really get on bug 854421 for a
> > proper fix.
> 
> We should, but It doesn't really affect this bug. This is about us hanging
> forever, not just waiting for the compositor to catch up.

We should take the extra day or so to figure out how to actually solve this, rather than band-aiding it with something this kludgey. The simplest fix is actually just to use the old TripleBuffer SurfStream, which doesn't bother with this. Is there a reason we don't want to do that?
Well then we won't have anything stopping us from overproducing. But yes, maybe that is the right thing to do, and we can just prevent overcompositing with bug 854421.
Attachment #8377276 - Attachment is obsolete: true
Attachment #8377276 - Flags: review?(jgilbert)
Attachment #8378724 - Flags: review?(jgilbert)
Nothing uses this now, any reason to keep it around?
Attachment #8378725 - Flags: review?(jgilbert)
Attachment #8378724 - Flags: review?(jgilbert) → review+
Attachment #8378725 - Flags: review?(snorp)
Attachment #8378725 - Flags: review?(jgilbert)
Attachment #8378725 - Flags: review+
Let's check with snorp.
Comment on attachment 8378725 [details] [diff] [review]
Remove Async surface stream

Review of attachment 8378725 [details] [diff] [review]:
-----------------------------------------------------------------

You can remove SurfaceStream_TripleBuffer::WaitForCompositor() too
Attachment #8378725 - Flags: review?(snorp) → review+
https://hg.mozilla.org/mozilla-central/rev/6db7636b8b19
https://hg.mozilla.org/mozilla-central/rev/42996f3b6b75
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Matt, could you fill the uplift request to make sure it is in 29? Thanks
Flags: needinfo?(matt.woodrow)
Assignee: nobody → matt.woodrow
Flags: needinfo?(matt.woodrow)
Comment on attachment 8378724 [details] [diff] [review]
Stop using the Async surface stream

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unsure
User impact if declined: Hangs when using WebGL in a background tab
Testing completed (on m-c, etc.): Been on m-c for a week.
Risk to taking this patch (and alternatives if risky): Low risk, just stops us waiting on the compositor.
String or IDL/UUID changes made by this patch: None
Attachment #8378724 - Flags: approval-mozilla-aurora?
Attachment #8378724 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Flagging for verification in case the QA team could reproduce this issue.
Keywords: verifyme
This seems to have caused a significant regression for pages such as http://azarask.in/screensaver/shaders/ (note: unless Aza has updated that page to not be webkit only, you will need to paste this into the console for it to run https://pastebin.mozilla.org/4409399 - it replaces the WebAudio calls to the standard Firefox supports)

Nightly from 2/21 runs it "fine" at something like 25-30 fps on my computer, while the one from 2/22 can only run it at around 1-3 fps
Duplicate of this bug: 969070
According to https://bugzilla.mozilla.org/show_bug.cgi?id=974807#c95 this patch is required in order to resolve bug 974807.  That bug is 1.3+, so I'm nominating this bug as well.
blocking-b2g: --- → 1.3?
Matt,

Could you comment on the riskiness of this patch and whether it would be safe to uplift to 1.3 to fix bug 974807?
Flags: needinfo?(matt.woodrow)
Blocks: 974807
No longer blocks: 974807
Duplicate of this bug: 974807
blocking-b2g: 1.3? → 1.3+
(In reply to David Flanagan [:djf] from comment #30)
> Matt,
> 
> Could you comment on the riskiness of this patch and whether it would be
> safe to uplift to 1.3 to fix bug 974807?

Yes, it should be safe to uplift, it's very low risk.
Flags: needinfo?(matt.woodrow)
Please request approval-mozilla-b2g28 on this patch if it's ready for uplift to v1.3.
Flags: needinfo?(matt.woodrow)
Comment on attachment 8378724 [details] [diff] [review]
Stop using the Async surface stream

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unsure
User impact if declined: Hangs when using WebGL in a background tab
Testing completed (on m-c, etc.): Been on m-c for a week.
Risk to taking this patch (and alternatives if risky): Low risk, just stops us waiting on the compositor.
String or IDL/UUID changes made by this patch: None
Attachment #8378724 - Flags: approval-mozilla-b2g28?
Flags: needinfo?(matt.woodrow)
Attachment #8378724 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
I tried reproducing the bug using the Firefox 28 release without success. 
Without proper steps to reproduce the bug I'm unable to verify the bug.
Keywords: verifyme
Depends on: 991434
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Do you want to review this, or should we wait for Jeff?
Attachment #8377276 - Attachment is obsolete: false
Attachment #8377276 - Flags: review?(snorp)
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Review of attachment 8377276 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me, but can't hurt to have jgilbert weigh in.
Attachment #8377276 - Flags: review?(snorp)
Attachment #8377276 - Flags: review?(jgilbert)
Attachment #8377276 - Flags: review+
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Review of attachment 8377276 [details] [diff] [review]:
-----------------------------------------------------------------

Important nit.

::: gfx/gl/SurfaceStream.cpp
@@ +459,5 @@
>      PROFILER_LABEL("SurfaceStream_TripleBuffer_Async", "WaitForCompositor");
>  
>      // We are assumed to be locked
> +    while (mStaging) {
> +        if (!NS_SUCCEEDED(mMonitor.Wait(PR_MillisecondsToInterval(100)))) {

Please comment that Monitor::Wait returns FAILURE on timeout or interrupt. A number of related functions return SUCCESS on timeout.
Attachment #8377276 - Flags: review?(jgilbert) → review+
Comment on attachment 8377276 [details] [diff] [review]
Don't hang

Review of attachment 8377276 [details] [diff] [review]:
-----------------------------------------------------------------

Actually, this is confusing enough I'm going to r- it in this bug. Make a new bug for this, please!
Attachment #8377276 - Flags: review+ → review-
In particular, I can no longer find where we use or would use WaitForCompositor(), so I can't check its usage.
(In reply to Jeff Gilbert [:jgilbert] from comment #41)
> In particular, I can no longer find where we use or would use
> WaitForCompositor(), so I can't check its usage.

This is because we deleted all that code...

We'd have to back out the changes made by this bug thus far, and then apply the new patch.
Also, we have about a week to get this landed and uplifted to beta before we end up shipping a severe WebGL performance regression.
Flags: needinfo?(jgilbert)
(In reply to Matt Woodrow (:mattwoodrow) from comment #43)
> Also, we have about a week to get this landed and uplifted to beta before we
> end up shipping a severe WebGL performance regression.

Where is the code though? I don't see it in any of the patches in this bug. It's hard to r+ this if I can't have the whole cset here in front of me. In particular, the callers of WaitForCompositor() will need to handle a non-true retval properly, and we've had issues with callers ignoring 'always true' retvals in the past in code very similar to this.

It doesn't help anyone if the fix here doesn't end up fixing the problem. This patch *by itself* will not fix the problem if WaitForCompositor is never called, or not called properly. This is what the patch r?me'd does, so I don't feel safe r+ing it.

In general, it's confusing to do fixes to regressions inside the same bug that caused them. The normal procedure is to make a new bug for resolving the regression. This would keep things *much* clearer here, as well. This would also make it clear that I need an actual cset to review, not just half a cset and promise of a correct backout. :)

We can both fix this correctly, and fix it quickly.
Flags: needinfo?(jgilbert)
Depends on: 995065
I'm not convinced that backing the entirety of this bug out and starting out is the same as fixing a regression, but filed bug 995065.
You need to log in before you can comment on or make changes to this bug.