Closed Bug 1486876 Opened 6 years ago Closed 6 years ago

Tab switching composite metrics aren't reliable because of timer-based MozAfterPaint

Categories

(Core :: Graphics, defect, P3)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1445570

People

(Reporter: bholley, Assigned: mattwoodrow)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

STR:
* Enable WR, launch browser
* Load http://diana-adrianne.com/purecss-francine/ in a background tab
* Launch the profiler, tab over to francine, and capture a profile [1].
* Note that AsyncTabSwitch:Composited comes in the middle of a long Composite on the Renderer thread, and doesn't correspond to any related operations in the profile.

The profile at [1] also includes a few custom markers. The relevant bit is "EnsureEventualDidPaint Callback", which was a marker I placed at [2]. This comes immediately before the chain of markers leading to AsyncTabSwitch:Composited, so it seems pretty clear that our profile marker and Telemetry are being clamped by a 100ms timer. This prevents us from detecting long tab switches in the wild with WR.

Matt, do we need this timer? It looks like Mantaroh tried to remove it in bug 1419226, but the patch was backed out. Alternatively, is there another more-reliable metric we can use?


[1] https://perfht.ml/2Nqdwug
[2] https://searchfox.org/mozilla-central/rev/55da592d85c2baf8d8818010c41d9738c97013d2/layout/base/nsPresContext.cpp#3232
Flags: needinfo?(matt.woodrow)
Priority: -- → P3
Whiteboard: [gfx-noted]
We should definitely not need the timer. It exists because it's hard to guarantee that the compositor will always deliver a DidComposite message, so we have a sad-fallback to make sure callbacks don't get forgotten.

Looks like bug 1445570 now has the remaining patches that we'd need, but it sounds like there are indeed some cases where we miss the message. Fixing these bugs should be possible, but there's a real downside in that the opportunity for new bugs (and potentially very subtle intermittent test failures) will exist.

I'll rebase the patch and push it to try to see what the current state is.

I can't think of any other ways for JS to observe the actual composite happening.

Some other (somewhat bad) options:

* Increase the EnsureEventualDidPaint timeout (to 1000ms?) so that we get somewhat more useful results, but the bugs affecting some tests will wait longer.

* Make C++ code responsible for recording the telemetry/marker, since the DidComposite message should still arrive as expected. This is pretty awkward since all the rest of the tabswitch code is in JS, and we really don't want the platform to have to care about these details.

* Expose DidComposite more directly to JS, in addition to MozAfterPaint.
Flags: needinfo?(matt.woodrow)
Yeah, the timer feels pretty icky. Worth seeing if there's one or two bugs that can be fixed so that we can drop it.

If it's more involved, probably not worth rat-holing on it. I think a decent workaround would be to bump the timer, _and_ add a boolean to the event to say whether it was timer-generated or not. This would let the tab switcher telemetry give it special handling, though we'd want to think about what that handling would be (drop the sample, ceil(), or record it in a separate bucket?).
Also, would it be straightforward to fire twice - once for the timer-generated one, and once for the real one? If so, the tab switcher code could just ignore the timer-generated one.
(In reply to Bobby Holley (:bholley) from comment #3)
> Also, would it be straightforward to fire twice - once for the
> timer-generated one, and once for the real one? If so, the tab switcher code
> could just ignore the timer-generated one.

I think just firing it the second time under a different event name would be easiest.

That way we have MozAfterPaint which is guaranteed to be fired, and hopefully after compositing (but maybe earlier), and MozDidComposite, which *should* fire, but definitely after compositing.
Unfortunately, it appears that this problem also affects some of our talos tests. Removing the timer generated event results in quite a few serious regressions.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=2dedceddb210599ee8dbf20697e821fcbaeda657&framework=1&selectedTimeRange=172800
Assignee: nobody → matt.woodrow
Depends on: 1445570
Verified fixed by bug 1445570. Thanks Matt!

Note that bug 1482614 still prevents us from getting reliable telemetry for a lot of tab switching, but at least things are moving in the right direction.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
(In reply to Bobby Holley (:bholley) from comment #7)
> Verified fixed by bug 1445570. Thanks Matt!
> 
> Note that bug 1482614 still prevents us from getting reliable telemetry for
> a lot of tab switching, but at least things are moving in the right
> direction.

Do you have a feeling for how often we hit the missing path? My understanding was that it was supposed to be rare, but now I can't find out how I came to that understanding.
(In reply to Matt Woodrow (:mattwoodrow) from comment #8)
> (In reply to Bobby Holley (:bholley) from comment #7)
> > Verified fixed by bug 1445570. Thanks Matt!
> > 
> > Note that bug 1482614 still prevents us from getting reliable telemetry for
> > a lot of tab switching, but at least things are moving in the right
> > direction.
> 
> Do you have a feeling for how often we hit the missing path? My
> understanding was that it was supposed to be rare, but now I can't find out
> how I came to that understanding.

Well, it happens any time we show the spinner, which happens any time the content process takes longer than 400ms to flush layout and cough up a DL. That's not the case we're aiming for, but it's certainly an interesting one, and happens quite a bit on slower machines / high load. So I think it's one we want to account for somehow if we want our Telemetry to give us useful insight into the tail parts of the curve.

I actually tried to verify _this_ bug on the 2017 reference laptop, but kept hitting the spinner while profiling (so I verified on my MBP instead).
You need to log in before you can comment on or make changes to this bug.