Closed Bug 1316575 (Opened 8 years ago, Closed 7 years ago)

4.35% tp5o Main_RSS (osx-10-10) regression on push fab432069073 (Tue Nov 8 2016)

Categories: Core :: XPCOM, defect
Version: 52 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix

People: (Reporter: jmaher, Unassigned)

References: (Depends on 1 open bug)

Details: (Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push fab432069073. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

  4%  tp5o Main_RSS osx-10-10 opt e10s     315003903.33 -> 328711710.23

Improvements:

  4%  tps summary linux64 opt e10s     43.1 -> 41.56
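
(For reference, the 4% regression figure above, reported as 4.35% in the bug summary, is just the relative change between the two reported values. The snippet below is a purely illustrative check and not part of the Talos harness.)

  # Illustrative only -- not Talos code.
  old, new = 315003903.33, 328711710.23
  print("%.2f%% increase" % ((new - old) / old * 100))  # prints "4.35% increase"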


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=4064

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Here is a compare view with the retriggers, showing the effects of this change only:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=bd9dc9379305&newProject=mozilla-inbound&newRevision=fab432069073&framework=1

For the regression, the numbers have gone bi-modal :(

:smaug, can you take a look at this and help determine what we should do to resolve this bug as per the guidelines in the original comment?
Flags: needinfo?(bugs)
I was expecting regressions, but we should also see improvements in some tests measuring the stability of painting performance. I guess tps ends up sort of doing that.
(Unfortunately, tryserver is often too unreliable to detect Talos regressions, or one would need to trigger all the tests many, many times.)

Based on manual testing, and some feedback from others too, I'm pretty confident that we should not back out the patch.

What we could possibly do is slow down vsync processing, or rather refresh driver handling, while we're loading top-level pages... except that it would make using multiple windows jankier.
Flags: needinfo?(bugs)
I'm pretty sure this is just the timing of GC and the memory snapshot getting changed. Maybe we should make the test framework use requestIdleCallback() somehow to wait for the browser to be truly idle before trying to snapshot the memory.
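
(As an aside: if requestIdleCallback() inside the page is not an option, a cruder harness-side heuristic would be to poll the browser process until it looks quiet before snapshotting. The sketch below is hypothetical and is not how Talos actually works; psutil, the thresholds, and wait_until_idle are all assumptions for illustration.)

  # Hypothetical harness-side heuristic -- not actual Talos code.
  import time
  import psutil

  def wait_until_idle(pid, cpu_threshold=2.0, settle_seconds=2.0, timeout=30.0):
      """Poll the browser process until its CPU usage stays below a threshold."""
      proc = psutil.Process(pid)
      deadline = time.monotonic() + timeout
      quiet_since = None
      while time.monotonic() < deadline:
          if proc.cpu_percent(interval=0.5) < cpu_threshold:
              quiet_since = quiet_since or time.monotonic()
              if time.monotonic() - quiet_since >= settle_seconds:
                  return True   # looks idle; reasonable point to snapshot memory
          else:
              quiet_since = None
      return False              # gave up waiting; snapshot timing stays noisy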
Or hmm, maybe I could do the slowing down at a somewhat different level. Investigating.
(In reply to Ben Kelly [:bkelly] from comment #3)
> I'm pretty sure this is just the timing of GC and the memory snapshot
> getting changed.
Why do you think so? With the new setup we can process vsync sooner, so while it is still loading, the page may end up painting more often.
(In reply to Olli Pettay [:smaug] from comment #5)
> (In reply to Ben Kelly [:bkelly] from comment #3)
> > I'm pretty sure this is just the timing of GC and the memory snapshot
> > getting changed.
> Why do you think so? With the new setup we can process vsync sooner, so
> while it is still loading, the page may end up painting more often.

Well, vsync and paint running first could delay a GC till after the test decides to snapshot memory.  This would look like a memory regression.
Hmm, right. Yeah, I'm not at all familiar with what tp5o actually does, or how the memory is measured.
It doesn't seem to be documented in https://wiki.mozilla.org/Buildbot/Talos/Tests
I am open to changing that in pageloader if we need to.
Or we could just say that the regression is understandable, and do nothing.
I would expect us to get lots of similar-ish changes to the numbers while we're adding more quantum stuff.
would it make sense to review how we collect memory so we have something that we can track and trust?
(In reply to Joel Maher ( :jmaher) from comment #11)
> would it make sense to review how we collect memory so we have something
> that we can track and trust?

We already know that measurement is no good on e10s (bug 1250169, comment 15). Whether or not to force a GC is an interesting question: we've seen leaks in the past that were cleaned up by a forced GC (minimize in about:memory) but not by an incremental GC alone. For AWSY we measure both.
Component: Untriaged → XPCOM
Product: Firefox → Core
(In reply to Eric Rahm [:erahm] from comment #12)
> (In reply to Joel Maher ( :jmaher) from comment #11)
> > would it make sense to review how we collect memory so we have something
> > that we can track and trust?
> 
> We already know that measurement is no good on e10s (bug 1250169, comment
> 15).

Can you clarify a bit, Eric? Does this mean we should close this regression?
Flags: needinfo?(erahm)
(In reply to Andrew Overholt [:overholt] from comment #13)
> (In reply to Eric Rahm [:erahm] from comment #12)
> > (In reply to Joel Maher ( :jmaher) from comment #11)
> > > would it make sense to review how we collect memory so we have something
> > > that we can track and trust?
> > 
> > We already know that measurement is no good on e10s (bug 1250169, comment
> > 15).
> 
> Can you clarify a bit, Eric? Does this mean we should close this regression?

As long as we have a test that sums the RSS(main) + USS(child processes) -- I think we do -- I'd say we can disable this test on e10s and ignore the regression.
Flags: needinfo?(erahm)
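
(For context on what summing "RSS(main) + USS(child processes)" involves, here is a minimal, purely illustrative Python sketch using psutil; it is not the actual Talos or AWSY code, and combined_memory/firefox_pid are hypothetical names.)

  # Illustrative sketch only -- not the Talos/AWSY implementation.
  import psutil

  def combined_memory(firefox_pid):
      """RSS of the main process plus USS of each child (content) process."""
      parent = psutil.Process(firefox_pid)
      total = parent.memory_info().rss              # resident set size of the parent
      for child in parent.children(recursive=True):
          # uss (unique set size) avoids double-counting pages shared with the parent
          total += child.memory_full_info().uss
      return total

(Note that memory_full_info() may need elevated privileges to inspect other processes on some platforms.)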
Joel, are we OK to close this out now?
Flags: needinfo?(jmaher)
There are a handful of things here that confuse me; dealing with memory management and Talos has always confused me.

I don't believe we have proper memory reporting in Talos when it comes to main/child processes. Unfortunately, I am not clear on what to do to fix this inside of Talos; ideally we could get AWSY running on all platforms and track memory that way.

As to the question of closing this, the regression in the Talos numbers still exists:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%5Bmozilla-inbound,d9b2865ade63ca5702db9cbd1572d36051440fd5,1%5D&series=%5Bautoland,d9b2865ade63ca5702db9cbd1572d36051440fd5,0%5D&selected=%5Bmozilla-inbound,d9b2865ade63ca5702db9cbd1572d36051440fd5%5D

It would be sad to close this with no plan to fix talos or ensure we are measuring the right numbers.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #16)
> It would be sad to close this with no plan to fix talos or ensure we are
> measuring the right numbers.

Can you file this and close this bug out?
Flags: needinfo?(jmaher)
Added a comment in bug 1142357.
Flags: needinfo?(jmaher)
(In reply to Jim Mathies [:jimm] from comment #17)
> Can you file this and close this bug out?
Agreed, as long as it's been clarified and there is a follow-up bug being tracked by the module owner, with a priority set.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED