Closed Bug 1316575 (Opened 8 years ago, Closed 7 years ago)

4.35% tp5o Main_RSS (osx-10-10) regression on push fab432069073 (Tue Nov 8 2016)

Categories: Core :: XPCOM, defect
Version: 52 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix

People: (Reporter: jmaher, Unassigned)

References: (Depends on 1 open bug)

Details: (Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push fab432069073. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

  4%  tp5o Main_RSS osx-10-10 opt e10s     315003903.33 -> 328711710.23

Improvements:

  4%  tps summary linux64 opt e10s     43.1 -> 41.56
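
(For reference, the 4% regression figure above, reported as 4.35% in the bug summary, is just the relative change between the two reported values. The snippet below is a purely illustrative check and not part of the Talos harness.)

  # Illustrative only -- not Talos code.
  old, new = 315003903.33, 328711710.23
  print("%.2f%% increase" % ((new - old) / old * 100))  # prints "4.35% increase"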


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=4064

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Here is a compare view with the retriggers, showing the effects of this change only:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=bd9dc9379305&newProject=mozilla-inbound&newRevision=fab432069073&framework=1

For the regression, the numbers have gone bi-modal :(

:smaug, can you take a look at this and help determine what we should do to resolve this bug as per the guidelines in the original comment?
Flags: needinfo?(bugs)
I was expecting regressions, but we should also see improvements in some tests measuring the stability of painting performance. I guess tps ends up sort of doing that.
(Unfortunately, tryserver is often too unreliable to detect Talos regressions, or one would need to trigger all the tests many, many times.)

Based on manual testing, and some feedback from others too, I'm pretty confident that we should not back out the patch.

What we could possibly do is slow down vsync processing, or rather refresh driver handling, while we're loading top-level pages... except that it would make using multiple windows jankier.
Flags: needinfo?(bugs)
I'm pretty sure this is just the timing of GC and the memory snapshot getting changed. Maybe we should make the test framework use requestIdleCallback() somehow to wait for the browser to be truly idle before trying to snapshot the memory.
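
(As an aside: if requestIdleCallback() inside the page is not an option, a cruder harness-side heuristic would be to poll the browser process until it looks quiet before snapshotting. The sketch below is hypothetical and is not how Talos actually works; psutil, the thresholds, and wait_until_idle are all assumptions for illustration.)

  # Hypothetical harness-side heuristic -- not actual Talos code.
  import time
  import psutil

  def wait_until_idle(pid, cpu_threshold=2.0, settle_seconds=2.0, timeout=30.0):
      """Poll the browser process until its CPU usage stays below a threshold."""
      proc = psutil.Process(pid)
      deadline = time.monotonic() + timeout
      quiet_since = None
      while time.monotonic() < deadline:
          if proc.cpu_percent(interval=0.5) < cpu_threshold:
              quiet_since = quiet_since or time.monotonic()
              if time.monotonic() - quiet_since >= settle_seconds:
                  return True   # looks idle; reasonable point to snapshot memory
          else:
              quiet_since = None
      return False              # gave up waiting; snapshot timing stays noisy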
Or hmm, maybe I could do the slowing down at a somewhat different level. Investigating.
(In reply to Ben Kelly [:bkelly] from comment #3)
> I'm pretty sure this is just the timing of GC and the memory snapshot
> getting changed.
Why do you think so? With the new setup we can process vsync sooner, so while it is still loading, the page may end up painting more often.
(In reply to Olli Pettay [:smaug] from comment #5)
> (In reply to Ben Kelly [:bkelly] from comment #3)
> > I'm pretty sure this is just the timing of GC and the memory snapshot
> > getting changed.
> Why do you think so? With the new setup we can process vsync sooner, so
> while it is still loading, the page may end up painting more often.

Well, vsync and paint running first could delay a GC till after the test decides to snapshot memory.  This would look like a memory regression.
Hmm, right. Yeah, I'm not at all familiar with what tp5o actually does, or how the memory is measured.
It doesn't seem to be documented in https://wiki.mozilla.org/Buildbot/Talos/Tests
I am open to changing that in pageloader if we need to.
Or we could just say that the regression is understandable, and do nothing.
I would expect us to get lots of similar-ish changes to the numbers while we're adding more quantum stuff.
would it make sense to review how we collect memory so we have something that we can track and trust?
(In reply to Joel Maher ( :jmaher) from comment #11)
> would it make sense to review how we collect memory so we have something
> that we can track and trust?

We already know that measurement is no good on e10s (bug 1250169, comment 15). Whether or not to force a GC is an interesting question: we've seen leaks in the past that were cleaned up by a forced GC (minimize in about:memory) but not by an incremental GC alone. For AWSY we measure both.
Component: Untriaged → XPCOM
Product: Firefox → Core
(In reply to Eric Rahm [:erahm] from comment #12)
> (In reply to Joel Maher ( :jmaher) from comment #11)
> > would it make sense to review how we collect memory so we have something
> > that we can track and trust?
> 
> We already know that measurement is no good on e10s (bug 1250169, comment
> 15).

Can you clarify a bit, Eric? Does this mean we should close this regression?
Flags: needinfo?(erahm)
(In reply to Andrew Overholt [:overholt] from comment #13)
> (In reply to Eric Rahm [:erahm] from comment #12)
> > (In reply to Joel Maher ( :jmaher) from comment #11)
> > > would it make sense to review how we collect memory so we have something
> > > that we can track and trust?
> > 
> > We already know that measurement is no good on e10s (bug 1250169, comment
> > 15).
> 
> Can you clarify a bit, Eric? Does this mean we should close this regression?

As long as we have a test that sums the RSS(main) + USS(child processes) -- I think we do -- I'd say we can disable this test on e10s and ignore the regression.
Flags: needinfo?(erahm)
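
(For context on what summing "RSS(main) + USS(child processes)" involves, here is a minimal, purely illustrative Python sketch using psutil; it is not the actual Talos or AWSY code, and combined_memory/firefox_pid are hypothetical names.)

  # Illustrative sketch only -- not the Talos/AWSY implementation.
  import psutil

  def combined_memory(firefox_pid):
      """RSS of the main process plus USS of each child (content) process."""
      parent = psutil.Process(firefox_pid)
      total = parent.memory_info().rss              # resident set size of the parent
      for child in parent.children(recursive=True):
          # uss (unique set size) avoids double-counting pages shared with the parent
          total += child.memory_full_info().uss
      return total

(Note that memory_full_info() may need elevated privileges to inspect other processes on some platforms.)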
Joel, are we OK to close this out now?
Flags: needinfo?(jmaher)
There are a handful of things here that confuse me; dealing with memory management and Talos has always confused me.

I don't believe we have proper memory reporting in Talos when it comes to main/child processes. Unfortunately, I am not clear on what to do to fix this inside of Talos; ideally we could get AWSY running on all platforms and track memory that way.

As to the question of closing this, the regression in the Talos numbers still exists:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%5Bmozilla-inbound,d9b2865ade63ca5702db9cbd1572d36051440fd5,1%5D&series=%5Bautoland,d9b2865ade63ca5702db9cbd1572d36051440fd5,0%5D&selected=%5Bmozilla-inbound,d9b2865ade63ca5702db9cbd1572d36051440fd5%5D

It would be sad to close this with no plan to fix talos or ensure we are measuring the right numbers.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #16)
> It would be sad to close this with no plan to fix talos or ensure we are
> measuring the right numbers.

Can you file this and close this bug out?
Flags: needinfo?(jmaher)
Added a comment in bug 1142357.
Flags: needinfo?(jmaher)
(In reply to Jim Mathies [:jimm] from comment #17)
> Can you file this and close this bug out?
Agreed, as long as it's been clarified and there is a follow-up bug being tracked by the module owner, with a priority set.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED