Open Bug 1426743 Opened 6 years ago Updated 2 years ago

3.68% cpstartup content-process-startup (windows10-64) regression on push 40eb00c21e7c776d4781c28e88a9b0917b4396ce (Thu Dec 21 2017)

Categories

(Core :: DOM: Content Processes, defect, P3)

53 Branch
defect

Tracking

()

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=40eb00c21e7c776d4781c28e88a9b0917b4396ce

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  4%  cpstartup content-process-startup windows10-64 pgo e10s     278.58 -> 288.83


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=11042

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:sylvestre, I see you landed all the patches in this commit:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=51a3a1b8513ce4bf424dce4d14cb2465529de4f4&tochange=40eb00c21e7c776d4781c28e88a9b0917b4396ce

I understand you might be out for the holidays (the fun of landing code the day before you leave for a long break), so I am fine with you fixing this the first week in January.
Flags: needinfo?(sledru)
This is just renaming our application from Nightly to Firefox Nightly. I wonder how this could regress the startup.
Are you sure there isn't something else?
this change shows up on pgo and opt, and I have a few extra retriggers to show before/after, the data holds true.  I don't see other regressions, just the cpstartup (this was added in bug 1336389 by :gabor who is inactive on bugzilla now), possibly ping :mconley to learn more about the test?
Mike, do you have a clue why the renaming from Nightly to Firefox Nightly regress the performances by 3.7%?
I had to land some fixes in parallel but they are just build & packaging plumbing.
List of patches is available in comment #1.

(this can wait for next year :)
Flags: needinfo?(sledru) → needinfo?(mconley)
The ts_paint regressions bellow were also noticed for this bug:

Regressions:

  2%  ts_paint windows10-64 pgo e10s                              490.67 -> 498.25
  1%  ts_paint windows10-64 opt e10s                              572.42 -> 580.25
That's pretty mysterious. It's also interesting to see that we're losing roughly 8-10ms on all of these regressions.

I don't have a great explanation or theory for this. The only thing I can recommend is to extract some profiles to see if we can suss out what is going on here.

Hey igoldan - is the mechanism for automatically creating performance profiles for the before and after case set up yet? Or should I do this myself manually for now?
Flags: needinfo?(mconley) → needinfo?(igoldan)
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews and needinfos from comment #6)
> Hey igoldan - is the mechanism for automatically creating performance
> profiles for the before and after case set up yet? Or should I do this
> myself manually for now?
I don't think we added this mechanism yet.

Robert, have you managed to look over this feature?
Flags: needinfo?(igoldan) → needinfo?(rwood)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #7)
> (In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews and
> needinfos from comment #6)
> > Hey igoldan - is the mechanism for automatically creating performance
> > profiles for the before and after case set up yet? Or should I do this
> > myself manually for now?
> I don't think we added this mechanism yet.
> 
> Robert, have you managed to look over this feature?

No that's correct we don't have that automation. Only the --geckoProfile flag for profiling the current talos run. I did a search and don't see a bug filed for that ability, although seems to me I heard something in Austin about this perhaps. Is this part of the larger idea of running talos twice on a single push (i.e. on the current rev and the rev -1) and providing the diff?
Flags: needinfo?(rwood)
Flags: needinfo?(mconley)
Flags: needinfo?(jmaher)
we don't have the ability for this- I think in the coming weeks we will get our sheriffing process updated to have two try pushes (before/after) which include geckoProfile;  To do this on existing pushes has a lot of small problems to work around and tooling to do, so adding an extra sheriffing step is the path of least resistance for now.

:mconley, I would just do 2 try pushes for now;
Flags: needinfo?(jmaher)
Just coming back round to this now. Apparently, these try pushes failed. :( Timing out in ts_paint for some reason.

I'll remove all tests except for cpstartup to see if that helps.
I just took a quick peek at these profiles and the scores that the cpstartup tests got with them, and it's not very encouraging. Here are the scores:

Baseline: 

11:45:40     INFO -  PID 7012 | ------- Summary: start -------
11:45:40     INFO -  PID 7012 | Number of tests: 1
11:45:40     INFO -  PID 7012 |
11:45:40     INFO -  PID 7012 | [#0] content-process-startup  Cycles:2  Average:491.50  Median:525.00  stddev:47.38 (9.0%)
11:45:40     INFO -  PID 7012 | Values: 525.0  458.0
11:45:40     INFO -  PID 7012 | -------- Summary: end --------


Regression:

11:45:42     INFO -  PID 8968 | ------- Summary: start -------
11:45:42     INFO -  PID 8968 | Number of tests: 1
11:45:42     INFO -  PID 8968 |
11:45:42     INFO -  PID 8968 | [#0] content-process-startup  Cycles:2  Average:395.50  Median:432.00  stddev:51.62 (11.9%)
11:45:42     INFO -  PID 8968 | Values: 432.0  359.0
11:45:42     INFO -  PID 8968 | -------- Summary: end --------


This means that the "regression" patch actually performed better than the baseline when profiling.

This means the profiles in comment 13 are not likely to contain anything actionable for this issue.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews and needinfos from comment #14)
> I just took a quick peek at these profiles and the scores that the cpstartup
> tests got with them, and it's not very encouraging. Here are the scores:
> 
> Baseline: 
> 
> 11:45:40     INFO -  PID 7012 | ------- Summary: start -------
> 11:45:40     INFO -  PID 7012 | Number of tests: 1
> 11:45:40     INFO -  PID 7012 |
> 11:45:40     INFO -  PID 7012 | [#0] content-process-startup  Cycles:2 
> Average:491.50  Median:525.00  stddev:47.38 (9.0%)
> 11:45:40     INFO -  PID 7012 | Values: 525.0  458.0
> 11:45:40     INFO -  PID 7012 | -------- Summary: end --------
> 
> 
> Regression:
> 
> 11:45:42     INFO -  PID 8968 | ------- Summary: start -------
> 11:45:42     INFO -  PID 8968 | Number of tests: 1
> 11:45:42     INFO -  PID 8968 |
> 11:45:42     INFO -  PID 8968 | [#0] content-process-startup  Cycles:2 
> Average:395.50  Median:432.00  stddev:51.62 (11.9%)
> 11:45:42     INFO -  PID 8968 | Values: 432.0  359.0
> 11:45:42     INFO -  PID 8968 | -------- Summary: end --------
> 
> 
> This means that the "regression" patch actually performed better than the
> baseline when profiling.
> 
> This means the profiles in comment 13 are not likely to contain anything
> actionable for this issue.

Can we do collect other kinds of insight data, so we can figure this out?
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #15)
> Can we do collect other kinds of insight data, so we can figure this out?

I'm not sure. Profiling is our biggest hammer. If enabling profiling causes the issue to disappear, that means we're dealing with a bug we can't really observe with our instruments.

When that fails, sometimes one can try to instrument things manually - have Gecko in the content process dump process-start-deltas during its start-up steps until the point it sends the result to the parent process. Do this for both revisions with and without the regressions changeset, and see if a gap opens up somewhere where one step tends to take longer for the regressor than the baseline.
Component: Untriaged → DOM: Content Processes
Product: Firefox → Core
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.