
6.11 - 8.94% tp5o Main_RSS / tp5o Private Bytes (windowsxp) regression on push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5 (Tue Jun 28 2016)
(Reporter: rwood, Assigned: bkelly)


(Blocks: 1 bug, {perf, regression, talos-regression})

50 Branch

Firefox Tracking Flags

(firefox50 affected)



(2 attachments)



a year ago
Talos has detected a Firefox performance regression from push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5. Since you authored one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  tp5o Main_RSS windowsxp opt e10s - 6.11% worse
  tp5o Private Bytes windowsxp opt e10s - 8.94% worse

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1627

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

Comment 1

a year ago
I checked talos before landing that patch and it showed no regression:

Comment 2

a year ago
(In reply to Ben Kelly [:bkelly] from comment #1)
> I checked talos before landing that patch and it showed no regression:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportant=0

The regressions are on Windows XP, which I don't see in that compare view.
Keywords: talos-regression
Whiteboard: talos_regression

Comment 3

a year ago
This regression is a bit different in that it has caused the data to actually become bimodal. As you can see here on windowsxp:

[graph: windowsxp]

And here for windows7-32:

[graph: windows7-32]
Comment 4

a year ago
Looking at the graph it appears to have become bimodal after bug 1279389 landed.  This kind of makes sense.  Sometimes the worker is killed before the measurement is taken.  Other times the worker is still running or just starting up.

My best guess as to why the memory increased is because the worker is starting up, delaying some amount of data from being written to disk.  This then shows up in the chart as higher memory usage during the talos snapshot.

I'm not sure what to do here.  I think the patch is working as intended as we allow the process to reduce to a lower value when work is not being done.  You can see this in the lower bimodal value.  But it does make measuring memory harder because there is more of a timing element in play now.

Comment 5

a year ago
:erahm, have you seen this bi-modal memory showing up in AWSY since June 29th?
Flags: needinfo?(erahm)

Comment 6

a year ago
(In reply to Joel Maher ( :jmaher ) from comment #5)
> :erahm, have you seen this bi-modal memory showing up in AWSY since June
> 29th?

Note: AWSY in automation is only running linux64-opt, non-e10s.

It does look like we're seeing up to 11MiB swings in our RSS start measurement (explicit has remained flat), *but* that measurement has always been volatile and it doesn't appear to be worse than before.

Maybe we can tweak that pref in testing to get more stable results?
Flags: needinfo?(erahm)

Comment 7

a year ago
:bkelly, we are merging to aurora next week; is there anything to do here?
Flags: needinfo?(bkelly)

Comment 8

a year ago


Tweak timeout to 25 seconds:


Tweak timeout to 35 seconds:

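The timeout being tweaked here is presumably the OS.File worker idle-shutdown delay introduced by bug 1279389. Assuming that delay is exposed as a pref (the pref name below is an assumption, not confirmed in this bug), the try patches would amount to something like:

```javascript
// Assumed pref name and chrome-privileged Services API; a sketch of what
// "tweak timeout to 25 seconds" would look like, not the actual patch.
Services.prefs.setIntPref("osfile.reset_worker_delay", 25000); // 25 s idle delay
```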

Comment 9

a year ago
Joel, what am I doing wrong?  It's been over 24 hours and these have still not run.  I'm in a spec meeting today and tomorrow, so I don't have a lot of time to debug this.
Flags: needinfo?(bkelly) → needinfo?(jmaher)

Comment 10

a year ago
I have been chipping away at bumping up the priority of the tp jobs, and now I see all tests have completed, most of them without my help.
Flags: needinfo?(jmaher)

Comment 11

a year ago
After looking at the data in compare view, neither of these options is giving us any gains.  Possibly with 40 data points the rate of bi-modality could change, but as it stands with 6 data points we still show a common bi-modal distribution.

25 seconds vs baseline:

35 seconds vs baseline:

It seems that for both opt/e10s and main_rss/private_bytes, the 25 second delay and the 35 second delay each show one fewer high data point than the baseline (the regression is that we have gone bi-modal by adding high data points).  More data would prove that assertion; I have requested more jobs and we will see if they can get run in a reasonable amount of time, to see whether we end up with a more agreeable trend.

Comment 12

a year ago
I think we need to see a larger sample set, but it seems to me the 25 second value showed a non-trivial improvement for non-e10s.  The current set of samples for non-e10s with the 25 second value is not bimodal at all.

NI'ing myself to check back on Monday for the additional results.

By the way, note that this pref is only enabled on non-release builds.  So this will not ride the trains to beta/release yet.
Flags: needinfo?(bkelly)

Comment 13

a year ago
I did more retriggers and we have 16 data points for tp5 e10s on all try pushes mentioned above.  We see no change for main_rss or private bytes.  With more data I see us hitting 75% low values and 25% high values.

Comment 14

a year ago
So do you think it's worth trying to land the 25 second pref value?
Flags: needinfo?(bkelly) → needinfo?(jmaher)

Comment 15

a year ago
I see no performance advantage of doing that, but if there is another reason for doing so, please do!
Flags: needinfo?(jmaher)

Comment 16

a year ago
Ok, then I think I'd like to mark this WONTFIX.  I believe this bi-modal behavior is a timing artifact of when we try to snapshot RSS values.
Last Resolved: a year ago
Resolution: --- → WONTFIX

Comment 17

a year ago
I have another idea of what's going on.

The osfile_async_front.jsm code does this to dispose of the worker:

  this._worker = null;

This depends on GC to actually shutdown the worker.  In the meantime, we could then spin up a new worker since we can't get to the old one.

Let me see if a patch that does this helps at all:

  this._worker.terminate();
  this._worker = null;

Or we could use a weak ref to opportunistically reuse a stale worker thread.
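The weak-ref alternative could be sketched like this, using the standard WeakRef for illustration (the real chrome code would use Mozilla's own weak-reference facility; WorkerPool and the injected factory are hypothetical names, not anything in osfile_async_front.jsm):

```javascript
// Sketch only: hold the idle worker weakly so GC may reclaim it,
// but opportunistically reuse it if it is still alive.
class WorkerPool {
  constructor(makeWorker) {
    this._makeWorker = makeWorker; // factory injected so the sketch is self-contained
    this._workerRef = null;        // WeakRef to the most recent worker, if any
  }
  getWorker() {
    let worker = this._workerRef && this._workerRef.deref();
    if (!worker) {
      // Previous worker was collected (or never created): spin up a new one.
      worker = this._makeWorker();
      this._workerRef = new WeakRef(worker);
    }
    return worker;
  }
}
```

While the worker is still strongly reachable elsewhere (e.g. mid-request), repeated getWorker() calls return the same instance instead of spinning up a duplicate.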
Assignee: nobody → bkelly
Resolution: WONTFIX → ---

Comment 18

a year ago
Created attachment 8777354 [details] [diff] [review]
P1 Add a BasePromiseWorker.prototype.terminate() method. r=yoric

First we have to expose the underlying ChromeWorker.terminate() to the PromiseWorker wrapper interface.
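A minimal sketch of what that forwarding looks like (the class shape and names here are assumed for illustration, not the actual PromiseWorker.jsm source):

```javascript
// Sketch: the promise-based wrapper lazily creates its underlying worker
// and forwards terminate() to it so callers can shut the thread down
// explicitly instead of waiting for GC.
class BasePromiseWorkerSketch {
  constructor(makeWorker) {
    this._makeWorker = makeWorker; // factory standing in for new ChromeWorker(url)
    this._worker = null;
  }
  get worker() {
    if (!this._worker) {
      this._worker = this._makeWorker(); // spin up the real worker on first use
    }
    return this._worker;
  }
  terminate() {
    if (this._worker) {
      this._worker.terminate(); // tear the thread down immediately
      this._worker = null;      // a later request creates a fresh worker
    }
  }
}
```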

Comment 19

a year ago
Created attachment 8777355 [details] [diff] [review]
P2 Make osfile_async_front.jsm explicitly terminate its worker. r=yoric

This works in local testing.  Let's do some try builds.

Comment 20

a year ago
Talos baseline:


Try and talos build with patches:


I hope I got the try syntax right.

Comment 21

a year ago
Joel, can you do your magic to get these to run on winxp?  Still nothing after an overnight run.
Flags: needinfo?(jmaher)

Comment 22

a year ago
I have bumped up all the tp5 jobs to priority 4; there are 1500+ winxp jobs in the queue, so p4 won't guarantee it.  I do see that you cancelled a lot of the jobs for winxp, but in the future, since this is a tp5-only regression on winxp, just use -u none -t tp5o[Windows XP] in your try syntax, which will save resources on Windows XP.
Flags: needinfo?(jmaher)

Comment 23

a year ago
I didn't cancel those jobs.  I guess it was someone cleaning out old try jobs that they thought were done. :-\

Comment 24

a year ago
Initial results seem fairly similar.  Trying a few more retriggers to see if more samples reveal a trend.

Comment 25

a year ago
This change makes no appreciable difference either.  Moving this back to WONTFIX.
Last Resolved: a year ago
Resolution: --- → WONTFIX