Closed Bug 1284621 Opened 8 years ago Closed 8 years ago

6.11 - 8.94% tp5o Main_RSS / tp5o Private Bytes (windowsxp) regression on push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5 (Tue Jun 28 2016)

Categories

(Toolkit Graveyard :: OS.File, defect)

Version: 50 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox50 affected)

RESOLVED WONTFIX
Tracking Status
firefox50 --- affected

People

(Reporter: rwood, Assigned: bkelly)

References

Details

(Keywords: perf, regression, talos-regression)

Attachments

(2 files)

Talos has detected a Firefox performance regression from push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5. Since you are the author of one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  tp5o Main_RSS windowsxp opt e10s - 6.11% worse
  tp5o Private Bytes windowsxp opt e10s - 8.94% worse

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1627

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
(In reply to Ben Kelly [:bkelly] from comment #1)
> I checked talos before landing that patch and it showed no regression:
> 
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportant=0

The regressions are on Windows XP, which I don't see in that compare view.
Whiteboard: talos_regression
Looking at the graph it appears to have become bimodal after bug 1279389 landed.  This kind of makes sense.  Sometimes the worker is killed before the measurement is taken.  Other times the worker is still running or just starting up.

My best guess as to why the memory increased is that the worker is starting up, which delays some amount of data from being written to disk.  That data then shows up as higher memory usage during the talos snapshot.

I'm not sure what to do here.  I think the patch is working as intended, since we allow the process to drop to a lower memory value when no work is being done; you can see this in the lower bimodal value.  But it does make measuring memory harder because there is more of a timing element in play now.
:erahm, have you seen this bi-modal memory showing up in AWSY since June 29th?
Flags: needinfo?(erahm)
(In reply to Joel Maher ( :jmaher ) from comment #5)
> :erahm, have you seen this bi-modal memory showing up in AWSY since June
> 29th?

Note: AWSY in automation is only running linux64-opt, non-e10s.

It does look like we're seeing up to 11MiB swings in our RSS start measurement (explicit has remained flat), *but* that measurement has always been volatile and it doesn't appear to be worse than before.

Maybe we can tweak that pref in testing to get more stable results?
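
For example, pinning that delay in the test profile would look something like this (the pref name here is my assumption based on bug 1279389 and should be verified before relying on it):

  // user.js for the test profile -- pref name assumed, not verified.
  // A fixed, longer delay keeps the OS.File worker alive across the
  // measurement window, which should reduce the timing-dependent swings.
  user_pref("osfile.reset_worker_delay", 30000);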
Flags: needinfo?(erahm)
:bkelly, we are merging to aurora next week - is there anything to do here?
Flags: needinfo?(bkelly)
Joel, what am I doing wrong?  It's been over 24 hours and these have still not run.  I'm in a spec meeting today and tomorrow, so I don't have a lot of time to debug this.
Flags: needinfo?(bkelly) → needinfo?(jmaher)
I have been chipping away at bumping up the priority of the tp jobs, and now I see all tests have completed, most of them without my help.
Flags: needinfo?(jmaher)
After looking at the data in compare view, neither of these options is giving us any gains.  Possibly with 40 data points the rate of bi-modality could change, but as it stands, with 6 data points we still show a common bi-modal distribution.

25 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=64e48394c9c8cea1b0cb0f26f32ddd080108f0b8&framework=1&filter=private&showOnlyImportant=0

35 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=dbbe9f7a45cfc5749923ae408fbd910a66137cb5&framework=1&filter=private&showOnlyImportant=0


It seems that for both opt/e10s and main_rss/private_bytes, the 25 second and 35 second delays each have one fewer high data point than the baseline (the regression is that we have gone bi-modal by adding high data points).  More data would confirm that assertion; I have requested more jobs, and we will see if they can run in a reasonable amount of time and whether we end up with a more agreeable trend.
I think we need to see a larger sample set, but it seems to me the 25 second value showed a non-trivial improvement for non-e10s.  The current set of samples for non-e10s with the 25 second value is not bimodal at all.

NI'ing myself to check back on Monday for the additional results.

By the way, note that this pref is only enabled on non-release builds.  So this will not ride the trains to beta/release yet.
Flags: needinfo?(bkelly)
I did more retriggers and we have 16 data points for tp5 e10s on all the try pushes mentioned above.  We see no change for main_rss or private bytes.  With more data I see us hitting 75% low values and 25% high values.
So do you think it's worth trying to land the 25 second pref value?
Flags: needinfo?(bkelly) → needinfo?(jmaher)
I see no performance advantage of doing that, but if there is another reason for doing so, please do!
Flags: needinfo?(jmaher)
Ok, then I think I'd like to mark this WONTFIX.  I believe this bi-modal behavior is a timing artifact of when we try to snapshot RSS values.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
I have another idea of what's going on.

The osfile_async_front.jsm code does this to dispose of the worker:

  this._worker = null;

This depends on GC to actually shut down the worker.  In the meantime, we could spin up a new worker, since we can no longer reach the old one.

Let me see if a patch that does this helps at all:

  this._worker.terminate();
  this._worker = null;

Or we could use a weak ref to opportunistically reuse a stale worker thread.
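
A rough sketch of that weak-ref idea, assuming a ChromeWorker can actually be held through Cu.getWeakReference (that would need verifying) and using illustrative names rather than the real osfile_async_front.jsm ones:

  // Sketch only.  Drop the strong reference so GC can reclaim an idle
  // worker, but keep a weak one so we can reuse the thread if it is
  // still alive when the next request comes in.
  this._weakWorker = Cu.getWeakReference(this._worker);
  this._worker = null;

  // Later, when work arrives:
  let stale = this._weakWorker && this._weakWorker.get();
  this._worker = stale || this._createWorker();  // _createWorker() is hypothetical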
Assignee: nobody → bkelly
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
First we have to expose the underlying ChromeWorker.terminate() to the PromiseWorker wrapper interface.
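
Roughly something like the following, as a sketch rather than the actual PromiseWorker.jsm change (the _worker field name is borrowed from the snippet above):

  // Hypothetical method on the PromiseWorker wrapper -- sketch only.
  terminate() {
    if (this._worker) {
      // Forward to the underlying ChromeWorker and drop our reference so
      // a later request lazily spawns a fresh worker.
      this._worker.terminate();
      this._worker = null;
    }
  },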
This works in local testing.  Let's do some try builds.
Joel, can you do your magic to get these to run on winxp?  Still nothing after an overnight run.
Flags: needinfo?(jmaher)
I have bumped up all the tp5 jobs to priority 4, but there are 1500+ winxp jobs in the queue, so p4 won't guarantee it.  I do see that you cancelled a lot of the jobs for winxp; in the future, since this is a tp5-only regression on winxp, just use -u none -t tp5o[Windows XP] in your try syntax, which will save resources on Windows XP.
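
For reference, a complete try syntax line limited to that suite would look something like this (the -b/-p values are my assumption for a win32 opt talos run; adjust to your build config):

  try: -b o -p win32 -u none -t tp5o[Windows XP]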
Flags: needinfo?(jmaher)
I didn't cancel those jobs.  I guess it was someone cleaning out old try jobs that they thought were done. :-\
Initial results seem fairly similar.  Trying a few more retriggers to see if more samples reveals a trend.
There is no appreciable difference with this change either.  Moving this back to WONTFIX.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Toolkit → Toolkit Graveyard