Closed Bug 1284621 Opened 8 years ago Closed 8 years ago

6.11 - 8.94% tp5o Main_RSS / tp5o Private Bytes (windowsxp) regression on push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5 (Tue Jun 28 2016)

Tracking

(firefox50 affected)

Status:

RESOLVED WONTFIX

Tracking Flags:

Tracking

Status

firefox50

---

affected

People

(Reporter: rwood, Assigned: bkelly)

References

Details

(Keywords: perf, regression, talos-regression)

Attachments

(2 files)

P1 Add a BasePromiseWorker.prototype.terminate() method. r=yoric 8 years ago Ben Kelly [:bkelly, not reviewing] 995 bytes, patch		Details \| Diff \| Splinter Review
P2 Make osfile_async_front.jsm explicitly terminate its worker. r=yoric 8 years ago Ben Kelly [:bkelly, not reviewing] 1.46 KB, patch		Details \| Diff \| Splinter Review

Robert Wood [:rwood]

Reporter

Description

•

8 years ago

Talos has detected a Firefox performance regression from push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5. As author of one of the patches included in that push, we need your help to address this regression.

Summary of tests that regressed:

  tp5o Main_RSS windowsxp opt e10s - 6.11% worse
  tp5o Private Bytes windowsxp opt e10s - 8.94% worse

You can find links to graphs and comparsion views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1627

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 1

•

8 years ago

I checked talos before landing that patch and it showed no regression:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportant=0

William Lachance (:wlach)

Comment 2

•

8 years ago

(In reply to Ben Kelly [:bkelly] from comment #1)
> I checked talos before landing that patch and it showed no regression:
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newR
> evision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportan
> t=0

The regressions are on Windows XP, which I don't see in that compare view.

William Lachance (:wlach)

Updated

•

8 years ago

Keywords: talos-regression

Whiteboard: talos_regression

Robert Wood [:rwood]

Reporter

Comment 3

•

8 years ago

This regression is a bit different in that it has caused the data to actually become bimodal. As you can see here on windowsxp:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,c224a6af64b18a61a2aecd530bd7b05fc46cc778,1,1%5D&series=%5Bfx-team,c224a6af64b18a61a2aecd530bd7b05fc46cc778,1,1%5D&zoom=1466997576457.8948,1467488044181.0527,170985505.47724184,189536230.114923

And here for windows7-32:

https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,fcd6b5e848a5e81cc798d360e10520e0ea014f60,1,1%5D&series=%5Bfx-team,fcd6b5e848a5e81cc798d360e10520e0ea014f60,1,1%5D&zoom=1466970443600,1467639029280,110695650.40477808,128956519.96999547

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 4

•

8 years ago

Looking at the graph it appears to have become bimodal after bug 1279389 landed.  This kind of makes sense.  Sometimes the worker is killed before the measurement is taken.  Other times the worker is still running or just starting up.

My best guess as to why the memory increased is because the worker is starting up, delaying some amount of data from being written to disk.  This then shows up in the chart as higher memory usage during the talos snapshot.

I'm not sure what to do here.  I think the patch is working as intended as we allow the process to reduce to a lower value when work is not being done.  You can see this in the lower bimodal value.  But it does make measuring memory harder because there is more of a timing element in play now.

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

8 years ago

:erahm, have you seen this bi-modal memory showing up in AWSY since June 29th?

Flags: needinfo?(erahm)

Eric Rahm [:erahm]

Comment 6

•

8 years ago

(In reply to Joel Maher ( :jmaher ) from comment #5)
> :erahm, have you seen this bi-modal memory showing up in AWSY since June
> 29th?

Note: AWSY in automation is only running linux64-opt, non-e10s.

It does look like we're seeing up to 11MiB swings in our RSS start measurement (explicit has remained flat), *but* that measurement has always been volatile and it doesn't appear to be worse than before.

Maybe we can tweak that pref in testing to get more stable results?

Flags: needinfo?(erahm)

Joel Maher ( :jmaher ) (UTC -8)

Comment 7

•

8 years ago

:bkelly, we are merging to aurora next week- is there anything to do here?

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 8

•

8 years ago

Baseline:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6103134ef5e

Tweak timeout to 25 seconds:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=64e48394c9c8

Tweak timeout to 35 seconds:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbbe9f7a45cf

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 9

•

8 years ago

Joel, what am I doing wrong?  Its been over 24 hours and these have still not run.  I'm in a spec meeting today and tomorrow so don't have a lot of time to debug this.

Flags: needinfo?(bkelly) → needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

8 years ago

I have been chipping away at bumping up the priority of the tp jobs, and now I see all tests have completed, most of them without my help.

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

8 years ago

after looking at the data in compare view, both of these options are not giving us any gains.  Possibly with 40 data points we could see the rate of bi-modality could change, but as it stands with 6 data points we still show a common bi-modal distribution.

25 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=64e48394c9c8cea1b0cb0f26f32ddd080108f0b8&framework=1&filter=private&showOnlyImportant=0

35 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=dbbe9f7a45cfc5749923ae408fbd910a66137cb5&framework=1&filter=private&showOnlyImportant=0


it seems that for both opt/e10s and main_rss/private_bytes, both the 25 second delay and 35 second delay all seem to have 1 less high data point (the regression is that we have gone bi-modal by adding hi data points) than the baseline.  My assertion of more data would prove that- I have requested more jobs and we will see if they can get run in a reasonable amount of time to see if we end up with a more agreeable trend.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 12

•

8 years ago

I think we need to see a larger sample set, but it seems to me the 25 second value showed a non-trivial improvement for non-e10s.  The current set of samples for non-e10s with 25 second value is not bimodal at all.

NI, my self to check back on Monday for the additional results.

By the way, note that this pref is only enabled on non-release builds.  So this will not ride the trains to beta/release yet.

Flags: needinfo?(bkelly)

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

•

8 years ago

I did more retriggers and we hvae 16 data points for tp5 e10s on all try pushes mentioned above.  We see no change for main_rss or private bytes.  With more data I see us hitting 75% low values and 25% high values.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 14

•

8 years ago

So do you think its worth trying to land the 25 second pref value?

Flags: needinfo?(bkelly) → needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 15

•

8 years ago

I see no performance advantage of doing that, but if there is another reason for doing so, please do!

Flags: needinfo?(jmaher)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 16

•

8 years ago

Ok, then I think I'd like to mark this WONTFIX.  I believe this bi-modal behavior is a timing artifact of when we try to snapshot RSS values.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WONTFIX

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 17

•

8 years ago

I have another idea of whats going on.

The osfile_async_front.jsm code does this to dispose of the worker:

  this._worker = null;

This depends on GC to actually shutdown the worker.  In the meantime, we could then spin up a new worker since we can't get to the old one.

Let me see if a patch that does this helps at all:

  this._worker.terminate();
  this._worker = null;

Or we could use a weak ref to opportunistically reuse a stale worker thread.

Assignee: nobody → bkelly

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 18

•

8 years ago

Attached patch P1 Add a BasePromiseWorker.prototype.terminate() method. r=yoric — Details — Splinter Review

First we have to expose the underlying ChromeWorker.terminate() to the PromiseWorker wrapper interface.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 19

•

8 years ago

Attached patch P2 Make osfile_async_front.jsm explicitly terminate its worker. r=yoric — Details — Splinter Review

This works in local testing.  Lets do some try builds.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 20

•

8 years ago

Talos baseline:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7cd47d83082

Try and talos build with patches:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f085c51ab239

I hope I got the try syntax right.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 21

•

8 years ago

Joel, can you do your magic to get these to run on winxp?  Still nothing after an overnight run.

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

8 years ago

I have bumped up all the tp5 jobs to priority 4, there are 1500+ winxp jobs in the queue, so p4 won't guarantee it.  I do see that you cancelled a lot of the jobs for winxp, but in the future this is a tp5 only regressions for winxp, so just do -u none -t tp5o[Windows XP] in your try syntax which will save resources on windows xp.

Flags: needinfo?(jmaher)

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 23

•

8 years ago

I didn't cancel those jobs.  I guess it was someone cleaning out old try jobs that they thought were done. :-\

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 24

•

8 years ago

Initial results seem fairly similar.  Trying a few more retriggers to see if more samples reveals a trend.

Ben Kelly [:bkelly, not reviewing]

Assignee

Comment 25

•

8 years ago

There is no appreciable change from this change either.  Moving this back to WONTFIX.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

11 months ago

Product: Toolkit → Toolkit Graveyard

You need to log in before you can comment on or make changes to this bug.