Closed
Bug 1284621
Opened 9 years ago
Closed 9 years ago
6.11 - 8.94% tp5o Main_RSS / tp5o Private Bytes (windowsxp) regression on push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5 (Tue Jun 28 2016)
Categories
(Toolkit Graveyard :: OS.File, defect)
Tracking
(firefox50 affected)
RESOLVED
WONTFIX
| | Tracking | Status |
|---|---|---|
| firefox50 | --- | affected |
People
(Reporter: rwood, Assigned: bkelly)
References
Details
(Keywords: perf, regression, talos-regression)
Attachments
(2 files)
Talos has detected a Firefox performance regression from push f2110c264778e98b9f09be6c8c3e410b4fb1c1f5. Since you authored one of the patches included in that push, we need your help to address this regression.
Summary of tests that regressed:
tp5o Main_RSS windowsxp opt e10s - 6.11% worse
tp5o Private Bytes windowsxp opt e10s - 8.94% worse
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=1627
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Assignee
Comment 1•9 years ago
I checked talos before landing that patch and it showed no regression:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportant=0
Comment 2•9 years ago
(In reply to Ben Kelly [:bkelly] from comment #1)
> I checked talos before landing that patch and it showed no regression:
>
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=f358ff3e819d4478252c2ae58d699658cc836ae1&framework=1&showOnlyImportant=0
The regressions are on Windows XP, which I don't see in that compare view.
Updated•9 years ago
Keywords: talos-regression
Whiteboard: talos_regression
Reporter
Comment 3•9 years ago
This regression is a bit different in that it has caused the data to actually become bimodal. As you can see here on windowsxp:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,c224a6af64b18a61a2aecd530bd7b05fc46cc778,1,1%5D&series=%5Bfx-team,c224a6af64b18a61a2aecd530bd7b05fc46cc778,1,1%5D&zoom=1466997576457.8948,1467488044181.0527,170985505.47724184,189536230.114923
And here for windows7-32:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,fcd6b5e848a5e81cc798d360e10520e0ea014f60,1,1%5D&series=%5Bfx-team,fcd6b5e848a5e81cc798d360e10520e0ea014f60,1,1%5D&zoom=1466970443600,1467639029280,110695650.40477808,128956519.96999547
Assignee
Comment 4•9 years ago
Looking at the graph it appears to have become bimodal after bug 1279389 landed. This kind of makes sense. Sometimes the worker is killed before the measurement is taken. Other times the worker is still running or just starting up.
My best guess as to why the memory increased is because the worker is starting up, delaying some amount of data from being written to disk. This then shows up in the chart as higher memory usage during the talos snapshot.
I'm not sure what to do here. I think the patch is working as intended, since the process's memory now drops to a lower value when no work is being done. You can see this in the lower bimodal value. But it does make measuring memory harder because there is more of a timing element in play now.
Comment 5•9 years ago
:erahm, have you seen this bi-modal memory showing up in AWSY since June 29th?
Flags: needinfo?(erahm)
Comment 6•9 years ago
(In reply to Joel Maher ( :jmaher ) from comment #5)
> :erahm, have you seen this bi-modal memory showing up in AWSY since June
> 29th?
Note: AWSY in automation is only running linux64-opt, non-e10s.
It does look like we're seeing up to 11MiB swings in our RSS start measurement (explicit has remained flat), *but* that measurement has always been volatile and it doesn't appear to be worse than before.
Maybe we can tweak that pref in testing to get more stable results?
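For illustration, pinning that delay in a test profile could look roughly like the sketch below. The pref name "osfile.reset_after_ms" is an assumption, not something named in this bug; the actual key would need to be confirmed in osfile_async_front.jsm.
// Sketch only: force a fixed OS.File worker idle-shutdown delay so memory
// snapshots are less sensitive to whether the worker happens to be alive.
// "osfile.reset_after_ms" is an assumed pref name - verify before use.
const { Services } = Components.utils.import(
  "resource://gre/modules/Services.jsm", {}
);
Services.prefs.setIntPref("osfile.reset_after_ms", 25000); // e.g. 25 seconds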
Flags: needinfo?(erahm)
Comment 7•9 years ago
:bkelly, we are merging to aurora next week - is there anything to do here?
Flags: needinfo?(bkelly)
Assignee
Comment 8•9 years ago
Baseline:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6103134ef5e
Tweak timeout to 25 seconds:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=64e48394c9c8
Tweak timeout to 35 seconds:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbbe9f7a45cf
Assignee
Comment 9•9 years ago
Joel, what am I doing wrong? It's been over 24 hours and these have still not run. I'm in a spec meeting today and tomorrow so I don't have a lot of time to debug this.
Flags: needinfo?(bkelly) → needinfo?(jmaher)
Comment 10•9 years ago
I have been chipping away at bumping up the priority of the tp jobs, and now I see all tests have completed, most of them without my help.
Flags: needinfo?(jmaher)
Comment 11•9 years ago
After looking at the data in compare view, neither of these options is giving us any gains. Possibly with 40 data points the rate of bi-modality could change, but as it stands with 6 data points we still show a common bi-modal distribution.
25 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=64e48394c9c8cea1b0cb0f26f32ddd080108f0b8&framework=1&filter=private&showOnlyImportant=0
35 seconds vs baseline:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6103134ef5e&newProject=try&newRevision=dbbe9f7a45cfc5749923ae408fbd910a66137cb5&framework=1&filter=private&showOnlyImportant=0
It seems that for both opt/e10s and main_rss/private_bytes, the 25 second and 35 second delays each have one fewer high data point than the baseline (the regression is that we have gone bi-modal by adding high data points). More data would confirm that assertion; I have requested more jobs and we will see if they run in a reasonable amount of time and whether we end up with a more agreeable trend.
Assignee
Comment 12•9 years ago
I think we need to see a larger sample set, but it seems to me the 25 second value showed a non-trivial improvement for non-e10s. The current set of samples for non-e10s with 25 second value is not bimodal at all.
NI'ing myself to check back on Monday for the additional results.
By the way, note that this pref is only enabled on non-release builds. So this will not ride the trains to beta/release yet.
Flags: needinfo?(bkelly)
Comment 13•9 years ago
I did more retriggers and we have 16 data points for tp5 e10s on all try pushes mentioned above. We see no change for main_rss or private bytes. With more data I see us hitting 75% low values and 25% high values.
Assignee
Comment 14•9 years ago
So do you think it's worth trying to land the 25 second pref value?
Flags: needinfo?(bkelly) → needinfo?(jmaher)
Comment 15•9 years ago
I see no performance advantage of doing that, but if there is another reason for doing so, please do!
Flags: needinfo?(jmaher)
Assignee
Comment 16•9 years ago
Ok, then I think I'd like to mark this WONTFIX. I believe this bi-modal behavior is a timing artifact of when we try to snapshot RSS values.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Assignee
Comment 17•9 years ago
I have another idea of what's going on.
The osfile_async_front.jsm code does this to dispose of the worker:
this._worker = null;
This depends on GC to actually shut down the worker. In the meantime, we could spin up a new worker because we can no longer reach the old one.
Let me see if a patch that does this helps at all:
this._worker.terminate();
this._worker = null;
Or we could use a weak ref to opportunistically reuse a stale worker thread.
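A rough sketch of that idea, assuming the worker handle really is held in this._worker (this is not the landed patch):
// Terminate the ChromeWorker explicitly instead of waiting for GC to
// collect it; otherwise a new worker can be spawned while the old one is
// still alive, inflating RSS / Private Bytes at snapshot time.
if (this._worker) {
  this._worker.terminate();
  this._worker = null;
}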
Assignee: nobody → bkelly
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee
Comment 18•9 years ago
First we have to expose the underlying ChromeWorker.terminate() to the PromiseWorker wrapper interface.
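Something along these lines, as a hypothetical sketch of the wrapper method (the _worker field name and lazy-creation behavior are assumptions about PromiseWorker.jsm, not the actual patch):
// Hypothetical PromiseWorker wrapper method: forward termination to the
// underlying ChromeWorker, if one has been created, and drop the reference
// so a fresh worker can be created on the next post().
terminate() {
  if (this._worker) {
    this._worker.terminate();
    this._worker = null;
  }
},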
Assignee
Comment 19•9 years ago
This works in local testing. Let's do some try builds.
Assignee
Comment 20•9 years ago
Talos baseline:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7cd47d83082
Try and talos build with patches:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f085c51ab239
I hope I got the try syntax right.
Assignee
Comment 21•9 years ago
Joel, can you do your magic to get these to run on winxp? Still nothing after an overnight run.
Flags: needinfo?(jmaher)
Comment 22•9 years ago
I have bumped up all the tp5 jobs to priority 4; there are 1500+ winxp jobs in the queue, so p4 won't guarantee it. I do see that you cancelled a lot of the jobs for winxp. In the future, since this is a tp5-only regression for winxp, just do -u none -t tp5o[Windows XP] in your try syntax, which will save resources on Windows XP. An example push is shown below.
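For reference, a full push along those lines might look something like this (the -b o / -p win32 parts are assumptions for an opt win32 build, not taken from this bug):
try: -b o -p win32 -u none -t tp5o[Windows XP]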
Flags: needinfo?(jmaher)
Assignee
Comment 23•9 years ago
I didn't cancel those jobs. I guess it was someone cleaning out old try jobs that they thought were done. :-\
Assignee
Comment 24•9 years ago
Initial results seem fairly similar. Trying a few more retriggers to see if more samples reveal a trend.
Assignee
Comment 25•9 years ago
There is no appreciable change from this patch either. Moving this back to WONTFIX.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → WONTFIX
Updated•2 years ago
Product: Toolkit → Toolkit Graveyard