Open Bug 1371214 Opened 7 years ago Updated 2 years ago

18.77% Resident Memory (android-api-15-gradle) regression on push 1c66da2b1e88e56a1b4de45a0721808eacb16338 (Wed Jun 7 2017)

Categories

(Core :: XPCOM, defect, P3)

53 Branch
defect

Tracking

()

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=1c66da2b1e88e56a1b4de45a0721808eacb16338

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 19%  Resident Memory summary android-api-15-gradle opt      179,889,816.92 -> 213,650,557.25


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=7112

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/AWSY
:smaug, I see you authored bug 1367905, could you take a look at this regression and help figure out if this is expected and what we can do if anything to reduce or remove the regression?
Component: Untriaged → XPCOM
Flags: needinfo?(bugs)
Product: Firefox → Core
as a note, on the perfherder graph you can see the regression, then improvement on the backout.  Now that the patch landed again we see the regression sustained.
odd, I see this is the mochitest-chrome-2 job, not awsy- the data is posted to my awsy dashboard.

It appears this was added as awsy_lite 11 months ago for android in bug 1233220- in case that helps understand what we are testing.
Changing CC/GC timings affects easily to memory reporting.

When is the measurement done? After running what kinds of tests?
How do I see which of the numbers regressed? The test does several checkpoints, and I could easily see some of those being affected (because we're trying to not GC/CC while browser is busy with other stuff).
I am not familiar with the script, possibly :erahm could help?
Flags: needinfo?(erahm)
Geoff wrote that test, I'm not particularly familiar with android testing. The test follows roughly the same logic as the standard AWSY test but uses slightly different methods.

Looking at the sub-tests we see startup regressed the most, even after a 30 second delay which seems to imply fennec never gets idle enough after startup to run GC/CC:

> Med confidence:
> RSS After tabs [+30s, forced GC] opt   204,187,648.00 ± 1.69%   <   214,956,032.00 ± 1.87%  5.27%   
> RSS After tabs [+30s] opt              206,264,320.00 ± 1.28%   <   214,933,504.00 ± 1.88%  4.20%   
> RSS After tabs opt                     208,576,512.00 ± 0.74%   <   217,415,680.00 ± 1.99%  4.24%   
> 
> High confidence:
> **RSS Fresh start [+30s] opt           153,581,568.00 ± 1.69%   <   231,287,808.00 ± 0.85%  50.60%  
> **RSS Fresh start opt                  179,819,520.00 ± 5.05%   <   289,598,464.00 ± 1.21%  61.05%  
> RSS Tabs closed [+30s, forced GC] opt  163,103,744.00 ± 1.27%   <   179,809,280.00 ± 1.72%  10.24%  
> RSS Tabs closed [+30s] opt             163,103,744.00 ± 1.27%   <   179,809,280.00 ± 1.72%  10.24%  
> RSS Tabs closed opt                    190,893,056.00 ± 1.28%   <   202,011,648.00 ± 1.39%  5.82%   

That we're seeing regressions after "forcing" a GC [1] is interesting as well. smaug, how legit does that function look? Did you changes make explicitly running GC no longer work?

[1] http://searchfox.org/mozilla-central/source/mobile/android/tests/browser/chrome/test_awsy_lite.html#66-105
Flags: needinfo?(erahm) → needinfo?(gbrown)
(In reply to Olli Pettay [:smaug] from comment #6)
> How do I see which of the numbers regressed? The test does several
> checkpoints, and I could easily see some of those being affected (because
> we're trying to not GC/CC while browser is busy with other stuff).

Hover over the result name in the alert and then click 'subtests'.
Well, we do run GC/CC also using timers if there isn't idle time.

I didn't change the code which is used to explicitly trigger GC.
Can RSS stay higher if we at some point used more memory and then it got fragmented?
Flags: needinfo?(bugs)
Where do I see those individual test numbers?
Treeherder shows just "Resident Memory summary opt: 211770140"
Flags: needinfo?(erahm)
There's more info in the test log, if that's convenient:

https://public-artifacts.taskcluster.net/SHEUkTJpTs-X2OyGbSp-bg/0/public/logs/live_backing.log

09:25:08     INFO -  101 INFO TEST-START | mobile/android/tests/browser/chrome/test_awsy_lite.html
09:25:08     INFO -  102 INFO Fresh start | Resident Memory: 290934784
09:25:52     INFO -  103 INFO Fresh start [+30s] | Resident Memory: 231944192
09:25:52     INFO -  104 INFO opening tab with url [http://mochi.test:8888/chrome/mobile/android/tests/browser/chrome/tp5/baidu.com/www.baidu.com/s@wd=mozilla.html]
09:26:02     INFO -  105 INFO opening tab with url [http://mochi.test:8888/chrome/mobile/android/tests/browser/chrome/tp5/twitter.com/twitter.com/ICHCheezburger.html]
09:26:24     INFO -  106 INFO opening tab with url [http://mochi.test:8888/chrome/mobile/android/tests/browser/chrome/tp5/msn.com/www.msn.com/index.html]
09:26:46     INFO -  107 INFO opening tab with url [http://mochi.test:8888/chrome/mobile/android/tests/browser/chrome/tp5/163.com/www.163.com/index.html]
09:27:51     INFO -  108 INFO opening tab with url [http://mochi.test:8888/chrome/mobile/android/tests/browser/chrome/tp5/bbc.co.uk/www.bbc.co.uk/news/index.html]
09:28:13     INFO -  109 INFO After tabs | Resident Memory: 221622272
09:28:44     INFO -  110 INFO After tabs [+30s] | Resident Memory: 218329088
09:28:44     INFO -  111 INFO After tabs [+30s, forced GC] | Resident Memory: 218345472
09:28:44     INFO -  112 INFO Tabs closed | Resident Memory: 203350016
09:29:17     INFO -  113 INFO Tabs closed [+30s] | Resident Memory: 181518336
09:29:17     INFO -  114 INFO Tabs closed [+30s, forced GC] | Resident Memory: 181518336
09:29:17     INFO -  115 INFO PERFHERDER_DATA: {"framework": {"name": "awsy"}, "suites": [{"name": "Resident Memory", "subtests": [{"name": "Fresh start", "value": 290934784}, {"name": "Fresh start [+30s]", "value": 231940096}, {"name": "After tabs", "value": 221622272}, {"name": "After tabs [+30s]", "value": 218329088}, {"name": "After tabs [+30s, forced GC]", "value": 218345472}, {"name": "Tabs closed", "value": 203350016}, {"name": "Tabs closed [+30s]", "value": 181518336}, {"name": "Tabs closed [+30s, forced GC]", "value": 181518336}], "value": 216217447}]}
09:29:17     INFO -  116 INFO TEST-PASS | mobile/android/tests/browser/chrome/test_awsy_lite.html | memory logging complete -- view results in Perfherder
09:29:17     INFO -  117 INFO TEST-OK | mobile/android/tests/browser/chrome/test_awsy_lite.html | took 244042ms
Flags: needinfo?(erahm)
I don't see any other unanswered questions for me...let me know if I can be of help.

Also cc'd snorp in case he is interested / has something to add.
Flags: needinfo?(gbrown)
Am I reading this right that we are running a performance-like test on android emulator hardware?  Or does this somehow run on a real device?

FWIW I don't think we should tune GC/CC for android emulator.
This test is indeed running on an android emulator, which is running on an aws host. The previous iteration of awsy-for-android ran on real hardware but was always breaking. The thinking was something like "even though the emulator is limited, it is reliable and scalable".

It occurs to me that today, another option is to run something like this on Autophone, using a real device.
The android emulator is something like an order of magnitude slower than our slowest real physical device.  While memory seems like it shouldn't be timing dependent, it definitely is given we have to choose when to run GC/CC.
overall this is pretty stable.  All of the AWSY desktop tests run on AWS VMs (I know emulators are another factor of slow on top of that).

Autophone might be a real option.  I don't want to discount what we have as it is pretty stable (if you accept a 15MB range of values):
https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-inbound,5d7ccb5c36cbfb7c60a7b06e92f07eaa9e59b122,1,4%5D
I see bug 1367905 backed out and have seen a few improvements, working on analyzing them.
I'll probably land bug 1367905 today again.
The re-land considerably dimmed down the original regressions, but 4-5% from them still remained:

== Change summary for alert #7193 (as of June 10 2017 00:34 UTC) ==

Regressions:

  5%  Resident Memory summary android-4-2-x86 opt      182,034,354.17 -> 190,508,672.67
  4%  Resident Memory summary android-api-15-gradle opt 179,586,539.42 -> 186,667,111.08

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7193
I noticed you pushed a bigger patch, which involved 3 bugs, including this one. If either bug 1371224 or bug 1371438 is related to this issue, please let me know in a comment.
Flags: needinfo?(bugs)
This is still from bug 1367905. The difference between the first landing and the current one is
https://bug1367905.bmoattachments.org/attachment.cgi?id=8876336
Especially 
-  int64_t budget = sActiveIntersliceGCBudget;
+  // We use longer budgets when timer runs since that means
+  // there hasn't been idle time recently and we may have significant amount
+  // garbage to collect.
+  int64_t budget = sActiveIntersliceGCBudget * 2;

Before bug 1367905 we collected 10ms after tick and 40ms after timer (which run only when refresh driver wasn't ticking), 
now 5-50ms during idle time, and 10ms when timer runs (it runs if there hasn't been idle time).

Perhaps I need to bring back some of the old behavior that when there isn't idle time, we collect up to 10ms after tick, though, the test doesn't seem to load pages too fast after previous ones. There should be idle time.
Flags: needinfo?(bugs)
Can we measure if this shows a regression on a real device?  I really don't think we should be tuning for the emulator. (Sorry for repeating myself.)
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.