Closed Bug 1000117 Opened 6 years ago Closed 5 years ago

Resident memory usage increase on Mac and Windows since April 9th

Categories

(Core :: JavaScript: GC, defect)


RESOLVED WORKSFORME

People

(Reporter: ioana_damy, Unassigned)

References

Details

(Keywords: regression, Whiteboard: [MemShrink:P3])

Attachments

(2 files)

The resident memory usage increased by about 60MB between April 8th and 9th on Nightly:
04/08:
Explicit memory (MB):	Minimum: 88 / Maximum: 174 / Average: 114
Resident memory (MB):	Minimum: 161 / Maximum: 296 / Average: 208

04/09:
Explicit memory (MB):	Minimum: 88 / Maximum: 182 / Average: 115
Resident memory (MB):	Minimum: 211 / Maximum: 353 / Average: 270

Reports:
http://mozmill-daily.blargon7.com/#/endurance/report/7ab760e27012969ae02e3b9e413acf20
http://mozmill-daily.blargon7.com/#/endurance/report/7ab760e27012969ae02e3b9e418e0c06
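For a quick sanity check, the day-over-day resident memory deltas implied by the numbers above can be computed directly (values in MB, copied from the two report summaries):

```python
# Resident memory (MB) from the 04/08 and 04/09 Nightly endurance runs.
before = {"min": 161, "max": 296, "avg": 208}  # 04/08 build
after = {"min": 211, "max": 353, "avg": 270}   # 04/09 build

# Per-statistic increase between the two builds.
deltas = {k: after[k] - before[k] for k in before}
print(deltas)  # average resident memory is up by ~60 MB
```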

The main differences in tests are:
* testOpenAndCloseMultipleWindows uses ~50MB more memory on the 04/09 build:
Resident memory (min, max, avg)
04/08  175	296	257
04/09  223	353	307

* testVideo_OGVBuffering is only present in the 04/09 run - this is also covered in bug 999449 and might not have much to do with this bug.

Notes:
* This issue is reproducible on all Windows and Mac OS versions, but not on Linux. It is only reproducible on Nightly.
* I am currently investigating this issue locally and will update this report as soon as I get more details.
Here's the pushlog between the two dates:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8883360b1edb&tochange=5811efc11011

CC'ing Peter since he did most of the window work there.
Summary: Resident memory increase on Mac and Windows since April 9th → Resident memory usage increase on Mac and Windows since April 9th
Whiteboard: [MemShrink]
about:memory reports would be useful.
One before and one after, which will let us use "Load and diff".
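For reference, the "Load and diff" comparison can also be approximated offline. A minimal sketch, assuming the about:memory JSON shape with a top-level "reports" array of entries carrying "process", "path", and "amount" fields; the sample data at the bottom is made up purely to show the result shape:

```python
from collections import defaultdict

def diff_memory_reports(before, after):
    """Return per-(process, path) amount deltas between two memory dumps.

    Both arguments are dicts in the about:memory JSON shape, i.e. they
    contain a "reports" list of {"process", "path", "amount", ...} entries.
    """
    def totals(dump):
        t = defaultdict(int)
        for r in dump["reports"]:
            t[(r["process"], r["path"])] += r["amount"]
        return t

    b, a = totals(before), totals(after)
    # Keep only the paths whose totals actually changed.
    return {key: a.get(key, 0) - b.get(key, 0)
            for key in set(a) | set(b)
            if a.get(key, 0) != b.get(key, 0)}

# Made-up sample dumps, just to illustrate the output.
before = {"reports": [{"process": "Main", "path": "explicit/js", "amount": 100}]}
after = {"reports": [{"process": "Main", "path": "explicit/js", "amount": 160}]}
print(diff_memory_reports(before, after))
```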
(In reply to Ioana Budnar, QA [:ioana] from comment #0)
> The main differences in tests are:
> * testOpenAndCloseMultipleWindows uses ~50MB more memory on the 04/09 build:
> Resident memory (min, max, avg)
> 04/08  175	296	257
> 04/09  223	353	307

Scratch that... all the tests use 43 to 65 MB more resident memory on the 04/09 build.

This definitely reproduces locally, even with just one entity and one iteration:
http://mozmill-crowd.blargon7.com/#/endurance/report/c90eeeb0a5511695efc7462553918f7f
http://mozmill-crowd.blargon7.com/#/endurance/report/c90eeeb0a5511695efc7462553920ed6

Given this, I should be able to pinpoint the culprit for this regression by at least tomorrow.

(In reply to Andrew McCreight [:mccr8] from comment #3)
> One before and one after, which will let us use "Load and diff".

I'm using mozmill-env to test this. From what I see, the profile gets created when the run starts and it's removed afterwards, so I can't get any useful about:memory info.

Henrik, is there a way I could bypass this and get the about:memory data?
Yeah, I guess if this is just some automated test run, the test harness will require modification. But it would be really nice if it could be run in some mode where it grabs an about:memory dump right after it measures memory. It doesn't even need to do that every time, so you don't have to deal with creating a giant database of about:memory reports; it would just be very useful for understanding what a regression actually is once one has been identified.
Andrew, is there an API we can call to get this information dumped to the console or disk? Ioana might be able to insert the appropriate snippet in the test module before running the test. Otherwise the endurance tests are pretty simple, so only a couple of steps are involved, which could also be done manually.

Given that all tests are showing this issue, I wonder if we might even be able to see the memory increase without any test being run. That means just starting Firefox and opening about:memory to create a dump.
John should know how to do that.
Flags: needinfo?(jschoenick)
(In reply to Andrew McCreight [:mccr8] from comment #7)
> John should know how to do that.

I know you can use memory-info-dumper to request that reports be written out [1], but if you mean some external API, I'm not sure anything like that exists in desktop builds. njn may know.

[1] http://dxr.mozilla.org/mozilla-central/source/toolkit/components/aboutmemory/content/aboutMemory.js#1878
Flags: needinfo?(jschoenick)
I just opened about:memory in a new profile for the last good and first bad builds. Let me know if this doesn't help, or if there is an external API we could use instead.
Changing component since this is a platform bug, not an issue with the tests...
Component: Mozmill Tests → JavaScript: GC
Product: Mozilla QA → Core
QA Contact: hskupin
So this regressed by landing attachment 8394260 [details] [diff] [review] from bug 984101 then. Thanks Ioana!
Blocks: 984101
Keywords: regression
(In reply to Henrik Skupin (:whimboo) from comment #13)
> So this regressed by landing attachment 8394260 [details] [diff] [review]
> from bug 984101 then. Thanks Ioana!

But that patch didn't affect the allocation of memory at all. 

We've had numerous false positives with the endurance tests in the past. Enough that I'm skeptical of any claimed increase.
Maybe we didn't interpret the values correctly then? Could you please have a look at the given reports? You know best how to read that data.
Whiteboard: [MemShrink] → [MemShrink:P3]
The patch above increased the amount of poisoning we do on dead memory. This allows us to catch more invalid accesses to the GC heap, but it also means we can't decommit. Any discrepancies should go away completely in builds without --enable-js-diagnostics: e.g. everything that is not a nightly. Please re-open if you continue to see the regression on 31 once it hits aurora.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
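For context, the mechanism described above can be illustrated conceptually. This is not SpiderMonkey code; the poison byte and sizes are made up. The key point is that poisoning writes a pattern over every byte of a dead region, which dirties those pages, and dirty pages cannot be decommitted back to the OS, so resident memory stays high even though the memory is logically free:

```python
# Conceptual sketch of GC heap poisoning (illustrative only; the real
# implementation is in SpiderMonkey's C++ GC, and 0x2B is a made-up pattern).
POISON = 0x2B

def free_with_poison(heap: bytearray, start: int, size: int) -> None:
    """Overwrite a dead region so stale reads of it become detectable.

    Writing the pattern touches every page in the region, keeping those
    pages dirty and resident instead of letting the OS decommit them.
    """
    heap[start:start + size] = bytes([POISON]) * size

def is_poisoned(heap: bytearray, start: int, size: int) -> bool:
    return all(b == POISON for b in heap[start:start + size])

heap = bytearray(4096)          # a stand-in for one GC arena
free_with_poison(heap, 1024, 512)
print(is_poisoned(heap, 1024, 512))
```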
(In reply to Terrence Cole [:terrence] from comment #16)
> The patch above increased the amount of poisoning we do on dead memory. This
> allows us to catch more invalid accesses to the GC heap, but it also means
> we can't decommit. Any discrepancies should go away completely in builds
> without --enable-js-diagnostics: e.g. everything that is not a nightly.
> Please re-open if you continue to see the regression on 31 once it hits
> aurora.

The memory didn't decrease after 31 merged to Aurora.
Charts: 
http://mozmill-daily.blargon7.com/#/endurance/charts?app=All&branch=31.0&platform=Mac&from=2014-04-01&to=2014-05-19
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Flags: needinfo?(terrence)
The patch in question doesn't actually add any code to aurora. How sure are we that this patch is the regressor?
Looking at Firefox 31 and 32, I see that whenever they reach Beta the memory consumption suddenly goes down:

- Firefox 31 - Mac OS X 10.7.5 - Average Resident Memory goes down by ~70 MB
     - Aurora 31.0a2 from June 7 (20140607004002) 
     Resident memory (MB) = Minimum: 205 / Maximum: 364 / Average: 281
     http://mozmill-daily.blargon7.com/#/endurance/report/c69af0ad3436e0d472c1ca8a06835476
     - BETA 31.0 from June 16 (20140616143923) 
     Resident memory (MB) = Minimum: 150 / Maximum: 289 / Average: 213
     http://mozmill-release.blargon7.com/#/endurance/report/6959f9a82369610967633890838b691f

- Firefox 32 - Mac OS X 10.9.4 - Average Resident Memory goes down by ~70 MB
     - Aurora 32.0a2 from July 20 (20140720004002) 
     Resident memory (MB) = Minimum: 206 / Maximum: 339 / Average: 270
     http://mozmill-daily.blargon7.com/#/endurance/report/68ae90e16718c6da68596418405b7e3e
     - BETA 32.0 from July 31 (20140731191115) 
     Resident memory (MB) = Minimum: 151 / Maximum: 283 / Average: 203
     http://mozmill-release.blargon7.com/#/endurance/report/533154d2e18fa05bbdf2eaada8c4cf84

Does this have anything to do with this issue? Why would memory usage suddenly drop when a version enters Beta?

Note that Nightly 34 and Aurora 33 were also until recently ~70 MB larger than Beta 32 (a drop of ~40 MB appeared a few days ago in Nightly, and now it seems in Aurora as well, but this may be unrelated to the current bug).
A memory drop like this could indicate that a feature which is enabled in Nightly and Aurora hasn't made it into Beta yet. I don't know which features we currently have active for testing. Maybe you know that better than me.

Otherwise I think we have different build configs for Beta and Release, but I cannot say specifically what differs.

Florin, is there a specific test that shows this problem in detail?
The memory consumption drop is not specific to a single test, which makes me think that perhaps it is not caused by some feature not being enabled in Beta; in fact, each of the 18 tests exhibits this drop in pretty much the same manner. I was thinking that maybe this has to do with Terrence's comment 16:
> The patch above increased the amount of poisoning we do on dead memory. This
> allows us to catch more invalid accesses to the GC heap, but it also means
> we can't decommit. Any discrepancies should go away completely in builds
> without --enable-js-diagnostics: e.g. everything that is not a nightly.
> Please re-open if you continue to see the regression on 31 once it hits
> aurora.

However, Terrence says this drop should have been seen in Aurora as well as Beta, but I only see it in Beta. So I wonder if this actually is what Terrence described, but for some reason applies only to Beta. If this can be confirmed, then I think we could close this issue.

I also wonder... when changes like this are made, which affect the memory consumption, where should we look for them, or where are they announced? Discovering the issues during endurance testing and having to spend days to investigate them seems seriously counterproductive.
When querying MXR for --enable-js-diagnostics I see that for nightlies this flag is not used for builds on Linux. So I bet we don't see this difference on that platform? If that is the case, it might indeed be related.

For the other questions I agree, and we should figure out a reasonable process for what we could and should do. But for that we need feedback from the MemShrink team. Let's set needinfo for all of them.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(continuation)
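For reference, the flag discussed above is a configure option. In a mozconfig it would appear roughly as follows (a sketch only; exact placement depends on the build setup):

```shell
# Nightly-only JS engine diagnostics (enables extra GC checks such as
# the dead-memory poisoning discussed in comment 16).
ac_add_options --enable-js-diagnostics
```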
> I also wonder... when changes like this are made, which affect the memory
> consumption, where should we look for them, or where are they announced?

They're hard to predict. Especially regressions, which are rarely intentional.

> Discovering the issues during endurance testing and having to spend days to
> investigate them seems seriously counterproductive.

I'll be blunt: I can't remember the last time the endurance tests detected a regression that turned out to be real, rather than some artifact of the testing. AFAICT they don't provide value. In contrast, areweslimyet.com has detected numerous regressions. In theory the two systems are quite similar, so I don't know what explains the difference in efficacy.
Flags: needinfo?(n.nethercote)
(In reply to Henrik Skupin (:whimboo) from comment #22)
> When querying MXR for --enable-js-diagnostics I see that for nightlies this
> flag is not used for builds on Linux. So I bet we don't see this difference
> on that platform? If that is the case, it might be indeed related.

The issue shows only for Mac and Windows. Linux shows a very small drop in memory consumption between Nightly, Aurora and Beta.

(In reply to Nicholas Nethercote [:njn] from comment #23)
> > I also wonder... when changes like this are made, which affect the memory
> > consumption, where should we look for them, or where are they announced?
> 
> They're hard to predict. Especially regressions, which are rarely
> intentional.

I'm specifically referring to changes such as the one Terrence mentioned in comment 16, which seem like the kind of change that people know would cause a large increase in memory consumption and should probably be announced somewhere. Regressions are a whole other story. My problem is that this issue was treated as a regression, but it seems it may not be one after all.
(In reply to Nicholas Nethercote [:njn] from comment #23)
> > Discovering the issues during endurance testing and having to spend days to
> > investigate them seems seriously counterproductive.
> 
> I'll be blunt: I can't remember the last time the endurance tests detected a
> regression that turned out to be real, rather than some artifact of the
> testing. AFAICT they don't provide value. In contrast, areweslimyet.com has
> detected numerous regressions. In theory the two systems are quite similar,
> so I don't know what explains the difference in efficacy.

I cannot give real stats on that, given that endurance tests were never actually my focus. Dave and Andreea, do you remember the rate of regressions we detected? If it turns out it is really that low, I wonder if we should stop those tests, given that they take up a lot of time in our CI.
Flags: needinfo?(dave.hunt)
Flags: needinfo?(andreea.matei)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #24)
> > When querying MXR for --enable-js-diagnostics I see that for nightlies this
> > flag is not used for builds on Linux. So I bet we don't see this difference
> > on that platform? If that is the case, it might be indeed related.
> 
> The issue shows only for Mac and Windows. Linux shows a very small drop in
> memory consumption between Nightly, Aurora and Beta.

In that case this increase was expected and matches what I thought in comment 22, so this bug may be invalid. From our side it would have been good if those changes were announced by the devs, which would save us the time we spend on investigation. Maybe we need a better/closer relationship here.
(In reply to Henrik Skupin (:whimboo) from comment #25)
> (In reply to Nicholas Nethercote [:njn] from comment #23)
> > > Discovering the issues during endurance testing and having to spend days to
> > > investigate them seems seriously counterproductive.
> > 
> > I'll be blunt: I can't remember the last time the endurance tests detected a
> > regression that turned out to be real, rather than some artifact of the
> > testing. AFAICT they don't provide value. In contrast, areweslimyet.com has
> > detected numerous regressions. In theory the two systems are quite similar,
> > so I don't know what explains the difference in efficacy.
> 
> I cannot give real stats on that, given that endurance tests were never
> actually my focus. Dave and Andreea, do you remember the rate of regressions
> we detected? If it turns out it is really that low, I wonder if we should
> stop those tests given that they take away a lot of time of our CI.

I can only recall one regression that the endurance tests identified that turned out to be genuine, and a handful that either turned out to be expected changes or for which we were unable to pinpoint a cause. I haven't worked on these tests in a long time, and if AWSY (which I believe even uses some of the code from the endurance tests) is providing more benefit, then perhaps it is time to consider stopping these tests.
Flags: needinfo?(dave.hunt)
We may not want to completely stop running our tests, but instead leave them enabled for Nightly builds. I put up a proposal on our mailing list: https://groups.google.com/forum/#!topic/mozilla.dev.automation/NRdiCL_Cvk0
Flags: needinfo?(continuation)
Yeah, few real regressions and a lot of time invested, so I gave my input in the thread about removing the test runs below Nightly.
Flags: needinfo?(andreea.matei)
We pushed this as part of https://github.com/mozilla/mozmill-ci/issues/466. It will be live on production tomorrow.

I think there is no actionable item left here on this bug given that the initially reported issue is expected. Closing as WFM.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Flags: needinfo?(terrence)
Resolution: --- → WORKSFORME