Open Bug 1172193 Opened 9 years ago Updated 2 years ago

Heuristics for zone GCs are bad

Categories

(Core :: JavaScript: GC, defect)

defect

Tracking

()

Tracking Status
firefox41 --- affected

People

(Reporter: billm, Unassigned)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(1 file)

I'm filing this based on bug 1164014. There's a workaround there in comment 50, but we really need to fix the problem generally. The STR to get the bad behavior is:

1. Compile an opt build that doesn't include the patch in bug 1164014 comment 50.
2. Run mach mochitest -f browser browser_parsable_script.js
3. Observe "MEMORY STAT residentFast after test:" near the end of the log. For me, on 64bit Linux, it's ~1.1GB. That's very high. For a different test I tried, it's 190MB.

What's this test doing? The core of it is here:
http://mxr.mozilla.org/mozilla-central/source/browser/base/content/test/general/browser_parsable_script.js#44
It reads in all the chrome scripts in Firefox and parses them with Reflect.parse. The AST for each file is fairly large, but it becomes garbage as soon as the file is parsed. So, in theory, we should only use as much memory as we need for the largest file (plus GC overhead).

Here's where things go wrong. We import Reflect.parse from a module called reflect.jsm, which lives in a different zone than the test code. So the AST is actually allocated in reflect.jsm's zone. The test just gets a cross-compartment wrapper to it.

We're correctly doing ALLOC_TRIGGER and EAGER_ALLOC_TRIGGER GCs for reflect.jsm's zone. However, we can't collect any of the data in it because the CCWs from the test aren't dying. They aren't dying because we don't collect the test zone. And we don't collect the test zone because not much data is being allocated there, so none of our triggers hit.

The solution is to somehow collect both zones. If I artificially call scheduleGC(this) after each Reflect.parse call from the test, then everything is good.

I think the right way to handle this might as follows: whenever we are about to GC a set of zones, look at all the CCWs coming into it from other zones. Those zones should also be added to the set to be collected. Keep doing this until we reach a fixed point.

A long time ago I think we would have always ended up with all the zones in this set. However, all the e10s work we've been doing has had the side effect of partitioning things a lot better. My hope is that we should be able to set up the zones so that content zones have no incoming pointers from the system zone.

One thing we'll need to change for that to work is here:
http://mxr.mozilla.org/mozilla-central/source/dom/base/nsFrameMessageManager.cpp#1845
We create the global for frame scripts (which, in e10s, run in the child process and are expected to touch content pages) in the system zone. We need to change that so the global is created in the same zone as web content.

There may be other places that need to be fixed, but I think it should be feasible. Terrence, do you think you or someone else could take this on?
Flags: needinfo?(terrence)
Whiteboard: [MemShrink]
We discussed this today at the GC meeting. Jon already has a patch for the first half of it that appears to work on the given testcase. Now that we have a regression mailing list for GC telemetry, we should notice any adverse reactions from collecting more compartments, so we can push this without too much worry.

None of us at all understood the second bit about system globals, however. Bill, could you expand a bit?
Flags: needinfo?(terrence) → needinfo?(wmccloskey)
Only 10% of GCs are zone GCs, so this probably won't have too much effect on telemetry.

The system global thing is just to try to improve performance so that we don't always turn zone GCs into full GCs. Perhaps we could talk about it at the work week. I think we can land Jon's patch without it.
Flags: needinfo?(wmccloskey)
Whiteboard: [MemShrink] → [MemShrink:P2]
FYI, bug 1172468 is nearing the point of permafail, so anything that can be done Really Soon Now would be much appreciated lest we have to resort to disabling the test instead.
No longer blocks: 1172468
Comment on attachment 8617913 [details] [diff] [review]
bug1172193-collect-associated-zones

Review of attachment 8617913 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8617913 - Flags: review?(terrence) → review+
sorry had to back this out in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4d38c2c86ba4 seems one of this changes caused a linux memory leak like :


https://treeherder.mozilla.org/logviewer.html#?job_id=10663696&repo=mozilla-inbound
Flags: needinfo?(jcoppeard)
This patch regressed our V8 score by about 6% while it was checked in.  I guess the system global change would help there.
Flags: needinfo?(jcoppeard)
Talos confirmed the v8 regression and improvement upon backout.
(In reply to Joel Maher (:jmaher) from comment #9)
> Talos confirmed the v8 regression and improvement upon backout.

Same thing on the flame:
http://arewefastyet.com/regressions/#/regression/1773134

Except for SplayLatency, which regressed after the backout.
(In reply to Bill McCloskey (:billm) from comment #0)
> A long time ago I think we would have always ended up with all the zones in
> this set. However, all the e10s work we've been doing has had the side
> effect of partitioning things a lot better. My hope is that we should be
> able to set up the zones so that content zones have no incoming pointers
> from the system zone.

Non-SDK addons using sandboxes without setting |sameZoneAs| or possibly interacting with content objects directly from frame scripts might invalidate that assumption, no?

For example I've found extended principal compartments of µBlock and greasemonkey related to content windows living in the system zone (I've already filed issues with each). Other addons are probably doing similar things.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.