Closed Bug 1110928 Opened 10 years ago Closed 7 years ago

Per-document GC poking should not trigger a full GC

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 2 open bugs)

Details

Attachments

(8 files, 13 obsolete files)

1.88 KB, patch
mccr8
: review+
Details | Diff | Splinter Review
1.47 KB, patch
mccr8
: review+
Details | Diff | Splinter Review
7.31 KB, patch
Details | Diff | Splinter Review
740 bytes, patch
mccr8
: review+
Details | Diff | Splinter Review
3.63 KB, patch
Details | Diff | Splinter Review
1.86 KB, patch
Details | Diff | Splinter Review
946 bytes, patch
jonco
: review+
Details | Diff | Splinter Review
27.74 KB, image/png
Details
One thing that limits the effectiveness of bug 1052793 is that whenever you open a new tab, or maybe even switch tabs, we fire off a bunch of PokeGCs, including SET_NEW_DOCUMENT in nsGlobalWindow::SetNewDocument(), PAGE_HIDE in nsDocumentViewer::PageHide(), SET_DOC_SHELL in nsGlobalWindow::DetachFromDocShell(), and to some extent NSJSCONTEXT_DESTROY in nsJSContext::DestroyJSContext().  I wonder if we can get rid of some or all of these now while still destroying pages in a timely fashion.  Maybe we can trigger something else in those cases, or be more selective about it.
Actually, a better fix for these might be to change them to some kind of "ZonePokeGC" that specifies a particular zone they are being triggered for.  A ZonePokeGC would not make us go to a full GC, but would just add the zone to the set of zones being considered.  It might be a little tricky to get this to work with my patch in bug 1052793, but maybe not.
Summary: Consider removing some of the old GC triggers → Per-document GC poking should not trigger a full GC
The API would be a little nicer if it just passes in an object, and nsJSEnvironment can figure out how to get a zone out of that.
Well, I think we could remove at least SET_NEW_DOCUMENT.

Other cases are more likely to cause garbage.
Assignee: nobody → continuation
Depends on: 1111076
Attached patch Require PokeGC take a zone. WIP (obsolete) — Splinter Review
This greatly reduces how many zones we collect.  Whenever we poke the GC, we give it a zone we're poking about.

If we triggered the GC from a PokeGC, we only collect zones we were poked about.
Depends on: 1052793
Depends on: 1116821
nsDocumentViewer knows which document is involved which will help us later.

Also, fix a typo in the comment.
Attachment #8543327 - Flags: review?(bugs)
This ensures that the document isn't destroyed when we call PokeGC, which will be useful later.
Attachment #8543328 - Flags: review?(bugs)
I'm going to wait until bug 1116821 gets reviewed before I ask for review for this.

try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5d619be17e91
Attachment #8535898 - Attachment is obsolete: true
Attachment #8543327 - Flags: review?(bugs) → review+
Attachment #8543328 - Flags: review?(bugs) → review+
Backed out this and bug 1116821 in https://hg.mozilla.org/integration/mozilla-inbound/rev/b9f40d0310d5

At least on Windows (dunno if I didn't know what suites to retrigger, or didn't retrigger enough, to see about other platforms), reftest and web-platform-4 are apparently depending on having per-document GC poking trigger full GCs, since reftests were timing out and w-p-4 was OOMing.
Yeah, oops, I should have done a try run on Windows.  I forgot that's where our OOM problems show up most readily.
Here's the treeherder link:
  https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=9d593597df5f

The reftest failures mostly are like "load failed: timed out waiting for test to complete (waiting for onload scripts to complete)" which is weird, but maybe that's just a symptom of an OOM.  I'll try to do some local analysis of the zones we visit with these tests and how they are affected by the patches.
See Also: → 1173846
Blocks: 1105634
Blocks: 1285355
Rebased. Carrying forward Olli's r+.
Attachment #8773530 - Flags: review+
Attachment #8543327 - Attachment is obsolete: true
Attachment #8543328 - Attachment is obsolete: true
Attachment #8543330 - Attachment is obsolete: true
This adds a new JS API function to schedule a system zone for GC, without requiring that we pass in the zone GC to the embedder.
Attachment #8773532 - Flags: review?(terrence)
This is just a rebased version of the old part 3. Carrying forward review from Olli.
Attachment #8773535 - Flags: review+
Attachment #8773532 - Flags: review?(terrence) → review+
Attached patch part 4b - Fixup. (obsolete) — Splinter Review
This does a bit of fixup. The main thing is that we always schedule the system zone for a non-full GC.

Windows looks fine: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fce8f33de80

I retriggered reftests a bunch of times and there aren't any timeouts.

We still do a full GC for SET_NEW_DOCUMENT with this patch. I guess I need to retry it passing something in.
This makes it so that we pass in the window reflector in another place (rather than just deleting this call), and also it makes sure we always schedule the system zone, which should hopefully make it less leaky while not making times much worse.

try run looks okay so far but I'll wait on more retriggers of C before landing:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=937f2aafe578
Attachment #8776760 - Flags: review?(bugs)
Attachment #8773542 - Attachment is obsolete: true
Comment on attachment 8776760 [details] [diff] [review]
part 4b - Fixup to rolled into part 4.

well I would expect system zone to be by far the largest zone, but sure, should be safer option.
Attachment #8776760 - Flags: review?(bugs) → review+
I did see a reftest timeout on one of the retriggers I did, and it looked like it was probably timing out because it was freeing a ton of windows all at once. I'll have to investigate how bad that is compared to without my patches, and if I can do anything to mitigate it. (Maybe something hacky like only clearing the table of zones to use every 10 seconds or so would help.)
Depends on: 1292218
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/721fabf360ff
part 1 - Hoist the LOAD_END PokeGC out of nsJSContext::LoadEnd. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/c5bc2d6ed0d7
part 2 - Call PokeGC in nsDocumentViewer::PageHide before the call to OnPageHide. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/325bdb8f8f80
part 3 - Add a method to schedule the system zone for GC. r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/31b56ae647c8
part 4 - Try to pass a relevant zone to PokeGC. r=smaug
Backed out for frequent failures to load onload scripts in reftests on Winodws XP and Windows 7 VM:

https://hg.mozilla.org/integration/mozilla-inbound/rev/18fd827d3e02e5aa8ea2a1ec68e0b8bf35f568fb
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c14a4135dc7e30779963ae07eb320a627deb605
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa452d546d6cbda413d5139447f89c4df303cbe7
https://hg.mozilla.org/integration/mozilla-inbound/rev/9ede0b178fbc87559385a9ffb8701c0de770e9c8

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=31b56ae647c8d23dfa9ea36ff34626bc93878fdb
Had to backfill the jobs here. Take note that this is the only push where Win 7 VM's R fails but Ru passes.

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=34324120&repo=mozilla-inbound
> REFTEST TEST-UNEXPECTED-FAIL | file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/w3c-css/submitted/writing-modes-3/text-combine-upright-compression-005.html | load failed: timed out waiting for test to complete (waiting for onload scripts to complete)
Flags: needinfo?(continuation)
This backout also fixed a permafail of Android 4.3 API15+ debug tc-M(1):

test_multiModuleLargeImports.html | application crashed [@ js::CompartmentChecker::check]

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=34344284&repo=mozilla-inbound

This failure had started on the push after this push (caching issue?).
In the log in comment 21, the test is freeing about 50 windows, which isn't great, but it isn't too bad in the scheme of things. The test also just sits there for a long time after that.

I see this in the log before the failure:
  [GFX3-]: DrawTargetSkia(125A7400): Failure to create source surface from data. Size: Size(800,1000)
  JavaScript error: chrome://reftest/content/reftest.jsm, line 1463: NS_ERROR_FAILURE:
  JavaScript error: chrome://reftest/content/reftest-content.js, line 1090: TypeError: ret is undefined 

Maybe we're hitting some OOM condition which causes the test to stop working.

Comment 21 is weird. I'll file a separate bug for it, but it could be that we run out of memory and then unmarkgray fails or something.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #23)
> Comment 21 is weird.
That should be comment 22.
Depends on: 1301189
Bug 1301189 fixes the failure in comment 22, but I'm still seeing the OOM-looking graphics assertions, on Win7 debug:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=e4d2e4913f5f&selectedJob=27168691
I wonder if we could GC only page zone after unload and load.
But in case a tab is closed (top level window removed from docshell), we'd GC also system zone since there is probably tons of frame script js to collect. (should we possibly even do global GC in that case)
In patch 4b, I made it so that we also always GC the system zone, but it wasn't enough. But yeah maybe I could try the non-full GC just for load and unload.
Attachment #8773535 - Attachment is obsolete: true
I'm pretty sure tests will fail with this, since this is even more aggressive than the precious try, but this is close to what I'd like to see.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c90e5f6dad4c80e38bd7ad49d13c24e6e9f74552
Ah, hmm, I think we don't trigger GC when evicting bfcache. That might make difference.
Pushing a patch to tryserver.
Still windows reftest issues. I think those actually hint about some bug in CanvasRenderingContext2D. It is not releasing some resources early enough?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=78a456b6caf3f4a9e7f9d6f58def75d7193e0546

I wonder if JS_updateMallocCounter should always mark the current zone to be collected during next zone gc. 
But let's see if this helps with the test failures.
Comment on attachment 8827829 [details] [diff] [review]
part6, trigger ZoneGC after evicting from bfcache

Ensure documents evicted from bfcache are collected during the next zone gc.
Attachment #8827829 - Flags: review?(continuation)
Comment on attachment 8827997 [details] [diff] [review]
part 7, ensure the zone related to canvas is collected

This is hacky, but reftest framework seems to create tons of canvas elements and keeping then memory alive.

We may want to change JS_updateMallocCounter to take some object as param and trigger JS::PrepareZoneForGC. But perhaps we can first play with this setup.

I think this stuff shouldn't land before the next merge.
Attachment #8827997 - Flags: review?(continuation)
Attachment #8776760 - Attachment is obsolete: true
Comment on attachment 8773532 [details] [diff] [review]
part 3 - Add a method to schedule the system zone for GC.

not needed
Attachment #8773532 - Attachment is obsolete: true
Comment on attachment 8825645 [details] [diff] [review]
trigger full gc only when destroying top level outer windows or with chrome windows

Trigger full gc when closing a tab or unloading chrome window.
Attachment #8825645 - Flags: review?(continuation)
Comment on attachment 8825645 [details] [diff] [review]
trigger full gc only when destroying top level outer windows or with chrome windows

Review of attachment 8825645 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsGlobalWindow.cpp
@@ +3133,5 @@
>  
>    if (mContext) {
> +    // When we're about to destroy top level content window (for example a tab),
> +    // we trigger full GC by passing null as the last param.
> +    // Same happens whenever we're dealing with chrome windows.

nit: should be "destroy a top level", and "trigger a full GC". I think the second sentence would be clearer as "We also trigger a full GC for chrome windows".

@@ +3138,2 @@
>      nsJSContext::PokeGC(JS::gcreason::SET_DOC_SHELL,
> +                        mTopLevelOuterContentWindow || mIsChrome ?

Could you put parens around "mTopLevelOuterContentWindow || mIsChrome" please? I did not remember at first that ? binds less than ||.
Attachment #8825645 - Flags: review?(continuation) → review+
Comment on attachment 8827829 [details] [diff] [review]
part6, trigger ZoneGC after evicting from bfcache

Review of attachment 8827829 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsDocumentViewer.cpp
@@ +2406,5 @@
>  nsDocumentViewer::ClearHistoryEntry()
>  {
> +  if (mDocument) {
> +    nsJSContext::PokeGC(JS::gcreason::PAGE_HIDE,
> +                        mDocument->GetWrapperPreserveColor(),

Will this always have a wrapper?

@@ +2407,5 @@
>  {
> +  if (mDocument) {
> +    nsJSContext::PokeGC(JS::gcreason::PAGE_HIDE,
> +                        mDocument->GetWrapperPreserveColor(),
> +                        NS_GC_DELAY * 2);

Why is this delay here? Do we usually evict something from the BF cache when we're loading another document?
Attachment #8827829 - Flags: review?(continuation) → review+
Comment on attachment 8827997 [details] [diff] [review]
part 7, ensure the zone related to canvas is collected

Review of attachment 8827997 [details] [diff] [review]:
-----------------------------------------------------------------

Neat. Though I think the usage of the JS API here is wrong.

::: dom/canvas/CanvasRenderingContext2D.cpp
@@ +1747,5 @@
>    }
> +
> +  JSObject* wrapper = GetWrapperPreserveColor();
> +  if (wrapper) {
> +    JS::PrepareZoneForGC(JS::GetObjectZone(wrapper));

I think this should use AddZoneWaitingForGC() instead (which is a method on CycleCollectedJSContext). I think you are only supposed to call this method right before you trigger a GC, and I'm not sure what happens if you call it at some other random time. You could ask jonco about it if you want. I could be wrong.
Attachment #8827997 - Flags: review?(continuation) → review-
(In reply to Andrew McCreight [:mccr8] from comment #39)
> Comment on attachment 8827829 [details] [diff] [review]
> part6, trigger ZoneGC after evicting from bfcache
> 
> Review of attachment 8827829 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: layout/base/nsDocumentViewer.cpp
> @@ +2406,5 @@
> >  nsDocumentViewer::ClearHistoryEntry()
> >  {
> > +  if (mDocument) {
> > +    nsJSContext::PokeGC(JS::gcreason::PAGE_HIDE,
> > +                        mDocument->GetWrapperPreserveColor(),
> 
> Will this always have a wrapper?
In general should have (hard to think a case when there isn't), and if there isn't, we really want full gc since
there is possible tons of garbage to collect somewhere.


> > +  if (mDocument) {
> > +    nsJSContext::PokeGC(JS::gcreason::PAGE_HIDE,
> > +                        mDocument->GetWrapperPreserveColor(),
> > +                        NS_GC_DELAY * 2);
> 
> Why is this delay here? Do we usually evict something from the BF cache when
> we're loading another document?
Just to be consistent with the other use of NS_GC_DELAY + PAGE_HIDE
(In reply to Andrew McCreight [:mccr8] from comment #40)

> I think this should use AddZoneWaitingForGC() instead (which is a method on
> CycleCollectedJSContext).
oh, didn't even think JS::PrepareZoneForGC might have such requirement
At least for consistency I could call AddZoneWaitingForGC

> I think you are only supposed to call this method
> right before you trigger a GC, and I'm not sure what happens if you call it
> at some other random time. You could ask jonco about it if you want. I could
> be wrong.

jonco ^
Flags: needinfo?(jcoppeard)
(In reply to Olli Pettay [:smaug] (review request backlog because of a work week) from comment #42)
You *can* call PrepareZoneForGC at any time - it just sets a flag on the zone that gets picked up when a collection is started.  But if an incremental GC is ongoing this can lead to it being reset non-incrementally, which is bad.

Using AddZoneWaitingForGC sounds sensible to avoid this possibility unless you really need that zone collected ASAP.
Flags: needinfo?(jcoppeard)
Attachment #8825645 - Attachment is obsolete: true
Attached patch part7, use AddZoneWaitingForGC (obsolete) — Splinter Review
Attachment #8828290 - Flags: review?(continuation)
Attachment #8827997 - Attachment is obsolete: true
Comment on attachment 8828290 [details] [diff] [review]
part7, use AddZoneWaitingForGC

Review of attachment 8828290 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for working on this.

::: dom/canvas/CanvasRenderingContext2D.cpp
@@ +71,5 @@
>  
>  #include "jsapi.h"
>  #include "jsfriendapi.h"
>  #include "js/Conversions.h"
> +#include "js/GCAPI.h"

Do you still need GCAPI.h?
Attachment #8828290 - Flags: review?(continuation) → review+
I guess I don't. leftover
Attached patch part7.diffSplinter Review
This may compile. Can't test since m-i doesn't compile on Fedora atm.
Attachment #8828290 - Attachment is obsolete: true
I'm about to land this, and whoever is wondering, the part 3 isn't needed anymore.
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e809ddd9c7db
part 1 - Hoist the LOAD_END PokeGC out of nsJSContext::LoadEnd. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/41d7f44db1ee
part 2 - Call PokeGC in nsDocumentViewer::PageHide before the call to OnPageHide. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb7cb3957c0d
part 4 - Try to pass a relevant zone to PokeGC. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/b587547a6df1
trigger full GC only when closing top level outer window , r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b62dda2ebd0
trigger ZoneGC after evicting from bfcache, r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/97a60b61a65a
ensure zone GC collects the zone from which canvas context is originated, r=mccr8
Well, that's quite peculiar. Various leaks, including a backstage pass.
Not totally confident on pinning the blame on these patches, but I also saw this intermittent linux debug mochitest failure pop up around when this landed, and seems to have gone away on the backout: https://treeherder.mozilla.org/logviewer.html#?job_id=71298972&repo=mozilla-inbound
I think Olli's looking into this.
Flags: needinfo?(continuation)
The valgrind issue is weird since I didn't see that in try. And there was V issue still after backing this out.
I'll try to push to try again.
oh, the valgrind leak is there in an older try push :/
So the valgrind leaks seem to be about various slots in JS.
js::SetReservedSlot and the promise issue (bug 1333498) is extended slots etc.

Jon, does that ring any bells? Why would zone GC cause something like this?
Flags: needinfo?(jcoppeard)
(In reply to Olli Pettay [:smaug] from comment #58)
> Jon, does that ring any bells? Why would zone GC cause something like this?

It shouldn't.  It looks like some things are not getting finalized, possibly because they are leaked.

For the build in comment 59 though it looks like it's crashing so I guess leaks after that are to be expected.
Flags: needinfo?(jcoppeard)
Testing if some explicit full GC would help the valgrind issue (which I haven't managed to reproduce locally).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=31d049d11bc8a3becc583c19ff844dc10374cf8c
Doesn't help :/
Even always collecting system zone doesn't help
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffc7f917856c939cbeb5e0c100978e8492d138a3&selectedJob=73236172

I guess I should land changes one by one and see which one starts to cause valgrind issues.
Attachment #8830277 - Attachment is obsolete: true
The c2 failure is hopefully fixed by bug 1335117
Comment on attachment 8832428 [details] [diff] [review]
Full GC before initial CC

The functional changes here are that sHasRunGC (renamed to sHasRunFullGC) is moved to be set to true only after full GC and sNeedsFullGC is set initially to true.
Attachment #8832428 - Flags: review?(jcoppeard)
Attachment #8832428 - Flags: review?(jcoppeard) → review+
Simpler version
Attachment #8832428 - Attachment is obsolete: true
Attachment #8832830 - Flags: review?(jcoppeard)
Comment on attachment 8832830 [details] [diff] [review]
needs_full_gc_simple.diff

Review of attachment 8832830 [details] [diff] [review]:
-----------------------------------------------------------------

That is much simpler :)
Attachment #8832830 - Flags: review?(jcoppeard) → review+
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/551a93b7b471
part 1 - Hoist the LOAD_END PokeGC out of nsJSContext::LoadEnd. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/b39457b9fe3b
part 2 - Call PokeGC in nsDocumentViewer::PageHide before the call to OnPageHide. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/a1bbc6ed6670
part 4 - Try to pass a relevant zone to PokeGC. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb0cdf573e9d
trigger full GC only when closing top level outer window , r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/2aa9841bd6e3
trigger ZoneGC after evicting from bfcache, r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/d17fabbdbe8f
ensure zone GC collects the zone from which canvas context is originated, r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/256d02d844e0
first request GC should be a full GC, r=jonco
Let's see what kinds of issues the changes in this bug cause, but at least GC is behaving the way I expected.
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: