Stop using GetScriptContextFromJSContext in CycleCollectedJSRuntime::UsefulToMergeZones

RESOLVED FIXED in Firefox 49

Status

()

Core
DOM
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: bz, Assigned: bz)

Tracking

Trunk
mozilla49
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox49 fixed)

Details

(Whiteboard: btpp-active)

Attachments

(1 attachment)

We need to rewrite this in a way that works in the one-JSContext-per-runtime world.  

What do we really need here?  A way to enumerate all live inner windows?  Something else?
Created attachment 8757359 [details] [diff] [review]
Stop using GetScriptContextFromJSContext in CycleCollectedJSRuntime::UsefulToMergeZones
Attachment #8757359 - Flags: review?(continuation)
Assignee: nobody → bzbarsky
Status: NEW → ASSIGNED
Comment on attachment 8757359 [details] [diff] [review]
Stop using GetScriptContextFromJSContext in CycleCollectedJSRuntime::UsefulToMergeZones

Review of attachment 8757359 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for fixing this. Maybe this will make it so this optimization actually done something again...
Attachment #8757359 - Flags: review?(continuation) → review+
had to back this out since this seems has caused frequent crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=29006094&repo=mozilla-inbound
Flags: needinfo?(bzbarsky)
Are you sure that was this checkin?

If we posit that it is, the part of the log right before the failing assertion is:

 00:21:46     INFO -  Initializing context 02ADC000 surface 137F2C40 on display 1E2EE580
 00:21:50     INFO -  JavaScript error: , line 0: uncaught exception: out of memory
 00:21:50     INFO -  [GFX3-]: BufferTextureData: Failed to allocate 1000000 bytes
 00:21:50 INFO - Assertion failure: texClient, at c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/gfx/layers/client/CanvasClient.cpp:287 

That "Initializing" printf is from the GLContextEGL::GLContextEGL constructor and the "Failed to allocate" printf is from InitBuffer in BufferTexture.cpp.  

I'll do a bit of poking around to see whether this patch might have increased memory pressure or something, I guess, but the code coming right before that texClient assert has all sorts of fallible allocations and can easily return null on OOM or fragmentation.  So that assert seems pretty bogus to me.  Especially since right after that we null-check things....  Matt, what's the story here?

Also, is it expected that we run some JS in there between GLContextEGL() and InitBuffer() and possibly end up with an OOM exception?
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(cbook)
(In reply to Boris Zbarsky [:bz] from comment #7)
> Are you sure that was this checkin?
> 

Hi Boris,

yes it was this checkin, also ms2ger took a look too and recommend this checkin to backout (thanks again for looking into this).
Flags: needinfo?(cbook)
This bug was not backed out for this:

(In reply to Carsten Book [:Tomcat] from comment #5)
> had to back this out since this seems has caused frequent crashes like
> https://treeherder.mozilla.org/logviewer.html#?job_id=29006094&repo=mozilla-
> inbound

but for crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=28923749&repo=mozilla-inbound#L4314 in XPCJSRuntime::UsefulToMergeZones

... though now I'm not sure these actually happened after the followup push.
> but for crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=28923749&repo=mozilla-inbound#L4314

OK, that's at least relevant to this bug!  ;)

That crash sure looks like a crash on null reflector to me, which the followup should have fixed.  So _did_ any of them happen after the followup?
Flags: needinfo?(bzbarsky) → needinfo?(cbook)
Whiteboard: btpp-active
(In reply to Boris Zbarsky [:bz] from comment #7)

> I'll do a bit of poking around to see whether this patch might have
> increased memory pressure or something, I guess, but the code coming right
> before that texClient assert has all sorts of fallible allocations and can
> easily return null on OOM or fragmentation.  So that assert seems pretty
> bogus to me.  Especially since right after that we null-check things.... 
> Matt, what's the story here?

I agree, that assertion is totally bogus.

Do we know what got backed out for causing those assertions? Tomcat?
Flags: needinfo?(matt.woodrow)
Matt, do you mean something like https://bugzilla.mozilla.org/show_bug.cgi?id=1276112#c6
Flags: needinfo?(cbook)
I think there are two questions that were needinfo'd there:

1)  My question: Were there any crashes in XPCJSRuntime::UsefulToMergeZones between my second checkin for this bug and the backout?

2)  Matt's question: was there some other bug that was backed out for assertion failures in CanvasClient.cpp, since that's not what this bug was backed out for?
Flags: needinfo?(cbook)
so maybe https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=m(gl)&fromchange=fcedd1b3ef255d38681ab5e2745e45eaea926fb6&tochange=add719ee5b2f0f3475f2aa5f7c4cbdb861b186f7&selectedJob=29006080 helps a little.

after this changes in that changeset we got more and more crashes in gl tests on windows - not permanent but very frequent. 

like https://treeherder.mozilla.org/logviewer.html#?job_id=29003962&repo=mozilla-inbound#L20686

I did retrigger of testruns yesterday till back to friday that pointed to the bug here. After the backout this crashes were gone and a merge possible (which was blocked by this frequent gl crashes) and since then the gl tests are fine.

For the second question, i don't know but i didn't backed out something else for the gl issues.
Flags: needinfo?(cbook)
OK, that link is _way_ more useful.

It's showing reasonably common OOM on the gl tests on WinXP.  That's causing failures on opt _and_ debug, so the bogus assertion is not the only issue here.  I guess I get to do some Windows try runs to try to get more information....
OK so, I did a try run with and without the patch in this bug.  The OOM crashes are _definitely_ caused by this patch.

Comparing a typical green run without this patch to a green run with this patch, I see 80 calls to UsefulToMergeZones without this patch and 42 with the patch.  I also see this function return true once without this patch and 40 times with the patch.

OK, so why does returning true a lot more from this function lead to OOMs?  ;)
Flags: needinfo?(continuation)
Oh, and I can get try to be nicely green by skipping inners whose outers are detached from docshell, by checking for null window->GetScriptContext() and continuing the loop if it's null.  That probably more closely matches the old behavior anyway... Andrew, thoughts on just doing that and filing a followup to get the problems when this function _does_ return true more often sorted out?
(In reply to Boris Zbarsky [:bz] from comment #17)
> Andrew, thoughts on just doing that and filing a
> followup to get the problems when this function _does_ return true more
> often sorted out?

This function should only return true for windows that we think are likely entirely garbage collectible by the CC. If a window isn't detached from a docshell, does the docshell have a owning reference to the window? If it does, then we don't want to merge it.
Flags: needinfo?(continuation)
> If a window isn't detached from a docshell, does the docshell have a owning reference to the window?

The docshell has an owning reference to the outer window.  The outer window does have an owning reference to its current inner, I believe....

> If it does, then we don't want to merge it.

So...  The code before this patch only merges things that are NOT detached from docshells (if it ever merges anything).  The code after this patch that starts merging things that ARE detached from docshells causes those OOMs on the gl tests.  If I change the new code to also only merge things that are NOT detached from docshells, then things go green again.
Flags: needinfo?(continuation)
Well, it sounds like this is totally busted then. Just make UsefulToMergeZones return false all the time (with some kind of comment about how doing it the right way causes OOMs, referring to this bug), and I'll either rip out this whole mechanism or figure out why your original patches cause OOMs.

The merging is probably causing GL stuff to not be torn down quickly enough, causing an OOM. It ties the lifetime of everything in compartment together when it returns true.
Flags: needinfo?(continuation)
Confirmed that simply always returning false also makes the tree green.
Bug 1277036 filed for the followup work.

Comment 23

2 years ago
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/160a0b1ffd2b
Stop using GetScriptContextFromJSContext in CycleCollectedJSRuntime::UsefulToMergeZones.  r=mccr8

Comment 24

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/160a0b1ffd2b
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox49: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in before you can comment on or make changes to this bug.