Closed Bug 1542461 Opened 6 months ago Closed 5 months ago

Assertion failure: cache->PreservingWrapper(), at src/dom/bindings/DOMJSProxyHandler.cpp:86

Categories

(Core :: DOM: Bindings (WebIDL), defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: bzbarsky, Assigned: bzbarsky)

References

Details

Attachments

(2 files, 1 obsolete file)

This is to track the various intermittents that got blamed on bug 1513603 but had nothing to do with it.

I reproduced under rr pretty straightforwardly like so:

  mach mochitest --headless --debugger=rr --debugger-args="-M record -h" --run-until-failure browser/components/extensions/test/browser/browser_ext_addon_debugging_netmonitor.js

in a build from rev a5a5ddfb5178 with the fix for bug 1498287 applied on top and MOZ_BROWSER_XHTML=1 in the mozconfig.

Just for my reference, this is the "firefox-147" rr run I have.

We hit the assert because of the following sequence of events:

  1. We unlink an HTMLDocument.
  2. This unlinks the Document superclass
  3. This calls nsINode::Unlink
  4. That calls ReleaseWrapper, so we are no longer preserving the wrapper.
  5. A bit later we try to get the expando object for that document, with this stack:
#2  0x000066295a798e24 in mozilla::dom::DOMProxyHandler::GetExpandoObject(JSObject*) (obj=0x1b85444c48b0)
    at /home/bzbarsky/mozilla/dom-work/mozilla/dom/bindings/DOMJSProxyHandler.cpp:299
#3  0x000066295a798b85 in mozilla::dom::DOMProxyShadows(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::PropertyKey>) (cx=0x7fe536025000, proxy=..., id=...)
    at /home/bzbarsky/mozilla/dom-work/mozilla/dom/bindings/DOMJSProxyHandler.cpp:33
#4  0x000066295fbb74ae in GetProxyStubType(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::PropertyKey>)
    (cx=0x7fe536025000, obj=..., id=...) at /home/bzbarsky/mozilla/dom-work/mozilla/js/src/jit/CacheIR.cpp:183
#5  0x000066295fbb50a0 in js::jit::GetPropIRGenerator::tryAttachProxy(JS::Handle<JSObject*>, js::jit::ObjOperandId, JS::Handle<JS::PropertyKey>) (this=0x7ffc65020888, obj=..., objId=..., id=...)
    at /home/bzbarsky/mozilla/dom-work/mozilla/js/src/jit/CacheIR.cpp:1654
#6  0x000066295fbb1c6b in js::jit::GetPropIRGenerator::tryAttachStub() (this=0x7ffc65020888)
    at /home/bzbarsky/mozilla/dom-work/mozilla/js/src/jit/CacheIR.cpp:289
#7  0x000066295f9ced0b in js::jit::DoGetPropFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICGetProp_Fallback*, JS::MutableHandle<JS::Value>, JS::MutableHandle<JS::Value>)
    (cx=0x7fe536025000, frame=0x7ffc65020b00, stub=0x7fe52c8ab480, val=..., res=...)
    at /home/bzbarsky/mozilla/dom-work/mozilla/js/src/jit/BaselineIC.cpp:2744

and that hits the assert. Above that on the stack is jitcode, looks like; rr doesn't really give me a useful stack after that point.

Now the interesting thing is that in DoGetPropFallback "val" is the same HTMLDocument as we earlier unlinked. Which means we unlinked it while it was still reachable from script!

The property we are trying to get is "ownerGlobal".

The JS stack looks like this:

0 webExtensionTargetPrototype._shouldAddNewGlobalAsDebuggee(newGlobal = [object Object] ... ) ["resource://devtools/server/actors/targets/webextension.js":356:8]
    this = [Actor webExtensionTarget/server1.conn0.child17/webExtensionTarget1]
1 ([object Object], .... ) ["self-hosted":1007:16]
2 filter(callbackfn = [function]) ["self-hosted":318:27]
    this = [object Object], ...

where the ... bits are lots of [object Object].

The global of the unlinked HTMLDocument is an nsGlobalWindowInner which is unlinked (so it has a null mDoc, for example, which is why the document got unlinked).

Analysis coming up in a separate comment, so I can cc people and have them just see that and not all the data above.

OK, so what happens here is:

  1. We have a document and a window.
  2. They are no longer referenced by anything and get unlinked.
  3. nsGlobalWindowInner's unlink nulls out mDoc but does not call ClearDocumentDependentSlots, not least because it has no useful JSContext at this point. We could conceivably fix that, I think, if we inited an AutoJSAPI with the window right here...
  4. At this point the JS objects for the document and window are no longer kept alive by Gecko or referenced by SpiderMonkey in any way, so can get GCed.
  5. Before they get a chance to be GCed, the code in devtools/server/actors/targets/webextension.js runs and does dbg.findAllGlobals().filter(this._shouldAddNewGlobalAsDebuggee) which ends up digging up the global that no one is supposed to have reference to anymore and handing it out to script again.
  6. The script checks that the global is a Window.
  7. The script checks whether ChromeUtils.getClassName(global.document) is "XULDocument" and if it is does nothing. This works, because the .document getter on Window reads from the slot, and we didn't clear the slot in step 3. If we made the change I mention in item 3, this would end up throwing an exception (which the script totally does not expect).
  8. The script tries to get the .ownerGlobal property on the global.document and we hit this assertion.

So a few issues here, apart from item 3:

  • We are effectively resurrecting dying objects via findAllGlobals(). Jim, is there something we can do to make it not return globals that are sufficiently-torn-down on the Gecko side? I assume the iterators used ensure that we can't end up returning an object that tests true for EdgeNeedsSweepUnbarriered() (so is already GCed but is waiting to be finalized) here, or at least I really hope so. We could add a flag on the Realm or something, I suppose...

  • Brendan, it's possible that step 6 above is why MOZ_BROWSER_XHTML=1 might be triggering this sort of assertion in your try runs. Maybe. The stack there is kinda different (interpreter vs JIT for one thing), but could conceivably be the same property get in the end. In any case, in the MOZ_BROWSER_XHTML=1 world step 6 above seems like it needs updating no matter what, to test something that's still relevant, whatever that is.

Flags: needinfo?(jimb)
Flags: needinfo?(bdahl)

Alternately, I suppose we could add something on Window that this code could check and bail out. But I'd really prefer it if we didn't have a way to hand a Window that's been unlinked to script at all.

I don't see that findAllGlobals does check the health of the global. It iterates over the runtime's zones, and each zone's realms, and then just grabs the zones' globals.

Flags: needinfo?(jimb)

I don't really understand the GC rules here. But, Boris, could you see if this assertion triggers when you reproduce the bug?

Perhaps we should be skipping realms for which hasLiveGlobal returns false.

Assignee: nobody → jimb

I can try that, sure. That said, even if the global is so-far-live (in the sense that it hasn't been GCed yet, so that assertion does not fire), it's not live in the sense that matters. If we had a unified GC between SpiderMonkey and Gecko, this situation could not arise. But we don't. What we have a is a cycle collector, which breaks cycles by not marking the JS object as "kept alive by Gecko anymore" and dropping various inside-Gecko references. Then we just wait for the JS GC to collect the JS object. In theory this should all work because the JS engine promised us that nothing was referencing the JS object, so the fact that we're basically ripping out its guts and then letting it die sometime later should be fine: no one should be able to observe it in its no-guts-but-not-dead state.

Except this debugger API hands it out to script in just such a state...

Is there any test we can apply to a global to determine whether we've started tearing it down? Checking that in findAllGlobals would be a quick fix.

A better fix would be to replace findAllGlobals with an interface whose behavior wasn't sensitive to GC timing, and didn't go around the GC's back in such a weird way. For example, maybe an API that devtools could hand a callback saying, "please apply this callback to every new global we're about to try to execute JavaScript in - only one call per global per hook." Then the devtools would have a chance to add every global as a debuggee before any code runs in it, but globals on their way out would never enter the picture, simply because we'd never try to use them again. (I think there's a good way to implement that notification that works efficiently with many globals and multiple debuggers.)

That said, even if the global is so-far-live (in the sense that it hasn't been GCed yet, so that assertion does not fire), it's not live in the sense that matters.

I understand this to mean that assertion my patch adds would not fire when the bug occurs, because the global, while unsafe to use, is still live in the sense the assertion checks. Is that correct?

Flags: needinfo?(bzbarsky)
See Also: → 1542574

So just to be clear, comment 5 is based on theory and my understanding of how things work, not experiment. The experimental evidence I have so far is that the one time I have gotten the new assert to trigger "global" was null. I have moved the assert into the "if (global)" block and am running again to see whether I can reproduce it in that case.

I'm not sure there's an existing test we can apply to the global. Hence my proposal in comment 1 to maybe add a flag on the Realm to indicate that state...

OK, I have now managed to reproduce the PreservingWrapper() assertion with the "assert if non-live global" patch applied. Which means the global was live in terms of the IsAboutToBeFinalizedUnbarriered() bits. in hasLiveGlobal. I stepped back to the Debugger::findAllGlobals code when it added that particular global to the list, and at that point the window has already been unlinked by the cycle collector, but is live as far as the JS engine is concerned.

So in this specific run, comment 7 holds: the assertion about hasLiveGlobal() is not firing, but the Gecko-side assertion is.

Flags: needinfo?(bzbarsky)

If there aren't too many places where the destruction process is set in motion, then a flag on the JS::Realm would be fine.

There's precisely one place that's relevant: nsGlobalWindowInner's unlink method. If we had a flag that we can set, we could set it there and it would work fine, I think.

Assignee: jimb → bzbarsky

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #1)

  1. The script checks whether ChromeUtils.getClassName(global.document) is "XULDocument" and if it is does nothing. This works, because the .document getter on Window reads from the slot, and we didn't clear the slot in step 3. If we made the change I mention in item 3, this would end up throwing an exception (which the script totally does not expect).
  • Brendan, it's possible that step 6 above is why MOZ_BROWSER_XHTML=1 might be triggering this sort of assertion in your try runs. Maybe. The stack there is kinda different (interpreter vs JIT for one thing), but could conceivably be the same property get in the end. In any case, in the MOZ_BROWSER_XHTML=1 world step 6 above seems like it needs updating no matter what, to test something that's still relevant, whatever that is.

That sounds very likely - we've had to deal with a bunch of other issues where code was expecting a XUL document (tracked in blockers on Bug 1453783) and we probably just missed this one. I'm not sure what the right check is here. Assuming we can access the principal without triggering the same error we could check for system principal instead of XUL docs which should fix it more generally than just checking AppConstants.BROWSER_CHROME_URL.

After the changes I'm proposing in this bug happen, you will no longer get torn-down globals here, so you should be able to access the principal pretty easily: just get nodePrincipal on the document. That already happens in the isExtensionWindow() calls.

That said, I don't know what the code here is actually trying to accomplish and whether filtering out all system-principal things would make sense.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #14)

After the changes I'm proposing in this bug happen, you will no longer get torn-down globals here, so you should be able to access the principal pretty easily: just get nodePrincipal on the document. That already happens in the isExtensionWindow() calls.

That said, I don't know what the code here is actually trying to accomplish and whether filtering out all system-principal things would make sense.

Luca, in webExtensionTargetPrototype._shouldAddNewGlobalAsDebuggee we have:

    // Filter out any global which contains a XUL document.
    if (ChromeUtils.getClassName(global.document) == "XULDocument") {
      return false;
    }

But as we convert more chrome documents to HTML (i.e. browser.xhtml instead of browser.xul) this check won't work anymore. AFAICT this function will still end up returning false in this case due to this.isExtensionWindowDescendent(global.document.ownerGlobal) later on for these documents, but I'm wondering if we should either remove the "XULDocument" check or change it to check for system principal. I'm not sure if there are gotchas with doing the latter (like, extension windows that are also system principal). Do you have an opinion on what to do here?

Flags: needinfo?(lgreco)

(In reply to Brian Grinstead [:bgrins] from comment #15)

But as we convert more chrome documents to HTML (i.e. browser.xhtml instead of browser.xul) this check won't work anymore. AFAICT this function will still end up returning false in this case due to this.isExtensionWindowDescendent(global.document.ownerGlobal) later on for these documents, but I'm wondering if we should either remove the "XULDocument" check or change it to check for system principal. I'm not sure if there are gotchas with doing the latter (like, extension windows that are also system principal). Do you have an opinion on what to do here?

I've been trying to remember if there were any other reasons for that check, but from what I recall it was meant to just exit earlier (without further checks) on a global related to a XUL document.

As you pointed out there are also the two checks that follows it which are still going to filter out any of those documents, and so it seems to me that we could just remove the explicit check for the XUL documents.

Flags: needinfo?(lgreco)

As we convert more chrome documents away from XUL we end up running
through two different paths in this function. These are going to be
filtered out in later checks anyway, so this change removes the early return.

Blocks: 1542843

Comment on attachment 9056638 [details]
Bug 1542461 - Don't special case XUL documents in webExtensionTargetPrototype._shouldAddNewGlobalAsDebuggee;r=rpl

Revision D26574 was moved to bug 1542843. Setting attachment 9056638 [details] to obsolete.

Attachment #9056638 - Attachment is obsolete: true
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1b7c56688127
Flag realms of windows as dead once we unlink their nsGlobalWindowInner, so debugger's findAllGlobals won't hand them out to script.  r=jandem
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Flags: needinfo?(bdahl)
You need to log in before you can comment on or make changes to this bug.