Closed Bug 654015 Opened 13 years ago Closed 13 years ago

"ASSERTION: Requesting refresh driver after document has disconnected"

Categories

(Core :: SVG, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla6
Tracking Status
firefox5 + fixed
firefox6 --- fixed
blocking2.0 --- -
status2.0 --- wanted
status1.9.2 --- unaffected
status1.9.1 --- unaffected

People

(Reporter: jruderman, Assigned: dholbert)

References

Details

(Keywords: assertion, testcase, Whiteboard: [sg:critical?])

Attachments

(8 files, 3 obsolete files)

615 bytes, text/html
Details
3.33 KB, text/plain
Details
439 bytes, text/html
Details
31.19 KB, text/plain
Details
358 bytes, text/html
Details
983 bytes, patch
bzbarsky
: review+
Details | Diff | Splinter Review
1.08 KB, patch
bzbarsky
: review+
Details | Diff | Splinter Review
4.25 KB, patch
bzbarsky
: review+
Details | Diff | Splinter Review
1. Start Firefox (with a new profile?)
2. Load the testcase
3. Click the testcase's button
4. Quit Firefox (or force GC?)

Result: assert during shutdown

###!!! ASSERTION: Requesting refresh driver after document has disconnected!: 'Error', file  content/smil/nsSMILAnimationController.cpp, line 840

Based on hg blame for the assertion, I suspect this might be a regression from bug 653238.
(In reply to comment #0)
> testcase (makes Firefox crash during a subsequent GC)
[...]
> Result: assert during shutdown

Jesse: to be clear, does this actually crash for you, or just assert?

In my debug build, I can repro the assert-at-shutdown and (if I don't quit) the assert-after-a-few-seconds (at a GC), but haven't gotten this to crash yet.

> Based on hg blame for the assertion, I suspect this might be a regression from
> bug 653238.

It's probably more likely that we already had some form of this problem, and we're just catching this now due to the additional sanity-checking & protections added in bug 653238.
Attachment #529377 - Attachment description: testcase (makes Firefox crash during a subsequent GC) → testcase (makes Firefox assert during a subsequent GC)
Sorry, it just asserts. I got confused because it was reduced from a testcase that aborted (before shutdown). This is the abort:

###!!! ABORT: Clearing window pointer while animations are unpaused: 'aScriptGlobalObject || !mAnimationController || mAnimationController->IsPausedByType( nsSMILTimeContainer::PAUSE_PAGEHIDE | nsSMILTimeContainer::PAUSE_BEGIN)', file /builds/slave/cen-osx64-dbg/build/content/base/src/nsDocument.cpp, line 3766
OS: Mac OS X → All
Hardware: x86_64 → All
So, the piece of my logic in bug 653238 that we're violating here is this:
> However, during document teardown, we'll always(?) hit nsDocument::OnPageHide,
> which calls mAnimationController->OnPageHide()

In this case, we have two nested animationControllers -- one for the outer <svg>, and one for the inner <svg>.  It looks like we're only getting an OnPageHide call for the outer one.

I think we probably want nsSMILAnimationController::OnPageHide to propagate that call to all of its children.
Actually I don't think that's the problem, sorry. Digging more...
So the issue here is that we're getting calls in this order (using testcase 2):
 (1) nsDocument::OnPageShow
 (2) nsDocument::OnPageHide
 (3) nsDocument::OnPageShow
 (4) ***FOR A DIFFERENT nsDocument object: nsDocument::OnPageHide
 (5) ~nsDocument (which calls nsSMILAnimationController::Disconnect)

steps 1,2,3, and 5 are all for the same nsHTMLDocument object.
step 4 is for *another* nsHTMLDocument object.

I've attached stack traces of this series of calls (copied directly from a log with GDB set up to print a backtrace & continue at each of these points).

As noted in the quoted chunk in Comment 5, our SMIL code currently assumes that (after receiving a call to OnPageShow) we'll always receive a call to OnPageHide before the document dies.  The testcases here make us violate this assumption.

I'm not sure if the assumption is bad, or if the assumption *should* hold but our document-teardown code is broken.

bz, smaug -- is it busted here that we're getting "OnPageShow" but never a corresponding "OnPageHide", for a particular document?
Note that these...
> (2) nsDocument::OnPageHide
> (3) nsDocument::OnPageShow
...are both inside of a call to "nsDocShell::LoadHistoryEntry", which presumably corresponds to "w.history.back();" in the testcase.
...and the mysterious different document from (4) actually gets *created* between steps (2) and (3), apparently as a transitory "about:blank" viewer document. (in nsDocShell::CreateAboutBlankContentViewer / nsContentDLF::CreateBlankDocument)
So, the most straightforward way to fix this would be to have nsSMILAnimationController::Disconnect() make a call to nsSMILAnimationController::OnPageHide(), for good measure in case we never received an OnPageHide notification from our document.

That'd only make sense if we think it's kosher for Documents to go away without sending us that OnPageHide notification, though, which I'm not sure about per the end of comment 7.
Hmm.  Do we fire a pagehide if the iframe a document is in is removed from the DOM?
Attached file iframe removal test
(In reply to comment #11)
> Hmm.  Do we fire a pagehide if the iframe a document is in is removed from the
> DOM?

Yes - if I put a breakpoint at nsDocument::OnPageHide, I get one hit when I click the "remove iframe" button in this testcase.  (I think that's what you were asking - let me know if I misunderstood)
That's what I meant, yes.  So I wonder how we manage to get a ~nsDocument without a pagehide on the same document....
So the unbalanced OnPageShow is caused by this counterintuitive stack (summarized from the last OnPageShow stack in comment 7's attachment):
 #0  nsDocument::OnPageShow
 #1  DocumentViewerImpl::LoadComplete
 #2  nsDocShell::EndPageLoad
 #3  nsDocShell::OnStateChange
 #4  nsDocLoader::FireOnStateChange
[...]
 #8  nsLoadGroup::RemoveRequest
 #9  nsDocument::SetScriptGlobalObject (setting it to nsnull)
 #10 DocumentViewerImpl::Close
 #11 nsDocShell::SetupNewViewer

So we start out by trying to close our document (levels 11 & 10), which makes us cancel its load groups, which fires a notification that makes us think we're now ready to *show* our document.  That seems wrong to me.

(Note that the nsDocShell at level 11 is the same as the one at levels 2 & 3, and the nsDocument at level 9 is the same as the one at level 0)
Aha -- so I think we *expect* that final OnPageHide to happen at nsDocShell tear-down time.

In particular -- nsDocShell::Destroy calls FirePageHideNotification which calls mContentViewer->PageHide().

However, in this case, the history option *replaces our mContentViewer* with a stub one pointing at an "about:blank" document.  (from "CreateAboutBlankContentViewer" mentioned above)  So nsDocShell::Destroy sends PageHide to *that* document, not our old one.

If I comment out the call to CreateAboutBlankContentViewer and just behave as if it failed, then everything works out fine.  So I think CreateAboutBlankContentViewer is basically killing us here.  In this case, it's only called for javascript: URLs, which is probably why we haven't hit issues from this more often.
http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp?mark=9981-9983,9992-9992#9981
CreateAboutBlankContentViewer even contains a call to  "FirePageHideNotification" just like the one in nsDocShell::Destroy.  So I think it's intending to clean up after itself, and I think that's supposed to be the last word in pagehide-vs.-pageshow status.

The problem is that we're accidentally triggering one more "bonus" OnPageShow *after* that page-hide, via the counterintuitive stack from comment 14.
> That seems wrong to me.

In general, firing pageshow when stopping the document load is correct; think the user hitting the stop button.

It sounds like the sequence of events here is that we have a document that's in the middle of loading and has already embedded its content viewer but NOT yet fired pageshow.  Then CreateAboutBlankContentViewer is called, which fires pagehide on the old viewer, but doesn't actually stop the old load.  Then it embeds a new viewer which is what kills off the old viewer and stops the load, right?

It'd make sense to stop the old viewer's loadgroup in CreateAboutBlankContentViewer before calling PageHide on it, perhaps...  Would that fix the issue here?
(In reply to comment #17)
> > That seems wrong to me.
> 
> In general, firing pageshow when stopping the document load is correct; think
> the user hitting the stop button.

Right -- that part makes sense. I meant more that firing pageshow when *closing* the document seems broken.

> It sounds like the sequence of events here is that we have a document that's in
> the middle of loading and has already embedded its content viewer but NOT yet
> fired pageshow.  Then CreateAboutBlankContentViewer is called, which fires
> pagehide on the old viewer, but doesn't actually stop the old load. Then it
> embeds a new viewer which is what kills off the old viewer and stops the load,
> right?

Yes, I think so.

> It'd make sense to stop the old viewer's loadgroup in
> CreateAboutBlankContentViewer before calling PageHide on it, perhaps...  Would
> that fix the issue here?

That sounds like it might work. I've got an alternate patch that's along the same lines, so I'll post that first and then try what you suggested.
This patch just cleans up the existing mContentViewer value inside of CreateAboutBlankContentViewer, before we've called e.g. "Embed" (which triggers the unwanted OnPageHide call).

I'm copying the existing mContentViewer cleanup code from nsDocShell::Destroy.  (and I'm only doing it if !mSavingOldViewer, though I'm not sure if I actually need that check -- I don't know the conditions under which mSavingOldViewer would be set.)

I've confirmed locally that this fixes the problem, and it made some intuitive sense as a first shot, but I haven't sanity-checked it across all of our testsuites or anything.
...and here's the approach that bz suggested, which I've confirmed also works.  

Not obsoleting my previous patch yet until I've gotten bz's feedback on which approach is better (in case there's some reason we'd actually want to clean up the contentviewer more directly / thoroughly.)
Attachment #529617 - Attachment description: first attempt: ditch old viewer, like in nsDocShell::Destroy → patch 1, first attempt: ditch old viewer, like in nsDocShell::Destroy
Regardless of which tactic we take on the fix, we should take this assertion to catch this sort of issue slightly earlier, too.

(This basically just asserts that we get a notification from nsDocument::OnPageHide before getting a notification from ~nsDocument, without any "OnPageShow" calls in between.)
Attachment #529624 - Attachment description: patch 2: catch this slightly earlier in SMIL code → (not-quite-ready version of patch 2)
(forgot to qref)
Assignee: nobody → dholbert
Attachment #529624 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #529625 - Flags: review?(bzbarsky)
Attachment #529620 - Flags: review?(bzbarsky)
Attachment #529617 - Flags: feedback?(bzbarsky)
FWIW, I pushed patch 1 second attempt & patch 2 to tryserver for sanity-checking.  It passed, modulo known-randomorange:
http://tbpl.mozilla.org/?tree=Try&rev=ee49290e4bce
blocking2.0: --- → -
status2.0: --- → wanted
Whiteboard: [sg:critical?]
I'd told dveditz that this bug was likely security-sensitive and probably exploitable on 4.0 & Aurora, since the assertion-failure seems to indicate that we were trying to use nsSMILAnimationController::mDocument after it was deleted.  (hence all the flags he just added)

However, it turns out it's not quite as bad as I'd thought. I just now noticed that this assertion is firing *while we're in* the nsDocument destructor, not *after* the destructor, as shown from the stack in attachment 529378 [details].  So  mDocument isn't a deleted pointer yet after all, and nothing bad actually happens. (I confirmed this by placing a breakpoint in nsSMILAnimationController::RemoveChild in mozilla-aurora).

So I'm not so sure that this could be exploited. This might be better-classified as "sg:high".

Still, it seams reasonable to suppose that a yet-more-evil testcase might be able to leave us registered with a nsRefreshDriver after our document dies, due to the unbalanced PageShow call noted in Comment 7, and then we'd potentially outlive the document and deref its pointer after it's destroyed.  We don't have such a testcase, but it seems possible.

So anyway -- this isn't as scary as I'd thought, but I still think it (and bug  653238) are worth backporting.
bz - can we get an ETA on review for these patches? This bug is on the tracking-firefox5 list (though maybe comment 24 means it doesn't really need to be, anymore). If it does need to be tracked against ff5, then we need to figure out how to move it forward. If it doesn't, Daniel, can you clear the + (and maybe just ask for approval when there's a patch to land)?
I will review these tonight.  Sorry for the lag on them.  :(
Comment on attachment 529617 [details] [diff] [review]
patch 1, first attempt: ditch old viewer, like in nsDocShell::Destroy

So this worries me.  I'd rather not do this except as a last resort.
Attachment #529617 - Flags: feedback?(bzbarsky) → feedback-
Comment on attachment 529620 [details] [diff] [review]
patch 1, second attempt: Stop() after firing onpagehide

I'm a little surprised this worked.  Shouldn't the Stop() call come _before_ FirePageHideNotification?
Comment on attachment 529625 [details] [diff] [review]
patch 2: catch this slightly earlier in SMIL code

r=me
Attachment #529625 - Flags: review?(bzbarsky) → review+
Also, can we add a test for this?
(In reply to comment #28)
> I'm a little surprised this worked.  Shouldn't the Stop() call come _before_
> FirePageHideNotification?

Possibly -- I don't know the significance of that particular ordering there.  I can try the stop-before-hide ordering and see if it also works.  (I see now that you'd suggested stop-before-hide at the end of comment 17 -- sorry for missing that.)

Anyway, the important thing from my perspective was to have Stop() come before the call to *Embed()* (30 lines below), because *that's* the call that currently triggers the unwanted OnPageShow() (in the third backtrace of attachment 529547 [details]), if we haven't stopped our load group.
I would expect pageshow to fire when Stop() is called (is that not happening?).  So with the ordering in the patch we're doing a pagehide and then should be doing a pageshow.  Which is exactly what we want to avoid.
(In reply to comment #32)
> I would expect pageshow to fire when Stop() is called (is that not
> happening?).

The PageShow is not happening.  No calls are made to nsDocument::OnPageShow during the Stop() call in my current patch.

To see why, I put a breakpoint in DocumentViewerImpl::LoadComplete (the caller of the unwanted OnPageShow call in attachment 529547 [details]).  It looks like when my patch's Stop() call hits that function, we skip over its OnPageShow call here:
> 1061       if (docShell && NS_SUCCEEDED(docShell->GetIsInUnload(&isInUnload)) &&
> 1062           !isInUnload) {
> 1063         mDocument->OnPageShow(restoring, nsnull);
> 1064       }
http://mxr.mozilla.org/mozilla-central/source/layout/base/nsDocumentViewer.cpp#1061

In this case, the variable isInUnload is PR_TRUE, so we skip over OnPageShow.  That bool is just a proxy for nsDocShell::mFiredUnloadEvent, which gets set to true inside of FirePageHideNotification().

So basically, FirePageHideNotification sets a flag that keeps us from triggering the OnPageShow call.
(In reply to comment #33)
> So basically, FirePageHideNotification sets a flag that keeps us from
> triggering the OnPageShow call.

So, as one might expect -- if I follow bz's suggestion and move Stop() before FirePageHideNotification(), then I *do* get an OnPageShow() call inside of Stop.  (That's ok, though, because for the purposes of this bug, the subsequent FirePageHideNotification call undoes its effects.)

So either ordering is fine for fixing this bug, AFAICT.

I get the impression that Stop()ing first (as bz suggests) is saner & less fragile.  Should I go with that, then?
Attachment #529617 - Attachment is obsolete: true
I think I'd prefer that, yes.  Thanks for figuring out why the other was working too!  Man, this code is side-effecty.  :(
(In reply to comment #30)
> Also, can we add a test for this?

Not directly, I think.  Such a test would need to...
 (a) be able to automate a call to window.open()
 (b) turn tinderbox orange if this bug's assertion fails

...and I don't think we have any testsuites that satisfy both of those conditions.

I tried running the testcases here as a crashtest, but the onload-triggered window.open() call fails to run (due to popup blocking I think).

I also wrote a mochitest version, and *there*, window.open() works (and triggers the assertion in an un-patched debug build), but that doesn't do us a lot of good since the mochitest harness ignores assertion failures. :)

I'm thinking I'll stick with the mochitest route, though, and then possibly convert the original NS_ASSERTION here into a NS_ABORT_IF_FALSE, because that will definitely turn tinderbox orange when it fails. :)  How does that sound?
(note that the s/NS_ASSERTION/NS_ABORT_IF_FALSE/ that I've just suggested for test-purposes shouldn't be a big deal -- that is, I'm not expecting the NS_ASSERTION to be triggered under normal conditions anymore -- because the other new NS_ABORT_IF_FALSE added in patch 2 should now always fire before this bug's assertion fires anyway.)
(In reply to comment #35)
> I think I'd prefer that, yes.  Thanks for figuring out why the other was
> working too!  Man, this code is side-effecty.  :(

Sounds good -- here's the updated patch.
Attachment #529620 - Attachment is obsolete: true
Attachment #529620 - Flags: review?(bzbarsky)
Attachment #531822 - Flags: review?(bzbarsky)
Here's a patch with both of this bug's testcases converted into mochitests, and with this bug's original failing assertion upgraded to NS_ABORT_IF_FALSE so that the mochitests can fail.

I've confirmed that these fail (abort) before this bug's fix, and they pass after this bug's fix.
Attachment #531824 - Flags: review?(bzbarsky)
Comment on attachment 531822 [details] [diff] [review]
patch 1, third attempt: Stop() *before* firing onpagehide

r=me.
Attachment #531822 - Flags: review?(bzbarsky) → review+
Comment on attachment 531824 [details] [diff] [review]
patch 3: crashtests (as mochitests) plus assertion upgrade

r=me
Attachment #531824 - Flags: review?(bzbarsky) → review+
Landed patches 1 & 2 on cedar (m-c is a bit backed up and colorful right now):
  http://hg.mozilla.org/projects/cedar/rev/2f53d4782ce3
  http://hg.mozilla.org/projects/cedar/rev/2f4c320b5c33

Holding off on landing tests (patch 3) until fix is released to users.
Flags: in-testsuite?
Whiteboard: [sg:critical?] → [sg:critical?] fixed-in-cedar
(CC'ing volkmar in case he does the next cedar-->mc merge & needs access to this bug to close it out)
Comment on attachment 531822 [details] [diff] [review]
patch 1, third attempt: Stop() *before* firing onpagehide

Requesting approval to land on aurora (together with bug 653238).
Attachment #531822 - Flags: approval-mozilla-aurora?
Attachment #529625 - Flags: approval-mozilla-aurora?
Pushed:
http://hg.mozilla.org/mozilla-central/rev/2f53d4782ce3
http://hg.mozilla.org/mozilla-central/rev/2f4c320b5c33
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [sg:critical?] fixed-in-cedar → [sg:critical?]
Target Milestone: --- → mozilla6
Attachment #529625 - Flags: approval-mozilla-beta?
Attachment #531822 - Flags: approval-mozilla-beta?
Whiteboard: [sg:critical?] → [sg:critical?] wants to land on beta with 653238
Attachment #529625 - Flags: approval-mozilla-aurora?
Attachment #531822 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Attachment #531822 - Flags: approval-mozilla-aurora-
Attachment #529625 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #531822 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #529625 - Flags: approval-mozilla-aurora+
Comment on attachment 531822 [details] [diff] [review]
patch 1, third attempt: Stop() *before* firing onpagehide

Please land these changes on both Aurora and Beta. (In the future, getting changes in during Aurora will save you this extra step.)
Attachment #531822 - Flags: approval-mozilla-aurora+
Whiteboard: [sg:critical?] wants to land on beta with 653238 → [sg:critical?]
Depends on: 662186
Group: core-security
Depends on: 670324
Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0

While runing testcase 2 on Fx 6.0b3 and 8.0a1 this is what I get in the error console:

Error: [Exception... "Component returned failure code: 0x80004003 (NS_ERROR_INVALID_POINTER) [nsIURI.equals]"  nsresult: "0x80004003 (NS_ERROR_INVALID_POINTER)"  location: "JS frame :: chrome://browser/content/browser.js :: <TOP_LEVEL> :: line 338"  data: no]
Source File: chrome://browser/content/browser.js
Line: 338

Is this the expected behavior?
That's an existing bug that's unrelated to the changes here. I filed bug 675037.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: