Closed Bug 641388 Opened 14 years ago Closed 14 years ago

Crash in nsRefreshDriver::Notify with assertion in CSSFrameConstructor

Categories

(Core :: General, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- Macaw+
status2.0 --- .1-fixed
status1.9.2 --- unaffected
status1.9.1 --- unaffected

People

(Reporter: nils, Assigned: dholbert)

References

Details

(4 keywords, Whiteboard: [sg:critical?])

Attachments

(3 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0

Description:
        The attached testcase has been reduced from a very large testcase (~3 Megs). The resulting testcase is still quite large (~80 lines of JavaScript), however further reduction made were not possible or made it very hard to reproduce. The reason for
 the crash is not obvious from the testcase.

        Following assertion can be observed on a Linux debug build:
        ASSERTION: Dying in the middle of our own update?: mUpdateCount == 0, file ../../../layout/base/nsCSSFrameConstructor.h, line 89

Affected Versions:
        Firefox 4.0 RC1

Testcase:
        The testcase is attached as an HTML file. It will crash the browser on opening after several reloads.

Testcase Notes:
        The testcase implements two function to improve the reproducability of the testcase. gc() triggers garbage collection. It requires Jesse's quitter extension (https://www.squarefree.com/extensions/quitter.xpi).
        fill() works similar to a heap spray and fills the memory with a byte pattern.

Stack Backtrace:
Following crash has been observed on Windows. This indicates that code exection is possible. Similar behaviour has been seen on linux as well.

Windows:
(e5c.1360): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
xul!nsRefreshDriver::Notify+0x5b9:
625fa199 8b08            mov     ecx,dword ptr [eax]  ds:002b:fafafafb=????????
625fa19b 57              push    edi
625fa19c ffd1            call    ecx

xul!nsRefreshDriver::Notify+0x5b9
xul!nsTimerImpl::Fire+0x103
xul!nsTimerEvent::Run+0x20
xul!nsThread::ProcessNextEvent+0x2d7
xul!mozilla::ipc::MessagePump::Run+0x6e
xul!MessageLoop::RunInternal+0x11
xul!MessageLoop::RunHandler+0x1d
nspr4!PR_GetThreadPrivate+0x20
xul!nsBaseAppShell::Run+0x34
xul!nsAppShell::Run+0x42
xul!nsAppStartup::Run+0x1e
xul!XRE_main+0xdec
firefox!wmain+0x34c
firefox!__tmainCRTStartup+0x152
kernel32!BaseThreadInitThunk+0xe
ntdll32!__RtlUserThreadStart+0x70
ntdll32!_RtlUserThreadStart+0x1b

VulnDev reference    : vd11002

reported by nils of vulndev ltd.

Reproducible: Always
This testcase looks quite different from the first testcase and it does not trigger the same assertion in a debug build. However it triggers the same crash (stack backtrace) on windows.

It still might be a separate issue, hard to tell until we know more about the reasons for the crashes.
With the second testcase, I get an abort when a refresh driver is destroyed before its observers have been unregistered.

The observer that's still around is an nsSMILTimeContainer.

The testcase does use some <svg:animate> stuff.
To be clear, that abort is a debug abort due to an NS_ABORT_IF_FALSE.  But it indicates that something is not unregistering correctly here.

For the first testcase, I don't see why mUpdateCount is 1 there.  The stack shows us removing an element which has an iframe as a descendant from the DOM.  The frame constructor with mUpdateCount == 1 is the frame constructor for that iframe's document.

And then eventually I do crash when reloading the page, because one of the refresh observers in the refresh driver is a dead pointer....
Note that the mUpdateCount being nonzero is pretty weird, since it's maintaned by stack objects.
(In reply to comment #3)
> With the second testcase, I get an abort when a refresh driver is destroyed
> before its observers have been unregistered.
[...]
> that abort is a debug abort due to an NS_ABORT_IF_FALSE.

With both attached testcases, my debug build crashes about 10 seconds after loading.  It's a real crash -- not an abort -- in both cases.  (I did also see one instance of the abort that bz mentioned, though, when I loaded the second testcase and then reloaded a few times.)

The crashes are from trying to create a new nsRefPtr, which calls the virtual function "AddRef", on a nsARefreshObserver* whose vtable pointer is garbage:
> #5  0x00007f097cdcdb9d in nsRefPtr<nsARefreshObserver>::nsRefPtr (this=0x7fff6692d1b0, aRawPtr=0x7f09613867e0) at nsAutoPtr.h:993
> (gdb) p *mRawPtr
> $2 = {
>   _vptr.nsARefreshObserver = 0x5a5a5a5a5a5a5a5a
> }

This is scary - flagging as [sg:critical?]
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hardware: x86 → All
Whiteboard: [sg:critical?]
Version: unspecified → Trunk
I don't have a 1.9.2 build handy, but I think it's reasonable to assume this is a regression, since the crash seems to depend on SMIL which we didn't have in 1.9.2
Summary: Crash in nsRefreshDriver::Notify with assertition in CSSFrameConstructor → Crash in nsRefreshDriver::Notify with assertion in CSSFrameConstructor
Daniel, your crash sounds about the same as what I saw at the end of comment 4.

I tried using gdb to dump out the stacks to all the nsCSSFrameConstructor::BeginUpdate calls, hoping that this would catch the errant call for me, but the bug doesn't show up if I do that.  Presumably because it makes things run too slow and the SMIL finishes "too early" or something?
Guessing at an assignee if SMIL is involved?
Assignee: nobody → dholbert
Sure, I can take this.
Status: NEW → ASSIGNED
So at least in the second case, what's happening is this:
 1) Our nsSMILAnimationController gets a call to 'NotifyRefreshDriverCreated', and it registers as a refresh-observer (via StartSampling())

 2) Our nsSMILAnimationController gets a new time-container child, which gets us to nsSMILAnimationController::AddChild. This takes the branch that sets
>     mDeferredStartSampling = PR_TRUE;
   (The idea there is that we don't have any registered animations, so we're trying to hold off on actively sampling until we get some. However, unbeknownst to us, we've actually already called StartSampling, in (1) above...)

 3) When the document gets torn down, our nsSMILAnimationController gets a call to "NotifyRefreshDriverDestroying", but that call does nothing, because 'mDeferredStartSampling' is set.  So we never unregister as a refresh observer.

This means we the nsRefreshDriver stays alive longer than our animation controller does, and so it later ends up Notify()ing us after our animation controller is already destroyed.
My first instinct is that we want to copy the
>  if (mAnimationElementTable.Count())
check in nsSMILAnimationController::AddChild() over to NotifyRefreshDriverCreated().  That should keep us from registering as a Refresh Observer in part (1) of comment 11. (We don't want to be a refresh observer at that point, if this proposed check fails, because we don't have any registered animations yet.)
Attached patch fix v1 (obsolete) — Splinter Review
This patch makes the "StartSampling" call in  nsSMILAnimationController::NotifyRefreshDriverCreated consistent with all our other calls to "StartSampling".  The idea is to only *actually* start sampling if we have registered animation elements -- and if we don't, we defer starting sampling until such an element is added.

Without the patch, my debug build (w/ jesse's dom-fuzz-lite) crashes within 10 seconds after loading either attached testcase.

With the patch, I can't get either testcase to crash. (I do still see the assertion from comment 0 in the first testcase, though -- I think that might be unrelated.)

I'm running this through mochitests + reftests, and then will request review.
Comment on attachment 522464 [details] [diff] [review]
fix v1

Passed SVG + SMIL reftests & mochitests.  Requesting review.
Attachment #522464 - Flags: review?(bzbarsky)
Comment on attachment 522464 [details] [diff] [review]
fix v1

Is there a reason not to push this logic down into StartSampling, since now every callsite has it?
Good call.  This patch shifts that logic into a wrapper function, actually, which performs the check & then the actual StartSampling call.

We need to keep one 'raw' StartSampling call, though -- that's why I didn't push the logic directly into StartSampling as you'd suggested.  We need to directly call StartSampling (and clear mDeferredStartSampling) in the one place where we say "ok, actually start sampling now" -- in nsSMILAnimationController::RegisterAnimationElement.  (That's the place where we stop deferring, if we were deferring.)

So now, with this patch, RegisterAnimationElement and MaybeStartSampling are the only methods that call StartSampling.  Everywhere else uses MaybeStartSampling.
Attachment #522464 - Attachment is obsolete: true
Attachment #522464 - Flags: review?(bzbarsky)
Attachment #522768 - Flags: review?(bzbarsky)
> nsSMILAnimationController::RegisterAnimationElement

So say RegisterAnimationElement looked like this:

     mDeferredStartSampling = PR_FALSE;
     if (mChildContainerTable.Count()) {
       // mAnimationElementTable was empty, but now we've added its 1st element
       MaybeStartSampling(GetRefreshDriverForDoc(mDocument));
     }

We'd land in MaybeStartSampling.  mDeferredStartSampling is false, so we don't take the early return.  All of the above code is after a mAnimationElementTable.PutEntry call, so I'd think mAnimationElementTable.Count() is nonzero, and MaybeStartSampling calls StartSampling.  I agree that it may be conceptually cleaner to have the separate MaybeStartSampling, and that we'd have an extra pair of branches in the above code, but the behavior would be the same.  Up to you whether you want to keep the MaybeStartSampling function.
Comment on attachment 522768 [details] [diff] [review]
fix v2: create wrapper MaybeStartSampling

r=me with the above caveat.
Attachment #522768 - Flags: review?(bzbarsky) → review+
(In reply to comment #17)
> I agree that it may be conceptually cleaner to have the
> separate MaybeStartSampling, and that we'd have an extra pair of branches in
> the above code, but the behavior would be the same.

Mm, true -- so we could effectively merge MaybeStartSampling into StartSampling and just call that function everywhere, and still get the same behavior as in the attached patch.

Still, I think I like the conceptual separation (& I'd prefer to avoid adding unnecessary checks to RegisterAnimationElement), so I think I'd opt for keeping MaybeStartSampling & StartSampling separate, the way they are in the attached patch.

Thanks for the review!
Blocks: 590425
Blocks: 647044
if this fixes 647044 as expected we should try and get it out before US tax season if possible.
blocking2.0: --- → ?
Running it through TryServer right now.
TryServer liked it. Pushed to m-c:
  http://hg.mozilla.org/mozilla-central/rev/b71e50bf9afc
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Attachment #522768 - Flags: approval2.0?
(In reply to comment #20)
> if this fixes 647044 as expected we should try and get it out before US tax
> season if possible.

We think TurboTax is using SMIL? Really?
It's feasible that they might be using SMIL-animated SVG for a throbber between steps or something...
Probably not, though.  I just clicked through a the first part of a "Try It Out" TurboTax session, and it seems to use a .gif throbber and a lot of Flash.  I didn't see anything that was obviously SVG (nor did I see any mention of SVG in 'view source').

So I think bug 641388 isn't this, but it may end up being a *similar* bug to this one, in our plugin code rather than our SVG code.
Plugins don't use refresh driver for anything, to my knowledge.
Darn - you're right, they don't seem to, based on
http://mxr.mozilla.org/mozilla-central/search?string=AddRefreshObserver
blocking2.0: ? → Macaw+
status2.0: --- → wanted
Comment on attachment 522768 [details] [diff] [review]
fix v2: create wrapper MaybeStartSampling

Approved for the mozilla2.0 repository, a=dveditz for release-drivers
Attachment #522768 - Flags: approval2.0? → approval2.0+
(In reply to comment #28)
> Should this fix both [signatures omitted for brevity]
> that seem to be rising in the last days?

That's the question that was discussed in Comment 20 & Comment 23-Comment 27.

To sum up: this patch only affects sites with SMIL animation (and subdocuments being created/destroyed on the fly).  So if e.g. TurboTax (or other high-profile sites that hit these crash signatures) uses SVG with SMIL animation, then this would help -- but as dveditz implied in Comment 23 (and I semi-confirmed in comment 25), it's pretty unlikely that that's the case.
That is to say: I don't have a lot of confidence that it'll fix those signatures.

Still a scary bug, though,, & still worth fixing on 2.0 branch.
Pushed to mozilla-2.0:
  http://hg.mozilla.org/releases/mozilla-2.0/rev/c94d64a0d540
Group: core-security
Crash tests:
https://hg.mozilla.org/integration/mozilla-inbound/rev/88e4547bdbcf
Flags: in-testsuite? → in-testsuite+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: