Last Comment Bug 650732 - Crash [@ nsSMILInstanceTime::HandleChangedInterval(nsSMILTimeContainer const*, int, int) ] [@ nsSMILInstanceTime::HandleChangedInterval ]
: Crash [@ nsSMILInstanceTime::HandleChangedInterval(nsSMILTimeContainer const*...
Status: VERIFIED FIXED
[sg:critical?]
: crash, crashreportid, testcase
Product: Core
Classification: Components
Component: SVG (show other bugs)
: Trunk
: All All
: -- critical (vote)
: mozilla7
Assigned To: Brian Birtles (:birtles)
:
Mentors:
http://imgh.us/1400.svgz
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-04-17 23:10 PDT by Daniel Holbert [:dholbert]
Modified: 2011-08-24 04:41 PDT (History)
6 users (show)
bbirtles: in‑testsuite+
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
wontfix
+
fixed
fixed
wanted
unaffected
unaffected


Attachments
Reduced test case -- will probably cause a crash on unload (386 bytes, image/svg+xml)
2011-05-30 18:56 PDT, Brian Birtles (:birtles)
no flags Details
Crashtest -- will almost definitely cause a crash! (1.42 KB, image/svg+xml)
2011-05-30 18:57 PDT, Brian Birtles (:birtles)
no flags Details
Part 1 - Copy list of times so interval can be safely deleted v1a (17.37 KB, patch)
2011-06-02 19:54 PDT, Brian Birtles (:birtles)
no flags Details | Diff | Splinter Review
Part 2 - Don't allow instance times that are dependent on themselves v1a (16.23 KB, patch)
2011-06-02 19:59 PDT, Brian Birtles (:birtles)
no flags Details | Diff | Splinter Review
Part 1 - Copy list of times so interval can be safely deleted v1b (17.49 KB, patch)
2011-06-07 17:55 PDT, Brian Birtles (:birtles)
dholbert: review+
Details | Diff | Splinter Review
Part 2 - Don't allow instance times that are dependent on themselves v1b (16.34 KB, patch)
2011-06-07 17:56 PDT, Brian Birtles (:birtles)
no flags Details | Diff | Splinter Review
Part 3 - Remove all dependent times in one hit v1a (3.09 KB, patch)
2011-06-07 17:57 PDT, Brian Birtles (:birtles)
dholbert: review+
Details | Diff | Splinter Review
Crash test - to be checked in later (2.15 KB, patch)
2011-06-07 17:58 PDT, Brian Birtles (:birtles)
dholbert: review+
Details | Diff | Splinter Review
Part 1 - Copy list of times so interval can be safely deleted v1c, r=dholbert (18.01 KB, patch)
2011-06-09 17:35 PDT, Brian Birtles (:birtles)
christian: approval‑mozilla‑aurora+
bbirtles: checkin+
Details | Diff | Splinter Review
Crash test - to be checked in later, v1b, r=dholbert (2.06 KB, patch)
2011-06-09 17:44 PDT, Brian Birtles (:birtles)
no flags Details | Diff | Splinter Review
Part 3 - Remove all dependent times in one hit v1b, r=dholbert (3.10 KB, patch)
2011-06-09 17:48 PDT, Brian Birtles (:birtles)
no flags Details | Diff | Splinter Review

Description Daniel Holbert [:dholbert] 2011-04-17 23:10:20 PDT
Mardeg reported this crash in IRC: bp-78a2bf1e-bcc7-4b02-a751-555a32110417

He doesn't have STR, but he says he'd just closed a tab with animated SVG before crashing.

Looking at the stack trace, it's cycle-collection triggering nsSMILTimedElement::Unlink, which eventually triggers nsSMILInstanceTime::HandleChangedInterval, which hits this line:
>141   PRBool objectChanged = mCreator->DependsOnBegin() ? aBeginObjectChanged :
>142                                                       aEndObjectChanged;
...and crashes on a null-deref, since mCreator is apparently null.
Comment 1 Daniel Holbert [:dholbert] 2011-04-17 23:11:37 PDT
Mardeg says http://imgh.us/1400.svgz is the testcase, and it crashes on closing tab some fraction of the time.
Comment 2 Daniel Holbert [:dholbert] 2011-04-17 23:15:57 PDT
So presumably, in the crash conditions, we're calling one of the following three methods before HandleChangedInterval -- these three methods all set mCreator to nsnull, thereby setting us up to crash in HandleChangedInterval on that "mCreator->DependsOnBegin()" call:
  nsSMILInstanceTime::Unlink()
  nsSMILInstanceTime::HandleDeletedInterval()
  nsSMILInstanceTime::HandleFilteredInterval()
Comment 3 Daniel Holbert [:dholbert] 2011-04-17 23:21:45 PDT
I just reproduced the crash on Linux at the URL in comment 1 after a few tries (started up firefox, with the URL loaded from sessionstore, and then quit, and it crashed)
 bp-2125cabd-29b2-42c2-8050-716d92110417
 Mozilla/5.0 (X11; Linux i686; rv:6.0a1) Gecko/20110417 Firefox/6.0a1
Comment 4 Daniel Holbert [:dholbert] 2011-04-17 23:34:06 PDT
I can reproduce this pretty reliably (3 out of 4 tries just now) with these STR:
 1. Open 2 tabs: http://imgh.us/1400.svgz and http://www.google.com/
 2. Focusing the first tab, hit "Ctrl+W" followed by "Ctrl+Shift+T" (close tab, bring it back)
 3. Perform previous step a total of ~4 times. (quickly is fine - no need to wait between closing & reopening & closing etc)
 4. Wait a few seconds.

---> CRASH
  bp-4de87ba2-4512-4d4d-ab8e-069cf2110417
  bp-abe4f27d-695c-42ee-b3b0-0df022110417
  bp-5bb0573e-dc1a-4f36-80dc-45d122110417
Comment 5 Daniel Holbert [:dholbert] 2011-04-17 23:54:54 PDT
From playing around with this in a debug build, I just got this crash stack:

#0  0xb53a0e1e in nsTArray_base<nsTArrayDefaultAllocator>::Length (this=0x9ed93b18) at ../../../dist/include/nsTArray.h:139
#1  0xb61f9f61 in nsTArray<nsRefPtr<nsSMILInstanceTime>, nsTArrayDefaultAllocator>::ElementAt (this=0x9ed93b18, i=0) at ../../dist/include/nsTArray.h:455
#2  0xb61f9e32 in nsTArray<nsRefPtr<nsSMILInstanceTime>, nsTArrayDefaultAllocator>::operator[] (this=0x9ed93b18, i=0) at ../../dist/include/nsTArray.h:488
#3  0xb61f95e7 in nsSMILInterval::NotifyChanged (this=0x9ed93b10, aContainer=0x0) at ../../../mozilla/content/smil/nsSMILInterval.cpp:83
#4  0xb620488d in nsSMILTimedElement::NotifyChangedInterval (this=0xaaf23bf0) at ../../../mozilla/content/smil/nsSMILTimedElement.cpp:2037
#5  0xb6203f2a in nsSMILTimedElement::UpdateCurrentInterval (this=0xaaf23bf0, aForceChangeNotice=0) at ../../../mozilla/content/smil/nsSMILTimedElement.cpp:1833
#6  0xb62002d7 in nsSMILTimedElement::RemoveInstanceTimesForCreator (this=0xaaf23bf0, aCreator=0x9ee0c100, aIsBegin=1) at ../../../mozilla/content/smil/nsSMILTimedElement.cpp:405
#7  0xb6207710 in nsSMILTimeValueSpec::UnregisterFromReferencedElement (this=0x9ee0c100, aElement=0x9ef9c5c0) at ../../../mozilla/content/smil/nsSMILTimeValueSpec.cpp:305
#8  0xb6207615 in nsSMILTimeValueSpec::Unlink (this=0x9ee0c100) at ../../../mozilla/content/smil/nsSMILTimeValueSpec.cpp:256
#9  0xb6201f59 in nsSMILTimedElement::Unlink (this=0xaaf23bf0) at ../../../mozilla/content/smil/nsSMILTimedElement.cpp:1145
#10 0xb61ddff9 in nsSMILTimedElement::DissolveReferences (this=0xaaf23bf0) at ../../../../../mozilla/content/svg/content/src/../../../smil/nsSMILTimedElement.h:343
#11 0xb61dd8b2 in nsSVGAnimationElement::UnbindFromTree (this=0xaaf23b80, aDeep=1, aNullParent=0) at ../../../../../mozilla/content/svg/content/src/nsSVGAnimationElement.cpp:321
#12 0xb59da960 in nsGenericElement::UnbindFromTree (this=0xa8161ca0, aDeep=1, aNullParent=1) at ../../../../mozilla/content/base/src/nsGenericElement.cpp:3073
#13 0xb5a1d7f0 in nsStyledElementNotElementCSSInlineStyle::UnbindFromTree (this=0xa8161ca0, aDeep=1, aNullParent=1) at ../../../../mozilla/content/base/src/nsStyledElement.cpp:251
#14 0xb59de268 in nsGenericElement::cycleCollection::Unlink (this=0xb77803c0, p=0x9ed8e120) at ../../../../mozilla/content/base/src/nsGenericElement.cpp:4299

When we crash, the nsSMILInterval at stacklevel 3 has all of its member data pointers set to 0x5a5a5a5a. (including its nsTArray mDependentTimes's mHdr pointer -- which is why we crash.)
Comment 6 Daniel Holbert [:dholbert] 2011-05-24 19:58:38 PDT
This still reproduces in latest nightly, btw.
bp-ff40b963-8670-45b0-840f-5e4ae2110524
Mozilla/5.0 (X11; Linux i686; rv:6.0a1) Gecko/20110524 Firefox/6.0a1
Comment 7 Daniel Holbert [:dholbert] 2011-05-24 20:04:50 PDT
flagging as security-sensitive per comment 5.  (and un-hiding comment 5)
Comment 8 Brian Birtles (:birtles) 2011-05-24 20:15:34 PDT
I will look into this in a couple of days' time (27th). If anyone wants to look into it before then feel free to take it from me.
Comment 9 Brian Birtles (:birtles) 2011-05-30 18:56:21 PDT
Created attachment 536209 [details]
Reduced test case -- will probably cause a crash on unload

Reduced test case. Will probably cause a crash!
Comment 10 Brian Birtles (:birtles) 2011-05-30 18:57:46 PDT
Created attachment 536210 [details]
Crashtest -- will almost definitely cause a crash!

Similar to the reduced test case, this simply adds and removes one of the animations over and over hopefully triggering a crash. (Crashes 100% of the time for me.)
Comment 11 Brian Birtles (:birtles) 2011-05-30 19:14:45 PDT
After some analysis I've arrived at the following minimal test case:

    <set id="a" attributeName="fill" to="blue"
      begin="6s" end="986s"/>
    <set id="b" attributeName="fill" to="orange"
      begin="a.begin+69.3s;b.begin+700s" dur="700s" end="a.end"/>
    <set id="c" attributeName="fill" to="yellow"
      begin="0s;b.begin+700s"/>

In the above arrangement, removing element "b" before the others will trigger a crash about 50% of the time.

The reasoning is as follows:

1. In unlinking "b" we will dissolve the syncbase time references "a.begin+69.3s" and "b.begin+700s".
2. When we drop "a.begin+69.3s" we update the current interval (now from b.begin+700s to 986s).
3. After updating the interval we inform those times that are dependent on b's interval of which there are two. One belonging to "b" and one belonging to "c".
4. If we notify the time belonging to "b" first it will be updated again to b.begin+700s+700s which puts it beyond it's only end time ("a.end" = 986s) which by SMIL's rules means there is no valid interval.
5. Since "b" is still in the waiting state the interval will be deleted.
6. When we go back to the interval to finish notifying the other dependent times (c's dependent time in this case), the interval has already been deleted.

Note:
- If we make the times in the test case shorter it is harder to reproduce since "b" may no longer be in the "waiting" state and the interval can't be deleted.
- If we remove "c" we will fail to notice that the interval has been deleted when we return to it since we'll just return from the NotifyChange method without doing much more.
- I think the test crashes about 50% of the time because it depends what order we update the interval's two dependent times in. I think so anyway but haven't confirmed this yet.

Possible steps for fixing:
1. We shouldn't be deleting an interval that is in the call stack. One approach is to make intervals ref-counted and add a kungFuDeathGrip to it before calling. An alternative approach which I might try is checking that nsSMILTimedElement::UpdateCurrentInterval isn't called recursively on the same nsSMILTimedElement. If we detect such a case, set a flag, and re-run the method again after it completes the first time. Or something of that nature anyway.

2. This test produces a bizarre situation where an instance time is directly dependent on itself. I don't think that makes sense and we should probably detect it and break the cycle. It's fine for self-referential timing (as in <animate id="a" begin="a.begin+2s" .../>), but an actual instance time being dependent on itself doesn't seem useful.

3. As an optimisation, when we unlink an element we might be able to do a reset step first that just clears out all instance times with a creator set before we clear the time value spec objects. That would hopefully save a lot of unnecessary interval updating.
Comment 12 Johnny Stenback (:jst, jst@mozilla.com) 2011-06-02 13:52:34 PDT
Brian, am I correct in assuming that this is happening on trunk, and not some older branch(es)?
Comment 13 Daniel Holbert [:dholbert] 2011-06-02 13:56:27 PDT
This affects Firefox 4 and newer (including trunk).
Comment 14 Brian Birtles (:birtles) 2011-06-02 19:54:33 PDT
Created attachment 537065 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1a

Part 1, implements step 1 from comment 11. The array of instance times is now copied so that even if the interval gets deleted whilst we're iterating over the times we'll be safe.

I've also taken it a little bit further and identified some other situations where we're not in an entirely safe state for doing notifications and fixed them too.

I'll run this (and the next patch) through Try before putting up for review.
Comment 15 Brian Birtles (:birtles) 2011-06-02 19:59:11 PDT
Created attachment 537066 [details] [diff] [review]
Part 2 - Don't allow instance times that are dependent on themselves v1a

Implements step 2 from comment 11.

Detects when we're about to use an instance time as the same endpoint of an interval that it's already dependent on and avoids using such an instance time. This sort of detail simply isn't covered by SMIL/SVG (which basically ignore dynamic document changes altogether) but I've produced a couple of test cases which I think demonstrate why this new behaviour is correct.

This patch alone also fixes the crash (as does part 1 alone).
Comment 16 Brian Birtles (:birtles) 2011-06-07 17:55:29 PDT
Created attachment 537921 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1b

Just updating header information so this can be checked in.
Comment 17 Brian Birtles (:birtles) 2011-06-07 17:56:12 PDT
Created attachment 537922 [details] [diff] [review]
Part 2 - Don't allow instance times that are dependent on themselves v1b

Just updating header information
Comment 18 Brian Birtles (:birtles) 2011-06-07 17:57:31 PDT
Created attachment 537924 [details] [diff] [review]
Part 3 - Remove all dependent times in one hit v1a

This is really an optimisation and probably deserves a separate bug. Let me know what you think.
Comment 19 Brian Birtles (:birtles) 2011-06-07 17:58:58 PDT
Created attachment 537925 [details] [diff] [review]
Crash test - to be checked in later

Crash test. To be checked in later.
Comment 20 Brian Birtles (:birtles) 2011-06-07 18:03:00 PDT
I've put up a series of patches for review.

The three different patches all fix the crash if applied in isolation. That is, it is fixed three different ways. However, part 1 really addresses the root problem. I can file parts 2 and 3 as separate bugs. Part 3 is really just an optimisation with the side-effect of fixing this bug.

Parts 1 & 2 (applied together) have passed through Try server without issue.
Comment 21 Daniel Holbert [:dholbert] 2011-06-09 14:18:45 PDT
Comment on attachment 537921 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1b

>diff --git a/content/smil/nsSMILInterval.h b/content/smil/nsSMILInterval.h
>+  void GetDependentTimes(nsTArray<nsRefPtr<nsSMILInstanceTime> >& aTimes);
[...]
>+nsSMILInterval::GetDependentTimes(InstanceTimeList& aTimes)

For consistency / cleanliness, can we use "InstanceTimeList" in the declaration of GetDependentTimes, too? (making the typedef public instead of private, if that's necessary)

>@@ -546,41 +546,51 @@ nsSMILTimedElement::DoSampleAt(nsSMILTi
>+        PRBool earlyEnd = ApplyEarlyEnd(sampleTime);
[...]
> nsSMILTimedElement::ApplyEarlyEnd(const nsSMILTimeValue& aSampleTime)
[...]
>     nsSMILInstanceTime* earlyEnd = CheckForEarlyEnd(aSampleTime);

nit: so in ApplyEarlyEnd(), |earlyEnd| is the name of an instance time (makes sense), whereas in DoSampleAt(), |earlyEnd| is the name of a PRBool.

Perhaps name the PRBool "didApplyEarlyEnd" or something else boolean-sounding, for clarity?

>+nsSMILTimedElement::NotifyChangedInterval(nsSMILInterval* aInterval,
>+  for (PRUint32 i=0; i < times.Length(); ++i)
>+  {

Nit: bump the "{" up to the previous line, & add spaces around '=' in "i=0".


>diff --git a/content/smil/nsSMILTimedElement.h b/content/smil/nsSMILTimedElement.h

>+  void GetDependentTimes(nsTArray<nsRefPtr<nsSMILInstanceTime> >& aTimes);

>+  // Notification methods. Note that dispatching these notifications can trigger
>+  // a chain reaction that results in this object being called again.

"this object being called again" doesn't make sense to me -- do you mean "recursive calls to these same functions on this same object"?

>+  // [Therefore] they should not be called until this object is in a consistent state to
>+  // receive callbacks.

So, to be sure I understand -- we *only* get problematic recursiveness if we call this when the object is in an inconsistent state?  If that's right, it might be good to state that a little more explicitly, either by clarifying the first sentence, or by adding a sentence at the end ("this will prevent the problematic chain reactions" or something like that)

r=dholbert with the above.
Comment 22 Daniel Holbert [:dholbert] 2011-06-09 14:20:17 PDT
(sorry, I accidentally double-pasted the GetDependentTimes() at two different spots in my review feedback. ignore the second one, towards the bottom. :))
Comment 23 Daniel Holbert [:dholbert] 2011-06-09 15:01:06 PDT
Comment on attachment 537925 [details] [diff] [review]
Crash test - to be checked in later

>diff --git a/content/smil/crashtests/650732-1.svg b/content/smil/crashtests/650732-1.svg
>+function add()
>+{
[...]
>+  attempts++;
>+  if (++attempts >= max_attempts) {

Looks like you're accidentally double-incrementing here. :)

Also, the control-flow isn't immediately grockable.  How about we improve it as follows, for readability:
Have load trigger a method "attemptCrash()" (instead of triggering remove), where attemptCrash looks like so:
  function attemptCrash() {
    remove();
    add();
    if (++attempts >= max_attempts) {
      document.documentElement.removeAttribute("class");
    } else {
      setTimeout(remove, 0);
    }
  }
That way, "remove()" and "add()" can just do what their names say, and no more. (In particular -- they'd no longer call each other, and "add" no longer does the special increment-and-check-if-we're-done).

r=dholbert with the above.
Comment 24 Daniel Holbert [:dholbert] 2011-06-09 15:04:31 PDT
(In reply to comment #23)
>       setTimeout(remove, 0);
sorry; I meant for the setTimeout in attemptCrash to call itself, not 'remove'. It should have been:  setTimeout(attemptCrash, 0);
Comment 25 Daniel Holbert [:dholbert] 2011-06-09 15:10:23 PDT
Comment on attachment 537924 [details] [diff] [review]
Part 3 - Remove all dependent times in one hit v1a

> void
> nsSMILTimedElement::Unlink()
> {
>+  // As an speed up, go through and remove all instance times that have

s/an/a/
Comment 26 Daniel Holbert [:dholbert] 2011-06-09 15:45:38 PDT
(In reply to comment #20)
> I can file parts 2 and 3 as separate bugs. Part 3 is really
> just an optimisation with the side-effect of fixing this bug.

I haven't looked at patch 2 in detail yet, but I think I _would_ prefer to have that one be a separate (non-security-sensitive?) bug, given that it's a nontrivial change that's orthogonal to the other patches here & not necessary for fixing this crash.  That way, in the unlikely event that patch 1 or 2 has to be backed out due to regressions, we won't end up with a franken-bug with some patches landed & some patches backed out.

I'm less concerned with patch 3 getting its own bug, since it's minimal and (IIUC?) has no intended effect on behavior (aside from a perf win & fixing this bug as a side effect).  You can split it to a separate bug if you like, but I don't particularly care. :)

Side question: which patch(es) here should we try to get landed on the aurora & beta branches? Perhaps just patch 3, since it's minimal & sufficient to fix the crash?  (In the absence of another crasher testcase that specifically requires one of the other patches, I'm currently leaning against backporting them...)
Comment 27 Brian Birtles (:birtles) 2011-06-09 17:35:01 PDT
Created attachment 538394 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1c, r=dholbert

Thanks Daniel for getting back to me on all this! I've fixed the items you picked up but just want to check about the last comment.

(In reply to comment #21)
> >+  // Notification methods. Note that dispatching these notifications can trigger
> >+  // a chain reaction that results in this object being called again.
> 
> "this object being called again" doesn't make sense to me -- do you mean
> "recursive calls to these same functions on this same object"?

It's not necessarily recursive in terms of calling the same functions again, but it could be.

e.g. we can have

TimedElementA::NotifyChangedInterval
  -> dependent times::HandleChangedInterval
    -> all sorts of updates to time value specs etc.
      -> TimedElementA::UpdateInstanceTime
        -> TimedElementA::UpdateCurrentInterval

From that point we *may* call NotifyChangedInterval recursively but even if we don't, we need to be sure we're in a consistent state.

For example, in principal, mCurrentInterval should only be set whilst the state is WAITING or ACTIVE. So we need to be careful not to do:

  mElementState = POSTACTIVE;
  NotifyChangedInterval();
  mCurrentInterval = nsnull;

If we do that we'll violate assertions in UpdateCurrentInterval, not to mention overwriting the most recent state.

So it's not really recursion that's the problem but just callbacks to the same object.

> So, to be sure I understand -- we *only* get problematic recursiveness if we
> call this when the object is in an inconsistent state?

If "problematic recursiveness" means crashes like this bug then not really. This bug is fixed by copying the list of instance times out of the interval before doing change notification in case the interval disappears. It's not fixed by ensuring we're in a consistent state for doing notifications.

However in fixing the change notification problem I noticed that there were one or two instances (particularly the early-end case) where we were assuming too much about the state of the element after doing notifications.

I've had a go at updating the comment but let me know if you can think of further improvements to it.
Comment 28 Brian Birtles (:birtles) 2011-06-09 17:44:08 PDT
Created attachment 538396 [details] [diff] [review]
Crash test - to be checked in later, v1b, r=dholbert

Address review feedback.
Comment 29 Brian Birtles (:birtles) 2011-06-09 17:48:04 PDT
Created attachment 538397 [details] [diff] [review]
Part 3 - Remove all dependent times in one hit v1b, r=dholbert

Address review feedback.
Comment 30 Brian Birtles (:birtles) 2011-06-09 18:09:55 PDT
Comment on attachment 537922 [details] [diff] [review]
Part 2 - Don't allow instance times that are dependent on themselves v1b

Moved to bug 663288
Comment 31 Brian Birtles (:birtles) 2011-06-09 18:18:09 PDT
(In reply to comment #26)
> I haven't looked at patch 2 in detail yet, but I think I _would_ prefer to
> have that one be a separate (non-security-sensitive?) bug, given that it's a
> nontrivial change that's orthogonal to the other patches here & not
> necessary for fixing this crash.  That way, in the unlikely event that patch
> 1 or 2 has to be backed out due to regressions, we won't end up with a
> franken-bug with some patches landed & some patches backed out.

Good call. I've moved part 2 to bug 663288.

> I'm less concerned with patch 3 getting its own bug, since it's minimal and
> (IIUC?) has no intended effect on behavior (aside from a perf win & fixing
> this bug as a side effect).

That's correct.

> Side question: which patch(es) here should we try to get landed on the
> aurora & beta branches? Perhaps just patch 3, since it's minimal &
> sufficient to fix the crash?  (In the absence of another crasher testcase
> that specifically requires one of the other patches, I'm currently leaning
> against backporting them...)

Patch 1 is the real fix. If we just push patch 3 then I suspect there *may* be other test cases that could produce a crash along similar lines--but I've yet to find them. If it's possible to get patch 1 landed then I'd prefer that, but I'm ok with just patch 3 if you judge patch 1 as too risky.

(I haven't landed anything for a while so I don't have a feel for what's an acceptable level of risk on the new branches.)
Comment 32 Daniel Holbert [:dholbert] 2011-06-10 00:51:32 PDT
(In reply to comment #27)
> I've had a go at updating the comment but let me know if you can think of
> further improvements to it.

Looks great now - thanks for the clarification!
Comment 33 Daniel Holbert [:dholbert] 2011-06-10 01:01:05 PDT
(In reply to comment #31)
> > Side question: which patch(es) here should we try to get landed on the
> > aurora & beta branches?
> 
> Patch 1 is the real fix. If we just push patch 3 then I suspect there *may*
> be other test cases that could produce a crash along similar lines

Ok, fair enough.

(In reply to comment #31)
> (I haven't landed anything for a while so I don't have a feel for what's an
> acceptable level of risk on the new branches.)

Due to the sg nature of this bug, I suspect patch 1 would be likely to get approval-aurora+.  approval-beta may be harder to argue for, though, as this is a nontrivial change, & beta is kind of like an rc. (and it'll update to the code in aurora after 6 weeks anyway.)

Once this is landed on trunk, I'd suggest requesting approval-aurora ASAP.  The approval-queue-triagers suggest that you include a comment with a brief risk/reward assessment when you request approval, too.
Comment 34 Daniel Holbert [:dholbert] 2011-06-10 01:04:12 PDT
(In reply to comment #33)
> approval-beta may be harder to argue for, though

(ah -- indeed, I just noticed that jst had already set this as "status-firefox5:wontfix".)
Comment 35 Mardeg 2011-06-10 01:13:57 PDT
(In reply to comment #34)
> (ah -- indeed, I just noticed that jst had already set this as
> "status-firefox5:wontfix".)

and "status2.0:wanted" so there's a possibility it could land on the Firefox 4 branch?
Comment 36 Daniel Holbert [:dholbert] 2011-06-10 01:18:36 PDT
(yeah, I think that status2.0 value may have been accidental/bogus.  Last I heard, the next update for Firefox 4 users will be Firefox 5.)
Comment 37 Brian Birtles (:birtles) 2011-06-13 20:16:33 PDT
It turns out patch 3 is bad. It causes SMIL crashtest 596796-1.svg to fail. Currently debugging. Patch 1 should still be fine.
Comment 38 Brian Birtles (:birtles) 2011-06-14 17:36:06 PDT
Comment on attachment 538397 [details] [diff] [review]
Part 3 - Remove all dependent times in one hit v1b, r=dholbert

Moving this patch to bug 664343. It needs to be re-thought. For now, patch 1 is sufficient to fix this bug.
Comment 39 Brian Birtles (:birtles) 2011-06-14 17:39:27 PDT
Pushed to m-i:
http://hg.mozilla.org/integration/mozilla-inbound/rev/2fbc5d9b566b
Comment 40 Brian Birtles (:birtles) 2011-06-16 03:44:21 PDT
This has now landed on m-c:
http://hg.mozilla.org/mozilla-central/rev/2fbc5d9b566b

Will request approval to land on aurora.
Comment 41 Brian Birtles (:birtles) 2011-06-16 03:49:10 PDT
Comment on attachment 538394 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1c, r=dholbert

Requesting approval to land on aurora. Potentially exploitable, causes crash with attached test case (attachment 538396 [details] [diff] [review]), sg:crit?, landed on m-c and seems to be behaving.
Comment 42 Johnny Stenback (:jst, jst@mozilla.com) 2011-06-16 14:48:46 PDT
This needs to bake some days before we'll consider this for aurora. Leaving the patch nominated. An risk/benefit analysis would be helpful here though, as the patch is a bit on the large side and assessing risk purely based on reading the patch isn't trivial.
Comment 43 Daniel Veditz [:dveditz] 2011-06-16 14:49:10 PDT
Comment 40 means this is FIXED on trunk/mozilla7
Comment 44 Brian Birtles (:birtles) 2011-06-16 17:10:58 PDT
(In reply to comment #42)
> This needs to bake some days before we'll consider this for aurora. Leaving
> the patch nominated. An risk/benefit analysis would be helpful here though,
> as the patch is a bit on the large side and assessing risk purely based on
> reading the patch isn't trivial.

I'm not sure exactly what more to add. The patch largely just moves code from one point to another. However, it's quite a complicated part of the code (it manages timing dependencies which can be arbitrarily complex) so there is the potential that this patch alters the behaviour such that some instances of cyclic dependencies are not correctly detected and broken resulting in infinite recursion.

In summary:

Benefit:
* Fixes known crash and potential security hole
* Fixes further potentially problematic behaviour (notably the changes to early ending)
Risk:
* Alters the behaviour such that some time dependency arrangements result in infinite recursion

I think that if this bakes for a week or so then the risk is low. In general this patch produces a more stable system and passes our existing crashtests/reftests/mochitests which include a number of test cases for cyclic dependencies.

In bug 664343 I hope to add some further recursion checking to catch the kind of problem this patch could trigger.
Comment 45 christian 2011-06-22 11:54:00 PDT
Comment on attachment 538394 [details] [diff] [review]
Part 1 - Copy list of times so interval can be safely deleted v1c, r=dholbert

A sg:crit that only touches SMIL is great to fix. Approved for mozilla-aurora.
Comment 46 Johnny Stenback (:jst, jst@mozilla.com) 2011-06-23 13:11:46 PDT
Brian, dholbert, can you guys land this on aurora?
Comment 47 Daniel Holbert [:dholbert] 2011-06-23 14:24:16 PDT
Landed on aurora: http://hg.mozilla.org/releases/mozilla-aurora/rev/d4509c8a19a9
Comment 48 Brian Birtles (:birtles) 2011-08-23 16:36:48 PDT
Crashtest on m-i (now that this fix has shipped):
http://hg.mozilla.org/integration/mozilla-inbound/rev/07ae78c82432
Comment 49 Daniel Holbert [:dholbert] 2011-08-24 01:52:12 PDT
Un-hiding (now that this fix has shipped)
Comment 50 Marco Bonardo [::mak] (Away 6-20 Aug) 2011-08-24 01:53:46 PDT
http://hg.mozilla.org/mozilla-central/rev/07ae78c82432
Comment 51 Vlad [QA] 2011-08-24 02:36:15 PDT
I have run both testcases in the attachments and I've got no crashes on 
Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0 beta 1

Setting resolution to Verified Fixed.

Note You need to log in before you can comment on or make changes to this bug.