606932 - Hang in nsSMILAnimationController::DoMilestoneSamples

Reporter

Description

•

14 years ago

I've been seeing hangs like this for about a week or so, but I haven't found out yet how to reproduce them. This time it happened when I opened Facebook.

The symptoms are a multi-second hang where nothing at all responds, followed by a pseudo-hang for at about a minute where the window responds to some user actions but not to others: for example, scrolling works when I drag the scrollbar, but not with the mousewheel.
At some point, everything recovers and Firefox becomes usable again.

I've finally managed to get this hang in a self-built build with frame pointers and have captured meaningful profiles with Shark.

What seems to happen is that in the first phase of the hang, a huge number of Gecko events is dispatched (from nsSMILTimedElement::FireTimeEventAsync). These events are then processed during the second phase of the hang. Until we're finished with that, we only respond to new *native* events (like mouse dragging), but not to new Gecko events (like timer events that are used for scrolling in the mousewheel case).

Dispatching the events happens with the following call stack:
nsSMILTimedElement::FireTimeEventAsync
nsSMILTimedElement::DoSampleAt
nsSMILAnimationController::DoMilestoneSamples
nsSMILAnimationController::DoSample
nsSMILTimeContainer::Sample
nsSMILAnimationController::WillRefresh
nsRefreshDriver::Notify

The first phase profile spends all its time in nsSMILAnimationController::DoMilestoneSamples, presumably in the while(true) loop.

I don't know why I've ended up in SMIL-land in the first place. Do we use SMIL anywhere in the Firefox UI? Does Facebook use it? If I recall correctly, the only other tabs I had open were mxr.mozilla.org pages.

Markus Stange [:mstange]

Reporter

Comment 1

•

14 years ago

Attached file Shark profile of the first phase — Details

Markus Stange [:mstange]

Reporter

Comment 2

•

14 years ago

Attached file Shark profile of the second phase — Details

There's not much interesting stuff in here. It's 8% AsyncTimeEventRunner::Run() and the rest is in native event loop goop.

The fact that we spend so much time pleasing the native event loop instead of processing our queued Gecko events needs its own bug.

Markus Stange [:mstange]

Reporter

Comment 3

•

14 years ago

The interesting part of the first profile is this:

> + 100.0%, nsRefreshDriver::Notify(nsITimer*), XUL
> | + 100.0%, nsSMILAnimationController::WillRefresh(mozilla::TimeStamp), XUL
> | | + 100.0%, nsSMILTimeContainer::Sample(), XUL
> | | | + 100.0%, nsSMILAnimationController::DoSample(int), XUL
> | | | | + 99.7%, nsSMILAnimationController::DoMilestoneSamples(), XUL
> | | | | | + 71.6%, nsSMILTimedElement::DoSampleAt(long long, int), XUL
> | | | | | | + 25.6%, nsSMILTimedElement::FireTimeEventAsync(unsigned int, int), XUL
> | | | | | | | - 18.8%, NS_DispatchToMainThread_P(nsIRunnable*, unsigned int), XUL
> | | | | | | | - 3.2%, moz_xmalloc, libmozalloc.dylib
> | | | | | | | - 0.5%, nsRunnable::Release(), XUL
> | | | | | | |   ...
> | | | | | | - 18.5%, nsSMILTimedElement::RegisterMilestone(), XUL
> | | | | | | - 6.9%, nsSVGAnimationElement::GetTimeContainer(), XUL
> | | | | | | - 3.2%, PL_DHashTableEnumerate, XUL
> | | | | | | - 2.5%, nsSMILTimedElement::ApplyEarlyEnd(nsSMILTimeValue const&), XUL
> | | | | | | - 2.2%, void nsSMILTimedElement::RemoveInstanceTimes<(anonymous namespace)::RemoveFiltered>(nsTArray<nsRefPtr<nsSMILInstanceTime> >&, (anonymous namespace)::RemoveFiltered&), XUL
> | | | | | | - 2.1%, nsSMILTimedElement::UpdateCurrentInterval(int), XUL
> | | | | | | - 1.9%, void nsSMILTimedElement::RemoveInstanceTimes<(anonymous namespace)::RemoveReset>(nsTArray<nsRefPtr<nsSMILInstanceTime> >&, (anonymous namespace)::RemoveReset&), XUL
> | | | | | | - 1.2%, nsSMILTimedElement::FilterIntervals(), XUL
> | | | | | | - 0.7%, nsSMILTimedElement::GetNextInterval(nsSMILInterval const*, nsSMILInstanceTime const*, nsSMILInterval&) const, XUL
> | | | | | | - 0.6%, nsTArray_base::EnsureCapacity(unsigned int, unsigned int), XUL
> | | | | | | - 0.5%, nsSMILInterval::NotifyChanged(nsSMILTimeContainer const*), XUL
> | | | | | |   0.2%, nsSMILInterval::Begin(), XUL
> | | | | | | - 0.2%, nsSMILAnimationFunction::Activate(long long), XUL
> | | | | | |   0.1%, nsSMILTimeValue::CompareTo(nsSMILTimeValue const&) const, XUL
> | | | | | | - ...
> | | | | | - 19.8%, PL_DHashTableEnumerate, XUL
> | | | | | - 3.8%, nsSVGAnimationElement::GetTimeContainer(), XUL
> | | | | | - 1.4%, nsTArray_base::ShrinkCapacity(unsigned int), XUL
> | | | | |   0.3%, nsSMILTimeContainer::ParentToContainerTime(long long) const, XUL
> | | | | |   0.2%, nsGenericElement::Release(), XUL
> | | | | |   0.1%, nsSVGElement::GetOwnerSVGElement(nsIDOMSVGSVGElement**), XUL
> | | | | |   ...
> | | | |   0.1%, nsSVGAnimationElement::Release(), XUL
> | | | |   0.1%, nsSMILTimeContainer::GetCurrentTime() const, XUL
> | | | |   0.0%, nsTArray_base::nsTArray_base(), XUL
> | | | |   0.0%, nsSMILTimedElement::DoSampleAt(long long, int), XUL
> | | | |   ...

Markus Stange [:mstange]

Reporter

Comment 4

•

14 years ago

(In reply to comment #2)
> The fact that we spend so much time pleasing the native event loop instead of
> processing our queued Gecko events needs its own bug.

I've filed bug 606937 on this.

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Updated

•

14 years ago

Assignee: nobody → birtles

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 5

•

14 years ago

Nominating as a blocker for now since it apparently causes a hang when opening facebook.

Status: NEW → ASSIGNED

blocking2.0: --- → ?

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 6

•

14 years ago

I'm starting to suspect this has the same root cause as bug 606942. Specifically, I wonder if, as Daniel suggested in bug 606942 comment 3 the fish.svg image is not getting paused when the referring document is in the bfcache.

Then I wonder if on top of that (or perhaps because of that) the time stamps get out of whack and we end up skipping a long way through time and having to spit out all the events that take place in the meantime.

In the fish.svg there are a couple of infinitely repeating animations of 1s duration. If we do a seek, that's fine since events are suppressed during a seek. However, if for some reason apart from doing a seek we get one sample at say, t=5s and the next at t=500s, those animations will register a milestone at every second because they need to fire a repeatEvent every second and processing the sample at t=500s will take a really long time and probably produce the kind of symptoms described here.

If we're paused, however, there shouldn't be a problem since the time will stop (i.e. the next sample will be at t=5s or thereabouts).

So I'm guessing that you've loaded up planet.mozilla.org at some point in the browsing history which includes fish.svg as an <img> and then either
a) We're not pausing when going into bfcache and then perhaps getting throttled really low so there's a long time between samples?
b) Something's wrong with our pause logic and we're processing a sample whilst paused but calculating a wrong timestamp?
c) Something entirely different such as cloning animation elements?

I'm inclined to suspect (a) since we already know something's a little amiss there as per bug 606942. If that's the case then this should also be fixed by that bug.

So for now I'll wait to see what Daniel turns up in bug 606942 but when I get a chance I'll try to add some debugging output and see if I can reproduce this problem (so far I haven't been able to).

Markus Stange [:mstange]

Reporter

Comment 7

•

14 years ago

(In reply to comment #6)
> So I'm guessing that you've loaded up planet.mozilla.org at some point in the
> browsing history

Definitely. Since filing this bug, I've hit the hang another three or four times, and the one time I've even been able to catch it in the debugger it was fish.svg that was responsible.

Daniel Holbert [:dholbert]

Comment 8

•

14 years ago

I can't reproduce the hang described here, but I suspect that the patch on Bug 606942 should fix this.

My blog post has scrolled off of planet.mozilla.org by now, but the post is still up on http://blog.dholbert.org/ of course :) and that should work just as well for reproducing this. (at least, it reproduces bug 606942 just fine for me.)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

14 years ago

blocking2.0: ? → final+

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

14 years ago

Depends on: 606942

Whiteboard: [depends on 606942]

Daniel Holbert [:dholbert]

Comment 9

•

14 years ago

I just landed Bug 606942's patch, so hopefully that fixes this.

Markus, can you post here if you see this happening anymore after tomorrow's nightly?  (Maybe visit http://blog.dholbert.org every once in a while to hit the fish.svg image that was implicated in Comment 7.)

Thanks again for filing these bugs!

Markus Stange [:mstange]

Reporter

Comment 10

•

14 years ago

I haven't hung yet, so it seems to be fixed. Cool!

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Markus Stange [:mstange]

Reporter

Comment 11

•

14 years ago

Oh no, there it is again. :-(

Between closing the http://blog.dholbert.org/ tab and the hang I've sent my Mac to sleep at least twice. Maybe that's related? I'm thinking along the lines of "Firefox wakes up, notices that it hasn't sampled for several hours and tries to catch up like mad"

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 12

•

14 years ago

That's definitely possible. In fact, I was wondering about that possibility just recently. I guess a similar situation is if the user changes their system time to a time in the future. I'm not sure what mozilla::TimeStamp will give us in that case but it could produce a similar problem. We really want to base our sample time on uptime (not including sleep time).

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Updated

•

14 years ago

Whiteboard: [depends on 606942]

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 13

•

14 years ago

I don't think "uptime" is necessarily a well-defined concept.

Why is the amount of work the SMIL engine does proportional to the distance in time between two samples? Can we fix that?

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 14

•

14 years ago

(In reply to comment #13)
> Why is the amount of work the SMIL engine does proportional to the distance in
> time between two samples? Can we fix that?

Presently, if we have a sample at t=5s and t=500s we normally won't skip immediately to t=500s but instead we might need to make a few stops along the way. For example, if animation A ends an interval at t=30s it will receive a special intermediate "milestone sample" at that time. Likewise for animation B that begins at t=40s. (Note that we don't sample the whole model at those times, just the animations that asked for it, i.e. just A in the first case, B in the second).

This is needed for two reasons:
a) Syncbase dependencies--if other animations are dependent on A's timing, then we don't know what the effect will be of A ending its interval at t=30s until it happens. SMIL discourages us from trying to guess as well since all sorts of interactive events might happen before that time.
b) Events--we are supposed to dispatch events every time an animation starts, repeats, or ends.

As for the question of can we fix it, it depends if it's possible to detect if we have listeners registered against our events. See the comment here:
http://mxr.mozilla.org/mozilla-central/source/content/smil/nsSMILTimedElement.cpp#1969

If we can reliably detect when we have event listeners then we can skip these intermediate samples when we don't have listeners or syncbase dependencies and when we ourselves aren't depending on anyone else's timing (since we need to be up-to-date when receiving syncbase notifications). 

Note that this is less of a problem when we *seek* the model since intermediate events are not fired in that case and we skip entirely the intermediate samples that correspond to repeat times which can constitute a lot of intermediate samples in some cases (e.g. the fish.svg image that we suspect caused Markus to notice this hang). Furthermore, the actual firing of events itself seems to constitute a fair amount of the time taken in processing these samples (particularly on Mac it seems as per bug 606937).

In addition to, or instead of tweaking SMIL we can try to avoid these big jumps in time.

For example, we don't have this problem when the document goes into the bfcache since the document gets paused during that time.

When the computer sleeps I think the expected behaviour would also be to pause. From what I understand of mozilla::TimeStamp on some platforms we might get this behaviour for free (TimeStamp.h: "[elapsed time according to the system as used by TimeStamp] might not advance while the system is sleeping"), but not on all so perhaps we need to manually pause when the computer sleeps.

Also, I think the case of the user advancing their clock an hour for daylight savings might be ok since mozilla::TimeStamp apparently gives us, "current elapsed system time" but I need to confirm this.

That still leaves the scenario when an animation gets throttled really low (and here I'm thinking several seconds). Perhaps all we can do in that case is the previously mentioned tweaks to skip intermediate samples except when they are absolutely necessary. That depends on us being able to reliably detect when we have event listeners and be notified whenever this changes.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 15

•

14 years ago

Thanks for the info.

TimeStamp isn't really going to do what you want in terms of pausing while the computer is asleep.

I'm not really sure whether animations should pause while the computer is asleep though. We don't pause them when the tab is hidden or the window is minimized. Why should we treat those situations differently?

But maybe that's the easiest thing to do here. We could measure the amount of real time elapsed between each sample interval. If it's a lot more time than expected, we could subtract the extra from the animation clock instead of trying to "catch up".

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 16

•

14 years ago

Attached patch Patch v1a (obsolete) — Details — Splinter Review

(In reply to comment #15)
> TimeStamp isn't really going to do what you want in terms of pausing while the
> computer is asleep.

Ok, I misunderstood that comment. I think I understand now.

> But maybe that's the easiest thing to do here. We could measure the amount of
> real time elapsed between each sample interval. If it's a lot more time than
> expected, we could subtract the extra from the animation clock instead of
> trying to "catch up".

Ok, I've implemented that in this patch.

So far I haven't had any ideas for automated tests for this though.

Attachment #491054 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 17

•

14 years ago

Shouldn't the logic in nsSMILAnimationController::WillRefresh actually be in the refresh driver so that all animations stay in sync?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 18

•

14 years ago

Hmm, I guess not.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 19

•

14 years ago

Comment on attachment 491054 [details] [diff] [review]
Patch v1a

+  nsSMILTime                 mAvgTimeBetweenSamples;

This needs to be documented.

I think you should explain in comments why we do this for SMIL but not CSS transitions or scripted animations (hence allowing these to become desynchronized during sleeps). I think the reason is that those animations don't have to do a lot of work to "catch up". Furthermore it would be confusing for script authors to see major drift in animation time vs real time.

Attachment #491054 - Flags: review?(roc) → review+

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 20

•

14 years ago

Attached patch Patch v1b, r=roc — Details — Splinter Review

Address review feedback from comment 19.

Attachment #491054 - Attachment is obsolete: true

Attachment #491727 - Flags: review+

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Updated

•

14 years ago

OS: Mac OS X → All

Whiteboard: [ready to land]

Brian Birtles (:birtles, Feb 26~ parental leave)

Assignee

Comment 21

•

14 years ago

Pushed: http://hg.mozilla.org/mozilla-central/rev/cf61e6916004

Markus, if you get a chance in the next few days, would you mind confirming that this is fixed.

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Resolution: --- → FIXED

Whiteboard: [ready to land]

Target Milestone: --- → mozilla2.0b8

Shark profile of the first phase 14 years ago Markus Stange [:mstange] 798.52 KB, application/octet-stream		Details
Shark profile of the second phase 14 years ago Markus Stange [:mstange] 1.54 MB, application/octet-stream		Details
Patch v1a 14 years ago Brian Birtles (:birtles, Feb 26~ parental leave) 5.91 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Patch v1b, r=roc 14 years ago Brian Birtles (:birtles, Feb 26~ parental leave) 6.84 KB, patch	birtles : review+	Details \| Diff \| Splinter Review