Closed Bug 650379 Opened 9 years ago Closed 9 years ago

mozRequestAnimationFrame lags the UI


(Core :: DOM: Core & HTML, defect, P1)




Tracking Status
firefox5 + fixed
fennec 2.0next+ ---


(Reporter: azakai, Assigned: bzbarsky)



(Keywords: dev-doc-complete)


(1 file, 1 obsolete file)

The cool Runfield demo has a normal and a mobile version

Mobile version is ok, but the desktop version is unusable on Fennec, both on linux desktop and on device. CPU is maxed out and no rendering occurs, except 1 frame every 10 seconds perhaps.

Initial investigation shows that no graphics Update IPC messages are sent, which explains why there is no rendering. There are also a lot of events/timers/messages that do not appear in the mobile version, specifically


Two things to find out:

1. Refresh driver is being called. Why are no Update messages being sent?
2. Are all these audio events/messages needed, and do they cause the problem here?
Assignee: nobody → azakai
Audio is not the cause of this problem.

The issue is that for some reason Update messages are not sent.
Assignee: azakai → nobody
Component: General → Graphics
Product: Fennec → Core
QA Contact: general → thebes
1. We do reach PuppetWidget::Invalidate() properly, but we always decide there is a pending paint |mPaintTask.IsPending()| so we do nothing in that Invalidate. So lots of rendering is done to the canvas, but no Paint event is actually dispatched, so no Layers Update is sent.

2. Interestingly, it is possible to trigger this bug on the desktop version of the demo too. If I break with the debugger while running it, it enters the same state as the mobile one (that is, 100% CPU, canvas work is done, but no paint task, no Update, and no frames are drawn). A reload snaps it out of it.

So, some timing or ordering issue it appears. I suspect a Paint event gets queued on the main thread, but never actually Run().
Assignee: nobody → azakai
Yes, what is happening is that a PaintTask is dispatched to the main thread in PuppetWidget.cpp, but a very very very long time (seconds or dozens of seconds) passes on average before it actually gets Run(). This happens in the desktop demo always, and on the mobile one rarely. Which makes me suspect this can happen in other situations as well.

cjones, are you familiar with this code? If so, do you know why that might happen?
tracking-fennec: --- → ?
I don't know off-hand what might be the problem here.  Bug 622072 seems suspicious.  I'd recommend bisecting.
New discoveries:

The underlying issue is *not* a Fennec thing. It is present on desktop Firefox as well. The issue in both cases is that Runfield uses mozRequestAnimationFrame. A huge number of events are then queued (on Fennec I get a backlog of 1,000 events after a few seconds) leading to some events not being run in a reasonable time.

To see this on desktop Firefox, run Runfield (the normal desktop demo, link in first comment), and as it is running click on the menu bar. Move the mouse around the various options. The mouse hovering will mark menu options with a very significant delay. Sometimes clicking to move to another tab is also noticeably sluggish. In other words, the UI has become much less responsive.

On Fennec the same problem occurs, however, apparently due to how PuppetWidget does rendering - it queues a PaintTask - not only is the UI less responsive but painting happens much less frequently or not at all. (This can happen in both the desktop and mobile versions of the demo, but is less likely to happen in the mobile version, apparently since it does less work.)

Evidence that mozRequestAnimationFrame is involved is that the problem behavior can be entirely prevented by replacing



  setTimeout(function() {
  }, 0);

which I guess introduces the opportunity for other events to fire.

bz, cc'ing you since I see you are involved/assigned to all the mozRequestAnimationFrame bugs. Is this behavior to be expected with mozRequestAnimationFrame?
Assignee: azakai → nobody
Component: Graphics → DOM
QA Contact: thebes → general
Summary: Runfield desktop version unusable on Fennec anywhere → mozRequestAnimationFrame lags the UI
tracking-fennec: ? → 2.0next+
Good question.  

The behavior where using a 0 timeout works but using mozRequestAnimationFrame does not is NOT expected.  On the other hand, that's not quite what the code in comment 5 is doing.  Does just doing:

  setInterval(draw, 15)

after draw() is defined and removing that whole hunk from the end of draw() lag the UI the same way mozRequestAnimationFrame does in your case?  Note that because mozRequestAnimationFrame promises a heartbeat the code in the testcase there that uses setTimeout is totally NOT equivalent to the mozRequestAnimationFrame codepath. Also note that WebKit's handling of webkitRequestAnimationFrame at the end of the method is buggy last I checked in that it does not maintain the heartbeat...

If, as I assume it does, the setInterval case causes the same issues you see with mozRequestAnimationFrame, then those issues are expected, yes.  See bug 630127 comment 17.  This is why we need separate processes for tabs and the UI.  On the other hand, I don't understand why you see problems there on Fennec, unless the Fennec content process event loop prioritizes timer events over paint events.
(In reply to comment #6)
> The behavior where using a 0 timeout works but using mozRequestAnimationFrame
> does not is NOT expected.  On the other hand, that's not quite what the code in
> comment 5 is doing.  Does just doing:
>   setInterval(draw, 15)
> after draw() is defined and removing that whole hunk from the end of draw() lag
> the UI the same way mozRequestAnimationFrame does in your case?

Interesting, doing the setInterval way does *not* lag the UI. In other words, the UI lags with mozRequestAnimationFrame, but things work properly with a simple setInterval. If I understand correctly, this is unexpected?
> If I understand correctly, this is unexpected?


If you can reproduce this in desktop Firefox, can you please attach both the setInterval testcase that does not lag the UI and the mozRequestAnimationFrame that does for you?
Ok, here is the complete runfield demo, ready to be run either with mozRequestAnimationFrame or setInterval.

1. Get and unpack it
2. Run python -m SimpleHTTPServer 8000 in that directory
3. Open localhost:8000/demo.html and look for UI lag as described before (moving mouse around the menus, etc.)

Edit demo.html to switch between mozRequestAnimationFrame and setInterval, there are two code segments, marked by |XXX begin| and |XXX end|, one for each option, comment out one each time.
Ok, I think I have a general picture of what is going on here:

1. Using mozRequestAnimationFrame causes a *precise* timer to be used for nsRefreshDriver, as opposed to a slack one.

2. Timers work like this: The timer's time comes up. It dispatches an event. For precise timers, we schedule the next time for this timer when the time comes up (so we get precise timing). For slack timers, we schedule the next time for this timer when the dispatched event is actually run.

3. In the runfield demo, everything is very busy, 100% CPU, etc. As a result, **the nsRefreshDriver's precise timer floods the event queue, since it keeps adding events at the precise rate it should (1/60 sec or so), even when those events cannot be run quickly enough** - which is the case here. In runfield on my machine I can see twice as many nsRefreshDriver events added to the nsEventQueue as taken off of it. Seems like this would happen in any web page that uses mozRequestAnimationFrame but is so busy that the actual framerate is less than it should be.

4. The event queue is flooded by nsRefreshDriver events and consequently the Firefox UI becomes laggy. (On Fennec, the UI is not laggy since we are in a separate process, but the child process does not send draw updates since PuppetWidget drawing relies on a PaintEvent... which is in the blocked-up event queue. Different symptoms from Firefox but the same cause.)

cc'ing some more people that are relevant to this.

My inclination is to add a check for precise timers, to prevent them flooding the event queue. This seems like a fix for a bug in the timer code to me, but it is a significant behavior change, though, in that a 'backlog' of events will not be generated and then all run at once, which is what happens now (assuming we ever get the chance to run them...).

Alternatively, perhaps better to add a new type (PRECISE_NOFLOOD?) and use that in nsRefreshDriver instead of the current precise timer? Seems less chance of regressions, but to be honest continuing to use the current PRECISE timers (in anything) sounds risky.
Hmm.  So the timer code doesn't just account for the time the callback took (like setInterval) but will post a new timer event, or possibly more than one, before the callback even finishes.  I'd missed that...  So this bug was introduced by bug 630127.

I agree that this is not a desirable behavior for the "precise" mode of mozRequestAnimationFrame.  What we want there is to be able to skip frames altogether but to not add the time the callback took to our firing interval.

We could do this by changing refresh driver to use one-shot timers in this mode and reset them itself (which is what setInterval does) or by adding a new mode to nsITimer.  The latter is probably less work; the former less invasive.

Brendan, Chris, thoughts?  I think we need to either fix this in as safe a way as possible on Aurora ASAP or back out bug 630127 or just decide we'll ship Firefox 5 with this bug...
And Alon, thanks for figuring out what's going on here!
Assignee: nobody → bzbarsky
Priority: -- → P1
So as I see it, we in fact have 4 options for both trunk and Aurora; we may want to take different options on the two branches:

1)  Back out bug 630127.  This makes mozRequestAnimationFrame suck for web developers; retards its adoption.
2)  Do nothing.  Somewhat sucks for users, and also for web developers who will unintentionally DoS users.  Will likely also retard adoption.
3)  Add a new XPCOM timer type that does what we want.  This would be a pretty simple change; it could even be done without a UUID change for nsITimer, I think, since it just adds a new enum value for the timer type.
4)  Change repeating precise XPCOM timers to do what we want.  The only in-tree uses are the code under discussion and caret blink; the latter definitely wants the same behavior as the former.  But there are lots of extension uses (several hundred).  Most look safe (e.g. they're precise timers for some number of hours or minutes or days, so the callback would never realistically take longer than the timer interval), but a few are using 100ms timers, and some are utility methods that I'd need to find the callers of.  So there is some risk of (subtle, if it happens) extension breakage if this approach is taken.

Chris and I think option 4 is the way to go for trunk.  I'll try to figure out what drivers would prefer for Aurora.
Attached patch Implement option 4 (obsolete) — Splinter Review
Please see patch commit message for a description of what the patch is doing
Attachment #527710 - Flags: review?(jones.chris.g)
Attachment #527710 - Flags: superreview?(brendan)
Whiteboard: [need review]
We made a decision to track this for FF5 since it could be responsible for a number of problems. We are deferring the decision on what option to pick to dbaron and roc since they are the best qualified to make the decision.
Last time I asked roc he told me to pick something, for what it's worth.

I'll see whether I can get Chris to review this ASAP.
dbaron says he would be happier with a new timer type.  Plan is to call it REPEATING_PRECISE_CAN_SKIP.
Attachment #528721 - Flags: review?(jones.chris.g)
Attachment #527710 - Attachment is obsolete: true
Attachment #527710 - Flags: superreview?(brendan)
Attachment #527710 - Flags: review?(jones.chris.g)
Attachment #528721 - Flags: superreview?(brendan)
Comment on attachment 528721 [details] [diff] [review]
Option 3, per dbaron's request

Patch looks OK but leaving the current REPEATING_PRECISE behavior is pretty scary.  Is there a good way to let code know that creating REPEATING_PRECISE timers for intervals < O(minutes) is very bad?  Maybe log a message to stderr or send something to the JS console?  Could be a followup I guess.

We should probably post to m.d.extensions+platform about this.
Attachment #528721 - Flags: review?(jones.chris.g) → review+
I'm pretty happy to do something like that as a followup, sure.  Or just remove REPEATING_PRECISE altogether for Firefox 6 or 7....
(In reply to comment #20)
> I'm pretty happy to do something like that as a followup, sure.  Or just remove
> REPEATING_PRECISE altogether for Firefox 6 or 7....

I was fine with door #4, but yes: door #3 entails a followup bug to remove the footgun in due course, after notice is broadcast. Please file.

Comment on attachment 528721 [details] [diff] [review]
Option 3, per dbaron's request

Again, if we know of no uses of the current bad thing and we believe it's never better than _CAN_SKIP, I'd just change it and avoid complexity and potential for prolonged foot damage due to the footgun. But I'll sr+ this. Strictly simpler patch to fix PRECISE gets auto-sr+ from me.

Attachment #528721 - Flags: superreview?(brendan) → superreview+
Comment on attachment 528721 [details] [diff] [review]
Option 3, per dbaron's request

Nits only:

>   // If this is a repeating precise timer, we need to calculate the time for
>   // the next timer to fire before we make the callback.
>-  if (mType == TYPE_REPEATING_PRECISE) {
>+  if (IsRepeatingPrecisely()) {
>     SetDelayInternal(mDelay);
>-    if (gThread) {
>+    // But only re-arm REPEATING_PRECISE timers.

Nit: one blank line before comment not preceded by {.

>+  bool IsRepeating() const {
>+    return mType == TYPE_REPEATING_SLACK || IsRepeatingPrecisely();
>+  }
>+  bool IsRepeatingPrecisely() const {
>+    return mType == TYPE_REPEATING_PRECISE ||
>+  }

Maybe gcc and msvc do this, but range-testing based on TYPE_REPEATING order with respect to ONE_SHOT is more l33t (provided static asserts enforce the order).

> if we know of no uses of the current bad thing

We know there are a few hundred extension uses.  dbaron's worry was that some extensions may depend on receiving a certain number of callbacks over a certain time period, which is the one thing PRECISE guarantees that the other types don't.

> Nit: one blank line before comment not preceded by {.


> range-testing based on TYPE_REPEATING order with respect to ONE_SHOT is more
> l33t

OK; I guess as long as I do it in those inline methods the readability is not hurt too much...
Whiteboard: [need review] → [need landing]
Pushed and things are green.
Closed: 9 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Whiteboard: [need landing]
Target Milestone: --- → mozilla6
Comment on attachment 528721 [details] [diff] [review]
Option 3, per dbaron's request

Requesting approval for aurora.  This is a pretty safe fix.  Extension compat should be a non-issue unless extensions were passing random values that had no corresponding timer type to timer init.  So the only impact is on the core code being changed: the refresh driver and caret.  And we're pretty sure it's a change for the better.
Attachment #528721 - Flags: approval-mozilla-aurora?
I urge that we take this for Aurora, as I have noticed this bug happening in more places. For example, just running things made with CubicVR.js - the WebGL engine used in Mozilla demos like No Comply - can make the UI very unresponsive. On my home laptop, just running the CubicVR.js samples almost locks up the browser, so that it is very hard to leave the current page, sometimes even to the point that I need to force-quit the browser.
Attachment #528721 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.