Created attachment 758857 [details]
Testcase 1 - choke the browser
When consecutive requestAnimationFrame callbacks at the content take 20ms or more, the browser could hang for 2000ms.
STR (tested: Windows 7+DWM, Ubuntu 13.04 VM):
- Use a new profile, load testcase 1, close other tabs.
- Restart the browser.
* Testcase 1 animates an X using requestAnimationFrame, and on each iteration spins for 20ms.
- The browser is responsive shortly after it's visible.
- The browser is unresponsive for about 2-3 seconds, with no screen updates (after that - it responds quickly again and the screen updates frequently enough).
- To retrigger the bug:
- Linux: reload the page.
- Windows: restart the browser.
I'm not sure the following symptoms are of the same problem, but since they could be related, I added them:
Symptom #2: choking the browser (tested: win/linux):
STR: Same as above, but add #fork10 at the end of the URL.
- This creates 10 rAF callbacks at the content - all to the same function.
- The browser is reasonably responsive.
- The browser is very unresponsive, and stays that way as long as the page is at a foreground tab.
Symptom #3: hanging the browser (tested: win/linux):
STR: same as above, but use #fork0 at the end of the URL.
- This will add another rAF callback on each iteration - indefinitely.
- The browser is responsive.
- The browser hangs pretty quickly, at least apparently (didn't notice a crash).
Created attachment 758865 [details]
Testcase 1 : choke the browser
Oops, removed a debug alert box.
On the first STR: also configure Firefox to open with the tabs from the last session before restarting.
I think symptom 1 is a different bug than symptoms 2/3, for at least two reasons:
1. Symptom 1 can happen under normal conditions where a lot (>16ms) of legitimate processing takes place on each refresh driver iteration (even if for only few such consecutive iterations). However, symptoms 2/3 would mostly happen IMO either as an attack or by very specific kind of bad code, which is very likely to get fixed before it becomes public.
2. Symptom 1 should be curable easily IMO by making sure that non-rAF events are not starved, for instance by enforcing a minimal delay of X ms between refresh driver iterations (recall, the callback handler isn't evil, it just happens that ticking all the refresh drivers ends up in a bit more than the refresh interval - but still quickly enough to theoretically keep the browser fully responsive). However, symptoms 2/3 are a different case completely, where the content could impose any length of cumulative callback handlers duration, even if each of the handlers runs for very short time.
I believe the solution for symptom 1 is relatively simple, similar to what I suggested above.
I think the solution for symptoms 2/3 should be that the "script is taking to long, abort?" message should be invoked for overall duration of content callback handlers (per content), rather than only for a single too-long JS slice.
I'm making this bug for symptom 1 only (as the title suggests), and I'll file a new bug for symptoms 2/3 - which are both the same scenario, but at different severity levels.
FWIW, a preliminary test which I conducted where ScheduleNextTick() at the refresh driver timer is invoked _after_ ticking all the refresh drivers (and with an updated now value), instead of before ticking them, cures symptom 1 completely.
This is because a schedule of (e.g.) 5ms into the future where the following handlers ended up as (e.g.) 10ms, will be served very quickly since by the time the handlers finished, the next refresh iteration should have supposedly happened 5ms ago. This is fixed by scheduling the next tick after the handlers are done.
I'll try few variations of this, but I expect that a good solution would be relatively simple.
I Don't think this is critical -- there are plenty of things that can cause the browser to hang/become unresponsive, some even longer than this. Not sure I understand why it actually recovers, though.
iirc, all rAF callbacks should happen on the same trigger; the "fork" in the testcase should just cause each rAF callback to take longer (since you'll end up with N calls to the function), not N separate callbacks. If that's not the case that's a problem, I think..
Created attachment 760143 [details] [diff] [review]
Debug patch: fully functional
This patch schedules the next refresh driver iteration after ticking the drivers (therefore aware of the time it took to complete the ticks) instead of before. It's fully functional and hopefully correct, but needs cleanups (non cached prefs, duplicate code, debug prints, etc).
However, only scheduling after instead of before has the side effect of changing the behavior of Firefox into "full vsync*" mode, which means that the intervals will always be integer-multiples of the optimal-interval (e.g. 16.67ms/33.3/etc).
* "vsync" here means either actual HW signal or virtual signal (which is used when HW info isn't available).
While the original code aimed at this "full vsync" behavior, in practice it was behaving in "adaptive vsync" mode, which means that if it couldn't keep up at the optimal rate, then it'd iterate as fast as possible. However, it happened in a way which places the event for the next iteration deep within the queue by the time that ticking the drivers was finished (therefore the next iteration could be served before events which were queued later - such as paint, and this resulted in this bug).
I think that adaptive vsync is the preferred behavior (both logically and empirically), so I've added few lines to enable that mode as well, by default. This way we're not changing the timing behavior of Firefox, and end up with a good behavior regardless. But it can be disabled using a pref (live with this debug patch), and we can also change a pref to set the minimum timer delay to use when we're late (1ms by default, can be changed live with the debug patch).
I'm currently guessing that this patch could give a major boost to responsiveness on many cases where few consecutive iterations took more than 16.7ms, but which resulted in quite a bit longer visible hang. However, I don't think we currently have telemetry to measure this kind of responsiveness, so I hope to add that as well such that we could measure such improvements.
Created attachment 760213 [details] [diff] [review]
Debug2: fixed - only sched if required, minor logic improvement
Small fix. I think this patch implements my approach properly (but still needs cleanups).
I also confirmed that paint events are starved (tested on windows only) with the current code - I added prints at paint handlers and refresh-driver iterations. Before patching, on relevant animation cases, there are many ref-drv iterations without paints, while after patching, it's consistently alternating (unrelated, but FWIW, most of the paint durations I've observed are about 0.5-2ms on my system. Resizing the window could be up to 100ms per paint).
However, I'm hesitant to go forward with this approach since it seems the starvation stops after ~2 seconds. On windows - it seems to be fixed for the session, and on linux, until the page reloads.
So without understanding the explicit or implicit mechanism that stops starvation after 2s, and trying to fix it first, I think that committing this patch could be a cover-up for another bug, which is not fun.
I don't have time for this investigation right now, so I'll get back to this bug at a later stage.
Hmm, could this be related to bug 822096 (which is somehow a regression from bug 731974).
http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#241 and the FavorPerformanceHint calls that use it could be the source of the 2 seconds magic number you are seeing here.
(In reply to Timothy Nikkel (:tn) from comment #8)
> cpp#241 and the FavorPerformanceHint calls that use it could be the source
> of the 2 seconds magic number you are seeing here.
Thanks. This appears to be it, though I haven't verified it yet.
By looking at the code, it seems to be used while a document is loading - giving priority to "internal" events over OS events for performance - by design.
Assuming that's the case, IMO it's not a good design decision. Responsiveness should come before speedup in initial page rendering, especially if the potential result of the "speedup" is practically frozen browser for 2s.
In fact, I think browser responsiveness should always have higher priority than any other task, with no exceptions which I could think of.
> // Hint for native dispatch of events on how long to delay after
> // all documents have loaded in milliseconds before favoring normal
> // native event dispatch priorites over performance
> #define NS_EVENT_STARVATION_DELAY_HINT 2000
I can confirm that changing this constant to 10ms fixes the hang on both Windows and Linux.
With some help on IRC, bug 164931 seems the initial/enhancement of this system which I think didn't change much in the past 10 years.
I believe/guess it was born to serve an implementation which isn't here anymore: where every layout change would trigger a lot of processing, and therefore it could be useful to wait a while when loading documents before starting to process the layout.
These days layouts are aggregated and flushed not more than @60hz. Further more, AFAIK the layout are still flushed even when the OS events are starved, since the refresh driver still ticks. The thing which doesn't happen is actually paint those changes to screen (and handle all other user inputs).
So we're flushing - which is typically most of the work of the refresh driver, but not painting (handling paint events take typically 5-20% of the refresh cycle, though with high variance, possibly very high during document load).
The optimal approach to this "problem" would be to possibly delay content processing while the document is loading, but still handle all chrome layout changes and OS events.
But however I look at it, IMO responsiveness should _always_ come before everything else, including slightly longer duration until the page has finished rendering (instead of the browser frozen).
Any ideas how to proceed on with this?
I'd suggest at looking at dropping the starvation hint to 10ms (or something low). You'd want to get roc, dbaron, bz involved in helping make that decision.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #11)
> I'd suggest at looking at dropping the starvation hint to 10ms (or something
> low). You'd want to get roc, dbaron, bz involved in helping make that
roc already suggested on IRC to "rip it out" (this sends a chill down my spine with still active consequences from bug 590422 ;) ), and I tend to also prefer it over leaving this system practically no-op, but without fully understanding the implications of that.
Even though changing the constant to 10 fixes the issue, I also still think that the refresh driver scheduling could be improved, and so we should change it from before ticking the drivers, to after ticking them, such that the tick duration is (implicitly) taken into account.
- Should we change the constant to 10?
- Rip it out?
- Also change the ref-drv scheduling?
I will join roc in saying "rip it out"; I was afraid to go out on that limb, but now I will pile on!
If you change the scheduling, make sure you don't regress bug 630127.
As far as the performance hint stuff, ccing sicking too. It would be good to understand what it actually does now, for sure. Does it affect how often the parser flushes, for example?
Oh, and also, painting _will_ flush layout even aside from refresh ticks. But maybe we're close enough to only painting off the refresh driver that this is not an issue?
(In reply to Boris Zbarsky (:bz) from comment #14)
> If you change the scheduling, make sure you don't regress bug 630127.
Yes, comment 5 explains how it was taken care of. The sched modification had the side effect of changing the behavior to "full vsync" mode, so I added few lines to make sure the default behavior is kept as "adaptive vsync". I also explicitly tested this case before I posted the patch.
I think we are conflating two issues here: the 2 second thing on Windows, and the general problem (which appears to not clear up after 2 seconds on mac and linux). Let's split them up into separate bugs.
(In reply to Timothy Nikkel (:tn) from comment #17)
> I think we are conflating two issues here: the 2 second thing on Windows,
> and the general problem (which appears to not clear up after 2 seconds on
> mac and linux). Let's split them up into separate bugs.
Whoops, Avi tells me that the 2 second thing is the same on Linux and Windows. Nevermind.
One thing at a time. Try ripping this out. Definitely don't leave it at 10ms, that's just a confusing no-op.
Trying ripping out the NS_EVENT_STARVATION_DELAY_HINT stuff sounds good to me as well. We should be aware both what it does to the user experience and what it does to performance metrics.
As far as general improvements to refresh driver scheduling, I think at some point we should move towards making it schedule to line up with the actual refresh rate in the video hardware. That's well outside my area of expertise, though.
FWIW, I also tested on OSX 10.8 without OMTC, and noticed that Tab animation paints (present) are getting starved as well.
My test case was tab animation while layout.frame_rate=10000. The result was that intervals were ~30ms. After setting docshell.event_starvation_delay_hint=10 (bug 884955 added this pref for the starve duration), intervals were perfect at ~16.7ms.
The favor perf code in docshell doesn't seem to do any checks if the loading document that causes us to enter favor perf mode is actually in an active tab. So it appears as though any document loading anywhere, including background tabs, will cause us to enter favor perf mode for a time, making us less responsive in the current ab.
FWIW, the content sink code that also uses favor perf mode does actually check if it's in a foreground tab.
try server run with the favor perf stuff removed from docshell
and the compare talos link
I haven't looked at the results in detail, but it looks like there is some movement there.
(In reply to Boris Zbarsky (:bz) from comment #15)
> Oh, and also, painting _will_ flush layout even aside from refresh ticks.
> But maybe we're close enough to only painting off the refresh driver that
> this is not an issue?
Sorry, I'm not sure what you are asking me about?
Jonas, the questions for you are in comment 12, sorry.
Oh, I don't actually have any opinion on the various knobs that we have during page load.
At this point I don't even really recall which knobs we have as I've only ever been tangentially involved in them.
And at the time, basically all of the numbers were based on ancient data and so we had no confidence in the numbers, but were afraid to touch them since no one really understood them.
I even wrote a patch at some point to rip out the FavorPerformanceHint. I forget what prevented that patch from landing though.
As some evidence of how changing the FavorPerformanceHint duration affects perceived and actual performance (useful scenario but not a typical one):
Talos tscrollx is a new test which measures average iteration interval of scrolling different types of content in ASAP mode (it loads the page and starts scrolling onLoad).
On August 6th we landed a modified tscrollx which sets the pref docshell.event_starvation_delay_hint=1 (bug 884955 , defaults to 2000, as the code had it originally).
As a result we've seen major tscrollx "regression" of 30-100% across the board ( https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/DwDiY-Sbwo4 ). E.g. http://graphs.mozilla.org/graph.html#tests=[[287,94,31]]&sel=none&displayrange=30&datatype=running
If you run this test locally with and without the pref, the numbers indeed regress a lot when the hint is set to 1 (ms).
However, if you also look at the Firefox window during the test, you'd notice that without the pref it shows the content right after it loads, then a second or two later, it shows the content already scrolled to the bottom without ever seeing a single frame in between.
When setting the pref to 1, it scrolls and iterates quickly, as it should.
So even if our numbers show that we're doing much worse with the hint set to 1, actually, from a user point of view, we might be doing infinitely better.
Also, interestingly, the windows builds from comment 29 showed the biggest regression (132% on xp, 77% on win8, 60% on win7, 34% on Ubuntu x64/32, and <=10% on OS X).
Assuming that the regression is due to the change from "paint is not part of each iteration" to "it also paint on each iteration therefore iterates slower", If I were to deduce anything from this, it would be that paints on windows have a much higher weight within the "Rendering pipeline" WRT CPU usage compared to other platforms.
Adding dependencies since bug 822096 and bug 906811 fix subsets of this issue (favor perf can go wrong, and disable favored perf in background tabs, respectively).