Closed Bug 1406414 Opened 7 years ago Closed 7 years ago

60fps tab throbbers may be impacting page load performance on some machines

Categories

(Firefox :: Theme, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 58
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 + wontfix
firefox58 --- fixed

People

(Reporter: mconley, Assigned: jaws)

References

Details

(Whiteboard: [Better STR in comment 9])

Attachments

(9 files)

This was originally filed as bug 1397092, but at just over 200 comments, that bug has gotten super unwieldy.

Here are the important bits that we've gathered so far:

1) User shellye noticed that when restoring 9 tabs simultaneously, time to first paint for those tabs had started to suffer in their Nightly.
2) Using mozregression, they found that the problem started when bug 1352119 landed, which introduced the 60fps SVG tab throbbers.
3) I tried to reproduce this on the hardware that we have in the Toronto office, but no luck, but jrmuizel found the machine that the user reported the issue on on Amazon and we ordered one.
4) We're able to reproduce the behaviour on the machine now.

This is the machine that we ordered:

https://www.amazon.com/Lenovo-59441913-15-6-Inch-E1-6010-Professional/dp/B00TB8L0B0/ref=sr_1_13?s=electronics&ie=UTF8&qid=1507303418&sr=1-13&keywords=lenovo+b50

Although it has Windows 10 64-bit installed.


STR:
1) Import the attached bookmarks collection
2) Close all tabs except for about:home
3) From the bookmarks menu, find the Test folder, and hover your mouse over "Open all in tabs"
4) Start a stopwatch
5) Click "Open all in tabs"
6) Wait for all tabs to open, and then choose one (I usually choose the Softpedia one) and click on it.
7) Wait for the first meaningful paint on that Softpedia tab, and then stop the stopwatch.

ER:

A roughly similar length of time passing with or without the 60 fps throbbers.

AR:

When I disable the throbbers, the time to reach that paint is quite a bit longer.
[Tracking Requested - why for this release]:

This bug is taking over for bug 1397092 which was blocking 57, so this should be blocking 57.
See Also: → 1404042
Attachment #8916021 - Attachment mime type: text/plain → text/html
Profile with 60 fps SVG tab throbbers: https://perfht.ml/2xneXkf

Profile with tab throbbers fully disabled (via Browser Toolbox hack): https://perfht.ml/2xmZuki
STR:

1) Import the attached bookmarks collection
2) Close all tabs except for about:home
3) From the bookmarks menu, find the Test folder, and hover your mouse over "Open all in tabs"
4) Start a stopwatch
5) Click "Open all in tabs"
6) Wait for all tabs to finish loading (so their throbbers all stop)
7) Stop the stopwatch.

ER:

A roughly similar length of time passing with or without the 60 fps throbbers.

AR:

With the 60fps throbbers, the time to finish loading is quite a bit longer.
Whiteboard: [Better STR in comment 6] → [Better STR in comment 9]
Update:

On the Mozilla Toronto office wireless network, running the STR with the 60 fps SVG throbbers, completes at around 68s. Without the SVG throbbers animating (if I use userChrome.css to override the tab-throbber[busy]::before animation rule), it completes around 58s.

So we're talking about a 10s difference here on this hardware to load 9 tabs simultaneously.

We tried using Concurrency Visualizer collection tools (https://www.microsoft.com/en-in/download/details.aspx?id=49103) to collect both process and system-wide profiles, but the symbolication seems really messed up and non-sensical. We're trying to get VS2015 installed on the machine to run Concurrency Visualizer directly.

The thread activity in CV seems to be truthful though, and what _appears_ to be happening is that we're overloading the CPU with too much work. Just too much is going on.

One thing I want to investigate is whether or not the 60fps throbbers are causing us to wake up the content process main threads more often, and how that might be impacting things. This is similar to what we addressed (and supposedly fixed) in bug 1354080.
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews from comment #10)


> The thread activity in CV seems to be truthful though, and what _appears_ to
> be happening is that we're overloading the CPU with too much work. Just too
> much is going on.
>

No wonder disabling e10s removes the tab switch and typing lags although the Cpu use is still high.
More data:

When we run the SVG animation at the compositor at, say, 30fps (which didn't help, according to the original reporter), we do composite at a lower rate.

However, even at that lower frame rate, the vsync thread alerts the main thread and (I believe) the compositor thread to let them know that vsync has occurred. These vsync alerts are only observed by these threads when an animation is occurring.

For APNG's, these aren't animations that depend on vsync. That means that, assuming they're the only animation going on, they won't cause vsync to alert the other threads. APNG's encode how long to wait until the next frame as part of the file format, so instead of having a vsync / compositor driven wake-up to drive the animation, we have a timer go off when the APNG says that the next frame should have been painted.

So we think the vsync stuff is what's waking up the CPU and taking cycles. The CPU in this hardware is very much optimized for power usage, so this waking up every ~16ms to basically do nothing is still introducing enough busy work that it's delaying (compounded over time) the work that's occurring on other threads.

I agree with ehsan's assessment in bug 1397092 comment 131 - the "right solution" here is being better at scheduling threads so that the least important ones get out of the way of the more important ones.

Having spoken with jrmuizel, it doesn't sound like there will be a cliff here where suddenly there's a processor speed where things get horrible - there's just a relationship where the 60fps tab throbbers cause slower hardware to perform worse - and the slower the hardware, the more pronounced this effect is.

I'm not yet convinced that this is something that needs to block 57 - but I want to see if I can get a sense of how this effect becomes pronounced as the hardware becomes weaker. If there's a point where we can say, "the 60fps animation causes unacceptable page load delays for processors with clock speed X and lower", then we might be able to condition the 60fps throbbers on that.

ni?ing myself for a sense of how this effect presents as hardware gets weaker.
Flags: needinfo?(mconley)
Relman is looking at a cutoff today for any non-critical work for Fx57 before Oct 12 Go to build for Beta 8.  To avoid any risk for the 57 Beta then this bugs need to NOT have this bug block 57.
Apologies on my last comment. This bug would have the time during this current 57 beta cycle to allow continued analysis and investigation.  Therefore, this bug doesn't fall under the Oct 12 cut off for all Photon non-critical work.
hi guys,
@mike @jeff loved the mozilla air video but shame the mic issue occurred,
can you guys post more sessions of this bug?

@dao @jaws

seems something are not rightly understood so try to clear them.

1> _This_ happens on most machines not just the one you guys brought,
That one easily reproduces the issue compared to other later ones

_(tested on AMD A10-7860K)_
12 Compute Cores: 4 CPU Cores + 8 Graphics Cores
4 GHz CPU Boost Frequency, 3.6 GHz Base Frequency
AMD Radeon R7 Graphics with 757 MHz GPU Frequency 

Can easily see this dip in this and most machines using the new indicator(so the problem is more widespread but less visible as most users don't care for the 10s but it's definitely adding a large delay)

2>Is there a difference in implementation in 32bit & 64bit?
32bit this problem is more visible.
,can you try on 32bit to see?
What were the old indicators? APNG or SVG?

3>Is the performance dip acceptable when upgrading to FF57 on same machine?
The perceived performance feels even worse.
Alternatively can older animation of indicators be added to cosmetic animations preference(not busyness which add overhead to CPU again) so users facing the problems can revert?

4>The experiment you guys suggested with changing the priority of the GPU process does not help,
so aging SVG indicators overloading the CPU but most of the UI/Typing lag goes away with e10s disabled.

5>New budget limiting/throttling/scheduling in FF58 help a bit but nowhere near FF56 in terms of performance.
Double the bookmarks and the 10s delay turns into more than 120sec delay.
https://bugzilla.mozilla.org/show_bug.cgi?id=1397092#c221

The Fallback to APNG fixes the problem somewhat.
Can it be fixed and added to FF57? so that users don't get degraded performance.


Then you guys can look and sort-out the SVG issues here.
Mike, Any further update or you are still investigating.
Flags: needinfo?(mconley)
Flags: needinfo?(mconley)
Assignee: nobody → jaws
Status: NEW → ASSIGNED
since there will be a very long delay between FF57 & FF58
can you guys uplift this so there is no fallback in performance?

@mike did you find the cause yet?
See Also: → 1407462
Comment on attachment 8919842 [details]
Bug 1406414 - Switch to an APNG loading indicator when the browser is under schedule pressure.

https://reviewboard.mozilla.org/r/190786/#review197814

Sorry for the delay. I've thought about it. Let's try this. Thanks for your patience, jaws!

::: browser/base/content/tabbrowser.xml:768
(Diff revision 1)
> +                    SchedulePressure.startMonitoring(window, {
> +                      highPressureFn() {
> +                        // Only switch back to the SVG loading indicator after getting
> +                        // three consecutive low pressure callbacks. Used to prevent
> +                        // switching quickly between the SVG and APNG loading indicators.
> +                        gBrowser.tabContainer._schedulePressureCount = 3;

Nit: It might be best to make this magic number a field on the tabbrowser binding.

::: browser/modules/SchedulePressure.jsm:71
(Diff revision 1)
> +   *           A function that will be called when the idle callback
> +   *           gets called within the time specified by TIMEOUT_AMOUNT.
> +   *           Returning an object with property of `continueMonitoring` set
> +   *           to `false` will prevent further monitoring.
> +   */
> +  startMonitoring(window, {highPressureFn, lowPressureFn}) {

Is it possible to add a killswitch pref for this, just in case we discover down the road that this ended up causing more harm than good?

::: toolkit/components/telemetry/Histograms.json:6139
(Diff revision 1)
> +  "FX_SCHEDULE_PRESSURE_IDLE_SAMPLE_MS": {
> +    "record_in_processes": ["main"],
> +    "expires_in_version": "61",
> +    "kind": "exponential",
> +    "high": 10000,
> +    "n_buckets": 20,
> +    "bug_numbers": [1397092],
> +    "alert_emails": ["jaws@mozilla.com"],
> +    "description": "Firefox: Time taken to get an idle callback while loading a page (ms). Loading of about: pages is not counted."

You're going to need a data review peer for this.
Attachment #8919842 - Flags: review?(mconley) → review+
Flags: needinfo?(mconley)
Comment on attachment 8919842 [details]
Bug 1406414 - Switch to an APNG loading indicator when the browser is under schedule pressure.

Requesting data-review for new Telemetry probe.
Attachment #8919842 - Flags: feedback?(francois)
Comment on attachment 8919842 [details]
Bug 1406414 - Switch to an APNG loading indicator when the browser is under schedule pressure.

https://reviewboard.mozilla.org/r/190786/#review198752

This is Category 1 data.

datareview+

::: toolkit/components/telemetry/Histograms.json:6173
(Diff revision 2)
> +    "record_in_processes": ["main"],
> +    "expires_in_version": "61",
> +    "kind": "exponential",
> +    "high": 10000,
> +    "n_buckets": 20,
> +    "bug_numbers": [1397092],

Should that bug number be 1406414 instead?
Attachment #8919842 - Flags: review+
Attachment #8919842 - Flags: feedback?(francois)
Pushed by jwein@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/53f5585bb94d
Switch to an APNG loading indicator when the browser is under schedule pressure. r=francois,mconley
https://hg.mozilla.org/mozilla-central/rev/53f5585bb94d
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 58
Depends on: 1412809
Is there any timeline on when this will be fixed(the main bug not the APNG loaders)

This problem with SVG is pretty severe and the fallback throbbers still don't bring the performance to that level.

or need to switch to opera/vivaldi or FF56?
They are not as taxing 
https://www.reddit.com/r/firefox/comments/79wowc/webrender_is_really_usable_now_and_i_think_it/dp64t3b/
Flags: needinfo?(mconley)
The root issue with the SVG performance should be addressed by bug 1366353 and bug 1399962 (potentially also bug 1401268).

There doesn't appear to be a set timeline for fixing those, and extra comments in those bugs aren't likely to increase the priority of them. The patch that landed with this bug introduced a telemetry probe that we can use to see how many users are running in to this situation. If we see that many users are hitting this bug then that may increase the priority of those other bugs.
Flags: needinfo?(mconley)
Hi Jared, since this is tracked as a 57 blocker, what are your thoughts on the risk associated with landing this in 57 RC1? The other options are 1) if it is high risk, let it ride the 58 train. or 2) let this bake in central for a while and include it in a 57.0.x dot release in 2-3 weeks time frame. Please let me know. Thanks!
Flags: needinfo?(jaws)
I think we should let this ride the 58 train. If we start to hear higher reports from users in 57 release then we could re-evaluate but as of now we aren't sure how many users are hitting this case and the fallback experience isn't exactly great either (though it does allow more interaction with the browser).
Flags: needinfo?(jaws)
Taking this off the blocker list per comment 30.  We'll keep tracking and re-evaluate in a couple of weeks.
I pulled up the Telemetry data for this feature. This feature kicks in if idle timeouts take longer than 1 second to resolve.

Idle callback times have been consistent since this landed (Oct 27 to Nov 11) around 15-16ms. Results:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=median&cumulative=0&end_date=null&keys=&max_channel_version=nightly%252F58&measure=FX_SCHEDULE_PRESSURE_IDLE_SAMPLE_MS&min_channel_version=nightly%252F55&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0

The distribution of idle callbacks has:
5th Percentile     6.57ms
25th Percentile    11.34ms
Median             15.89ms
75th Percentile    21.11ms
95th Percentile    57.73ms

The data shows that roughly 0.32% of users have idle callbacks that are longer than 1 second. Results: 
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-11-12&keys=__none__!__none__!__none__&max_channel_version=nightly%252F58&measure=FX_SCHEDULE_PRESSURE_IDLE_SAMPLE_MS&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-10-27&table=0&trim=1&use_submission_date=0

I suggest we stop tracking this for 57. Julien?
Flags: needinfo?(jcristau)
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #32)
> I pulled up the Telemetry data for this feature. This feature kicks in if
> idle timeouts take longer than 1 second to resolve.
> 
> Idle callback times have been consistent since this landed (Oct 27 to Nov
> 11) around 15-16ms. Results:
> https://telemetry.mozilla.org/new-pipeline/evo.html#!
> aggregates=median&cumulative=0&end_date=null&keys=&max_channel_version=nightl
> y%252F58&measure=FX_SCHEDULE_PRESSURE_IDLE_SAMPLE_MS&min_channel_version=nigh
> tly%252F55&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&sta
> rt_date=null&trim=1&use_submission_date=0
> 
> The distribution of idle callbacks has:
> 5th Percentile     6.57ms
> 25th Percentile    11.34ms
> Median             15.89ms
> 75th Percentile    21.11ms
> 95th Percentile    57.73ms
> 
> The data shows that roughly 0.32% of users have idle callbacks that are
> longer than 1 second. Results: 
> https://telemetry.mozilla.org/new-pipeline/dist.html#!
> cumulative=0&end_date=2017-11-12&keys=__none__!__none__!
> __none__&max_channel_version=nightly%252F58&measure=FX_SCHEDULE_PRESSURE_IDLE
> _SAMPLE_MS&min_channel_version=null&processType=*&product=Firefox&sanitize=1&
> sort_keys=submissions&start_date=2017-10-
> 27&table=0&trim=1&use_submission_date=0
> 
> I suggest we stop tracking this for 57. Julien?

Sorry but really confused so could you clear it up please?

The patch has landed in 58 only so the Telemetry is for 58 nightly users or beta users and if so isn't it just a very low userbase to begin with ?
and why 1 second and not something lower? which would show how often more clearly as if not mistaken the time taken to switch is 1 second+1 second=2 seconds

FF57 is gradually being rolled out so the results are not conclusive + the patch is not in FF57 so isn't the data not accurate ?
32bit firefox shows regressed performance BTW
The time to switch (2 seconds) is not relevant for this telemetry probe. This probe takes the measurement values outside of the actual determination to change to PNG or SVG throbbers.

Yes it is true that we don't have a lot of data yet. As to the question of if we should hurry up and ship this change in 57, right now the data says no. When we get more data we can look at this again.

Can you clarify what you mean by "32bit firefox shows regressed performance"?
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #35)
> The time to switch (2 seconds) is not relevant for this telemetry probe.
> This probe takes the measurement values outside of the actual determination
> to change to PNG or SVG throbbers.
> 
> Yes it is true that we don't have a lot of data yet. As to the question of
> if we should hurry up and ship this change in 57, right now the data says
> no. When we get more data we can look at this again.
> 
> Can you clarify what you mean by "32bit firefox shows regressed performance"?

compared to FF55 FF57 is regressed.

You are the expert but missing one key point,
8seconds was total delay when opening bunch of pages, single pages loaded fine but bunch caused the issues.
Here too the time taken to render is still high, so the 8secs came from the tiny bits of delay in each page thanks to the 60fps throbbers and that is still the case.
(In reply to shellye from comment #36)
> (In reply to Jared Wein [:jaws] (please needinfo? me) from comment #35)
> > The time to switch (2 seconds) is not relevant for this telemetry probe.
> > This probe takes the measurement values outside of the actual determination
> > to change to PNG or SVG throbbers.
> > 
> > Yes it is true that we don't have a lot of data yet. As to the question of
> > if we should hurry up and ship this change in 57, right now the data says
> > no. When we get more data we can look at this again.
> > 
> > Can you clarify what you mean by "32bit firefox shows regressed performance"?
> 
> compared to FF55 FF57 is regressed.

I see, yes I'm well aware of this specific issue.
 
> You are the expert but missing one key point,
> 8seconds was total delay when opening bunch of pages, single pages loaded
> fine but bunch caused the issues.
> Here too the time taken to render is still high, so the 8secs came from the
> tiny bits of delay in each page thanks to the 60fps throbbers and that is
> still the case.

Understood, but we have to balance how often will users be opening a bunch of pages at the same versus us shipping code straight to release that has not had much time to get testing/feedback from users on Beta. That is the situation we are in now and we will have to sit tight and wait to get more data.
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #37)
> (In reply to shellye from comment #36)
> > (In reply to Jared Wein [:jaws] (please needinfo? me) from comment #35)
> > > The time to switch (2 seconds) is not relevant for this telemetry probe.
> > > This probe takes the measurement values outside of the actual determination
> > > to change to PNG or SVG throbbers.
> > > 
> > > Yes it is true that we don't have a lot of data yet. As to the question of
> > > if we should hurry up and ship this change in 57, right now the data says
> > > no. When we get more data we can look at this again.
> > > 
> > > Can you clarify what you mean by "32bit firefox shows regressed performance"?
> > 
> > compared to FF55 FF57 is regressed.
> 
> I see, yes I'm well aware of this specific issue.
>  
> > You are the expert but missing one key point,
> > 8seconds was total delay when opening bunch of pages, single pages loaded
> > fine but bunch caused the issues.
> > Here too the time taken to render is still high, so the 8secs came from the
> > tiny bits of delay in each page thanks to the 60fps throbbers and that is
> > still the case.
> 
> Understood, but we have to balance how often will users be opening a bunch
> of pages at the same versus us shipping code straight to release that has
> not had much time to get testing/feedback from users on Beta. That is the
> situation we are in now and we will have to sit tight and wait to get more
> data.

Understood .
please keep in mind 
one page=no issues , 2 pages= issues
The patch in beta 58 *does* not fix the problem so will the data really help?
the time taken to kick in fallback method is too large for it to have any effect,
the mean is higher per page compared to older versions which adds up to the 8 seconds.
This doesn't seem to have become a widespread issue post release, so calling it wontfix for 57.
Flags: needinfo?(jcristau)
See Also: → 1433735
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #37)
Any progress? The difference between esr52 and FF58 is high and with next esr coming soon this is getting worrisome
No change since comment 32. I did file and investigate bug 1436423 but we ended up concluding that we shouldn't change the settings.

Is there a specific value for browser.schedulePressure.timeoutMs in about:config that works better for you?

(Also, please flag me for "needinfo" so I will make sure to see your question. You only need to comment in one bug, multiple bugs is just spamming and doesn't help.)
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #41)
> No change since comment 32. I did file and investigate bug 1436423 but we
> ended up concluding that we shouldn't change the settings.
> 
> Is there a specific value for browser.schedulePressure.timeoutMs in
> about:config that works better for you?

changing the pref to 100 makes it smoother but not sure what it actually does so not tested much.

needinfo to ask about correct value
Flags: needinfo?(jaws)
The value used here is the length of delay (in milliseconds) that Firefox will allow before disabling the SVG loading throbber.

shellye, can you help me out by going to "about:telemetry" in Firefox, clicking on the "Histograms" button on the left side, then searching for "SCHEDULE" in the search box near the top right of the webpage? This histogram may tell you a better value to use. If you could take a screenshot of your histogram and share it, it will help us get a better idea of what users with your setup are experiencing and possibly change the value to help all users.
Flags: needinfo?(jaws) → needinfo?(shellye5)
Attached image daily use profile
Flags: needinfo?(shellye5) → needinfo?(jaws)
Attached image FF60 fresh.jpg
Thanks shellye, the screenshots uploaded do show a much longer tail on Fx60 than on Fx59. Looking at your 59vs60 screenshots, I would say that a value of around 284 would work good for you.

@mconley, what do you think about lowering the limit from 1000ms to 300ms (284ms rounded up)? We could re-open bug 1436423 and use that to drop it to 300ms. It's worth noting that 1000ms was chosen without any specific experiments.
Flags: needinfo?(jaws)
For comment 48 ^
Flags: needinfo?(mconley)
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #49)
> For comment 48 ^

Hi 
what does browser.schedulePressure.defaultCount do?

like OP was having problems and your recommendation seems to work well with 250

browser.schedulePressure.defaultCount;2 set  or even 1 causes spinners to freak out with
browser.schedulePressure.timeoutMs = 250 ms
Flags: needinfo?(jaws)
defaultCount tracks how many consecutive schedule pressure misses do we need before we fall back to using the APNG throbbers. Using a value less than three will cause us to switch back and forth very quickly and will (as you noted) cause the spinners to freak out.
Flags: needinfo?(jaws)
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #51)
> defaultCount tracks how many consecutive schedule pressure misses do we need
> before we fall back to using the APNG throbbers. Using a value less than
> three will cause us to switch back and forth very quickly and will (as you
> noted) cause the spinners to freak out.

Thanks switched it to default
and browser.schedulePressure.timeoutMs = 250 ms
Clearing needinfo - we dropped the schedule pressure threshold to 300ms in bug 1436423.
Flags: needinfo?(mconley)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: