Disable e10s if the user sees the big tab spinner too much

RESOLVED WONTFIX

Status

()

defect
P1
normal
RESOLVED WONTFIX
3 years ago
3 years ago

People

(Reporter: mconley, Assigned: Felipe)

Tracking

(Blocks 1 bug)

unspecified
Firefox 52
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+, firefox48 wontfix, firefox49+ wontfix, firefox50+ wontfix, firefox51 affected, firefox52 affected)

Details

User Story

Definition of a slow tab switch: FX_TAB_SWITCH_SPINNER_VISIBLE_LONG_MS where time time is greater than 2 seconds.

Definition of "last two days":
Look at all subsession pings where subsessionDate is today or yesterday.

Specification
* consider all slow tab switches in the last two days, sum the total tab switch time into T
* total the subsessionlength in the last two days into L (in hours)
* if T / L is greater than 20 seconds/hour, disable e10s

Attachments

(2 attachments)

What "too much" is is not yet determined.

What we know is that some users are seeing the big tab spinner when switching tabs for long periods of time. See bug 1300411, for example.

The current idea is to notice when we've crossed some threshold of "too many long spinners", and then flip a pref that disables e10s on the next run - much like we do when an a11y API is used. It disqualifies the user for some period of time - like 1 week - before re-enabling.

So we can use the a11y disabling mechanism as inspiration, but we still need to determine what "too much" is.
tracking-e10s: --- → ?
See Also: → 1300411
Hey blassey,

Do you have an opinion on what "too much" is? 5 spinners of at least 10 seconds in length in the same session? 2 spinners of at least 30 seconds in length in the same session? Some accumulation of total spinner time? I don't really know what we'd consider bad enough to turn e10s off.
Flags: needinfo?(blassey.bugs)
I think this question should be directed at product, so NI Jeff.

How is this for a strawman: Anything longer than 2s is considered a terrible spinner. If we total up the time spent in terrible spinners (or put another way, ignore spinners of less than 2s), can we turn off e10s for any user who has been seeing more than 20s of terrible spinners per hour for 3 days? This would capture the people who are seeing occasional 10+ second spinners but also the people who are seeing regular 2s spinners. The reason for looking for this over 3 days is to rule out the case where they were just encoding a video in final cut that day.

The second thing I'd like to do is then monitor these people after we've disabled e10s because if they're seeing terrible tab switching times without e10s as well, we might as well turn e10s back on.
Flags: needinfo?(blassey.bugs) → needinfo?(jgriffiths)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #2)
> I think this question should be directed at product, so NI Jeff.
> 
> How is this for a strawman: Anything longer than 2s is considered a terrible
> spinner. If we total up the time spent in terrible spinners (or put another
> way, ignore spinners of less than 2s), can we turn off e10s for any user who
> has been seeing more than 20s of terrible spinners per hour for 3 days? This
> would capture the people who are seeing occasional 10+ second spinners but
> also the people who are seeing regular 2s spinners. The reason for looking
> for this over 3 days is to rule out the case where they were just encoding a
> video in final cut that day.
> 
> The second thing I'd like to do is then monitor these people after we've
> disabled e10s because if they're seeing terrible tab switching times without
> e10s as well, we might as well turn e10s back on.

Do we know currently how often people see >2sec spinners? How many people would we disable in the first day or week? I'd hate to drain the cohort dramatically and I'm going to be honest, I see a LOT of 2sec+spinners. It would be also great to know why we're getting a given bad spinner, and then make those things less slow.

Otherwise this seems reasonable.
Flags: needinfo?(jgriffiths)
(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #3)
> (In reply to Brad Lassey [:blassey] (use needinfo?) from comment #2)
> > I think this question should be directed at product, so NI Jeff.
> > 
> > How is this for a strawman: Anything longer than 2s is considered a terrible
> > spinner. If we total up the time spent in terrible spinners (or put another
> > way, ignore spinners of less than 2s), can we turn off e10s for any user who
> > has been seeing more than 20s of terrible spinners per hour for 3 days? This
> > would capture the people who are seeing occasional 10+ second spinners but
> > also the people who are seeing regular 2s spinners. The reason for looking
> > for this over 3 days is to rule out the case where they were just encoding a
> > video in final cut that day.
> > 
> > The second thing I'd like to do is then monitor these people after we've
> > disabled e10s because if they're seeing terrible tab switching times without
> > e10s as well, we might as well turn e10s back on.
> 
> Do we know currently how often people see >2sec spinners? 
No, what we do know is that about 1% of tab switches see spinners of 1s or more and we have heard from 2 users who are regularly seeing 5s spinners. We need to get telemetry in to get a clearer picture here.
> How many people
> would we disable in the first day or week? 
Same as above, we're missing the telemetry to know what this will look like. 
> I'd hate to drain the cohort
> dramatically and I'm going to be honest, I see a LOT of 2sec+spinners. It
> would be also great to know why we're getting a given bad spinner, and then
> make those things less slow.
This is part of the goal here. If we can identify who is seeing terrible spinners hopefully we can slice the data to find that its all people who XXXX. Would you rather set the bar higher and have a 5s cut off?
> 
> Otherwise this seems reasonable.

Another option is to land this code but rather than disable e10s have it report back through telemetry  so we can get a sense of how many people we're talking about here.
Flags: needinfo?(jgriffiths)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #4)
...
> Another option is to land this code but rather than disable e10s have it
> report back through telemetry  so we can get a sense of how many people
> we're talking about here.

I think I'd prefer to collect data first as well. In particular I'd like to get a distribution graph of spinner times to see how long these spinners are actually shown.
Flags: needinfo?(jgriffiths)
Priority: -- → P2
I assume you assigned p2 for the time being because we do not currently know what the size of the problem is?
Flags: needinfo?(benjamin)
Well it's not P1 because we decided that we weren't going to block e10s rollout on this. But we decided to keep current attention so it's not P3/backlog. That leaves P2.

I'm not clear on who owns next steps, though. Jeff is that you to analyze data coming back from 48/49?
Flags: needinfo?(benjamin) → needinfo?(jgriffiths)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #7)
> Well it's not P1 because we decided that we weren't going to block e10s
> rollout on this. But we decided to keep current attention so it's not
> P3/backlog. That leaves P2.
> 
> I'm not clear on who owns next steps, though. Jeff is that you to analyze
> data coming back from 48/49?

That's correct.
Flags: needinfo?(jgriffiths)
Blocks: e10s-spinner
I have attempted to write a user story to turn "too much" into something Felipe can implement. Brad and Jeff, can you review/confirm this story?
Assignee: nobody → felipc
User Story: (updated)
Flags: needinfo?(jgriffiths)
Flags: needinfo?(blassey.bugs)
Per today's e10s cross-functional, we now consider this a P1 for 49 and will drive either a system-addon update or a 49.0.1 dot release.

We intend to land the "proper" fix into 51 and so this rule will only apply to 49+50.
Status: NEW → ASSIGNED
Priority: P2 → P1
I know there are various different ones, so what are the exact probe names to get N and T?
(Commenting on User Story)
> Definition of "too much":
> * count the total number of tab switches over the past 7 days == T
> * count the number of tab switches over the past 7 days == N
> * If N < 5, do nothing
> * If N / T > 0.3, then disable e10s

Was N supposed to be the number of slow tab switches? If so, I think we need to define a threshold.

Also, in comment 2 I made a proposal for the decision tree, is that what you're trying to describe?
Yes I missed comment 2. "per hour" may cause skew, but let's try it.

It turns out we can't implement mine because FX_TAB_SWITCH_TOTAL_E10S_MS isn't opt-out, so we can't know the total number of tab switches.
User Story: (updated)
Things that aren't specified: when should this check be performed? Once every N tab switches? Or once an hour, or just at every startup?

When e10s is disabled, what should the experience be? Silently disable e10s to take effect at next startup? Or show the user some more aggressive prompt (that would require l10n and sounds like a bad idea).
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> Yes I missed comment 2. "per hour" may cause skew, but let's try it.

Sounds good. Can we adjust this with system add-on updates?

...

(In reply to Benjamin Smedberg [:bsmedberg] from comment #14)
> Things that aren't specified: when should this check be performed? Once
> every N tab switches? Or once an hour, or just at every startup?

I don't think it makes sense to capture only on startup - if people experience the behaviour ( and we capture the data ) and it causes them to abandon Firefox, we won't get the data because they won't start Firefox again. I think I like once per hour or some other schedule because I want to see data from more users, or some other scheme that tries to capture the most amount of data recorded.

> When e10s is disabled, what should the experience be? Silently disable e10s
> to take effect at next startup? Or show the user some more aggressive prompt
> (that would require l10n and sounds like a bad idea).

We should just silently disable. Users seeing this are having a bad enough time as it is, let's not hassle them, is my thinking.
Flags: needinfo?(jgriffiths)
Data collection was not part of the specification. Jeff what data-based questions do you want to answer?

I assume that future system addons versions can change the algorithm, but Felipe should confirm. Felipe also confirm how this might show up in the telemetry environment cohorts.
Flags: needinfo?(felipc)
Yeah, we could have that algorithm be part of the system add-on, and we can put these users under a different cohort name.

(The cleanest would be to do this in nsAppRunner.cpp but it can be worked out through the system add-on so that we can update this on the fly).
Flags: needinfo?(felipc)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #16)
> Data collection was not part of the specification. Jeff what data-based
> questions do you want to answer?

I guess I want to know how many people we disable via this method, and it would be awesome if we could know a bit more about their systems - so would including the unified telemetry client id with the report help?
Another thing I'd like to do is monitor the people who get disabled to see if their experience is equally bad without e10s. Perhaps we should re-enable if disabling doesn't improve things.
Flags: needinfo?(blassey.bugs)
Which metric should we use for non-e10s tab switches? Is that already an opt-out metric?
Flags: needinfo?(mconley)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #20)
> Which metric should we use for non-e10s tab switches? Is that already an
> opt-out metric?

I believe FX_TAB_SWITCH_TOTAL_MS is the correct metric - for non-e10s, this should measure the point from us handling the click event to the next frame being painted.

Unfortunately, this probe is opt-in.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #21)
...
> Unfortunately, this probe is opt-in.

If this is the one piece of data we need in order to be able to understand this problem, let's make it opt-out and uplift it to where it will make a difference.
Flags: needinfo?(rweiss)
Flags: needinfo?(benjamin)
Felipe can roll that into the fix here. Although that means we'd have to do a dot-release and not just push a system addon update.
Flags: needinfo?(benjamin)
There's a dot release coming soon, so let's ride the probe opt-out change on it. I suggest doing that in a separate bug, because if this bug doesn't make it to the dot release it's fine (we can just update the system add-on), but not if it contains the probe change.
ok, will you file?
Depends on: 1304113
We don't have a super clear dot release driver yet. Tracking this for now so we can consider it.
Flags: needinfo?(rweiss)
Posting a rough first draft to get some feedback on it. I'm still unsure about whether the session is being used correctly. The archived pings don't seem to have a concept of subsessions, so I'm using sessionLength.
Attachment #8795581 - Flags: feedback?(gfritzsche)
Attachment #8795581 - Flags: feedback?(benjamin)
Attachment #8795581 - Flags: feedback?(chutten)
Comment on attachment 8795581 [details] [diff] [review]
detectLongSpinners, first draft

Review of attachment 8795581 [details] [diff] [review]:
-----------------------------------------------------------------

Grabbing old hgrams and pulling their information like this is a good plan. Just be careful you're getting the right type and reason of pings so that the fields you want are present, and you're not double-counting information.

::: browser/extensions/e10srollout/bootstrap.js
@@ +186,5 @@
>  }
> +
> +let performLongSpinnerCheck = Task.async(function*() {
> +  const DAYS_OLD = 4;
> +  let thresholdDate = new Date(Date.now() - (1000 * 60 * 60 * 24 * DAYS_OLD));

This might work better in case we hit a leapsecond day:

let thresholdDate = new Date();
thresholdDate.setDate(thresholdDate.getDate() - 4);

@@ +190,5 @@
> +  let thresholdDate = new Date(Date.now() - (1000 * 60 * 60 * 24 * DAYS_OLD));
> +
> +  let allPingsInfo = yield TelemetryArchive.promiseArchivedPingList();
> +
> +  let recentPingsInfo = allPingsInfo.filter(ping => {

Don't you also have to filter based on ping type/reason?

You'll likely want to look only at "main" pings.

@@ +219,5 @@
> +        if (spinnerTime >= 2000) {
> +          totalSpinnerTime += spinnerTime * histogram.values[spinnerTime];
> +        }
> +      }
> +    } catch (e) {}

definitely just continue; if !("FX_TAB_SWITCH_SPINNER_VISIBLE_LONG_MS" in ping.payload.histograms)

@@ +227,5 @@
> +
> +  print("totalSessionTime = " + totalSessionTime + "s");
> +  print("totalSpinnerTime = " + totalSpinnerTime + "s");
> +  print("threshold = " + totalSpinnerTime / totalSessionTime);
> +  const ACCEPTABLE_THRESHOLD = 20 / 3600; // 20 seconds per hour

Just an "out of thin air" number? If this is mentioned in the bug, reference the comment.
Attachment #8795581 - Flags: feedback?(chutten)
Attachment #8795581 - Flags: feedback?(gfritzsche)
Attachment #8795581 - Flags: feedback?(benjamin)
(In reply to Chris H-C :chutten from comment #29)
> > +
> > +  print("totalSessionTime = " + totalSessionTime + "s");
> > +  print("totalSpinnerTime = " + totalSpinnerTime + "s");
> > +  print("threshold = " + totalSpinnerTime / totalSessionTime);
> > +  const ACCEPTABLE_THRESHOLD = 20 / 3600; // 20 seconds per hour
> 
> Just an "out of thin air" number? If this is mentioned in the bug, reference
> the comment.

Yeah, this is specified in the user story here in the bug
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

https://reviewboard.mozilla.org/r/81746/#review80784

My issues below are really mostly cosmetic.

We should probably come up with a testing plan for QA to make sure this thing is working.

Great work, felipe!

::: browser/extensions/e10srollout/bootstrap.js:208
(Diff revision 2)
> +  let recentPingsInfo = allPingsInfo.filter(ping => {
> +    let pingDate = new Date(ping.timestampCreated);
> +    return pingDate > thresholdDate;
> +  });

Since we're already filtering pings here, maybe we should only return `true` if this is a main ping too with e10s enabled as well?

It looks like `getCurrentPingData` will only ever return main pings: http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/toolkit/components/telemetry/TelemetryController.jsm#924

Not sure if you want subsessions, but `getCurrentPingData` will default to returning subsession data. If we want that, we're good. If we don't want that, we should pass false as an argument.

::: browser/extensions/e10srollout/bootstrap.js:226
(Diff revision 2)
> +      if (ping.type != "main") {
> +        continue;
> +      }
> +
> +      if (!ping.environment.settings.e10sEnabled) {
> +        continue;
> +      }

See my note above about maybe doing the filtering all in one step.
Attachment #8795807 - Flags: review+
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

https://reviewboard.mozilla.org/r/81746/#review80806

::: browser/extensions/e10srollout/bootstrap.js:208
(Diff revision 2)
> +  let recentPingsInfo = allPingsInfo.filter(ping => {
> +    let pingDate = new Date(ping.timestampCreated);
> +    return pingDate > thresholdDate;
> +  });

the initial ping list doesn't contain the entire ping. It's only a ping "metadata" that only contains the id and the creation timestamp. That's why I can only filter by timestamp there, and by the other details after we have the entire ping through promiseArchivedPingById.


The getCurrentPingData note is a good catch. We do want only subsession data, and it looks like the default is false. I'll look into what is the right thing to do.
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

https://reviewboard.mozilla.org/r/81746/#review80784

> Since we're already filtering pings here, maybe we should only return `true` if this is a main ping too with e10s enabled as well?
> 
> It looks like `getCurrentPingData` will only ever return main pings: http://searchfox.org/mozilla-central/rev/572e74ee991bbfd812766b4524237eb77577a4b1/toolkit/components/telemetry/TelemetryController.jsm#924
> 
> Not sure if you want subsessions, but `getCurrentPingData` will default to returning subsession data. If we want that, we're good. If we don't want that, we should pass false as an argument.

fixed, getCurrentPingData should indeed receive true
After talking to Jeff on irc I made a change to also run this on nightly/aurora. I just moved the setUpSpinnerCheck() call to startup.  This means that the check (and the pref setting) happens on every user, but the actual blocking doesn't run on Nightly/Aurora because it bails early as "unsupportedChannel".

So we can then look at the telemetry environment where the pref "e10s.rollout.disabledByLongSpinners" is being recorded, and see the size of it on any channel.

Sent to try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8afc6bea76a9
(In reply to :Felipe Gomes (needinfo me!) from comment #37)
> 
> So we can then look at the telemetry environment where the pref
> "e10s.rollout.disabledByLongSpinners" is being recorded, and see the size of
> it on any channel.

Cool - very good idea.
Pushed by felipc@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3b3de5c48e28
Disable e10s if the user sees the big tab spinner too much. r=mconley
Please remember to get data review on any new data collection:
https://wiki.mozilla.org/Firefox/Data_Collection
Flags: needinfo?(felipc)
https://hg.mozilla.org/mozilla-central/rev/3b3de5c48e28
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

Approval Request Comment
[Feature/regressing bug #]: disable e10s for users who see the tab spinner for too long
[User impact if declined]: we might ship this to release as a system add-on update, but we first want to run it on aurora and beta to see what would be the impact on users
[Describe test coverage new/current, TreeHerder]: landed on central
[Risks and why]: limited to the e10srollout addon
[String/UUID change made/needed]: none
Attachment #8795807 - Flags: approval-mozilla-beta?
Attachment #8795807 - Flags: approval-mozilla-aurora?
(In reply to Georg Fritzsche [:gfritzsche] from comment #40)
> Please remember to get data review on any new data collection:
> https://wiki.mozilla.org/Firefox/Data_Collection

Whoops, that's a good call.

bsmedberg, can we please get a post-facto data-review sign-off? We're collecting the pref that says that e10s was disabled due to too much spinning.
Flags: needinfo?(benjamin)
(In reply to Georg Fritzsche [:gfritzsche] from comment #40)
> Please remember to get data review on any new data collection:
> https://wiki.mozilla.org/Firefox/Data_Collection

Ah sorry, I forgot! Thanks for noticing that!
Flags: needinfo?(felipc)
data-review+

If e10s is disabled by long spinners, does that also show up as a different e10s cohort?

Note: that change is not part of the system addon and therefore we couldn't deploy at least that part via system addon to 49 (without a 49.0.2)
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #45)
> data-review+
> 
> If e10s is disabled by long spinners, does that also show up as a different
> e10s cohort?

Yep, the cohort name will be temp-disabled-longspinner.

On Nightly/Aurora, where cohort names do not apply (and the long spinner won't disable e10s), we will still be able to see this data through this pref recorded on telemetry.
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

Tab spinning is a release49 block AFAICT, let's uplift this to Aurora51 and Beta50 so we have verification this approach works and eventually roll it out on the release channel.
Attachment #8795807 - Flags: approval-mozilla-beta?
Attachment #8795807 - Flags: approval-mozilla-beta+
Attachment #8795807 - Flags: approval-mozilla-aurora?
Attachment #8795807 - Flags: approval-mozilla-aurora+
Hi Felipe, mconley, What exactly are we looking for in telemetry data (from beta) or QA?
I am planning a 49.0.2 possibly for Wednesday. So we still have time to uplift this to release, if it's been tested and if the data seems useful. If it can wait till November then I'd prefer to let it ride with 50.
Flags: needinfo?(mconley)
Flags: needinfo?(felipc)
Hi. I prefer not to take this on 49.0.2. We're still looking into some warnings that this code might be expensive to run (as it reads archived telemetry data). It looks like it's a false alarm but we're still analyzing it. If necessary we can ship this as a system add-on update afterwards.
Flags: needinfo?(mconley)
Flags: needinfo?(felipc)
Comment on attachment 8795807 [details]
Bug 1301131 - Disable e10s if the user sees the big tab spinner too much.

https://reviewboard.mozilla.org/r/81746/#review85538

::: browser/extensions/e10srollout/bootstrap.js:208
(Diff revision 3)
> +  let recentPingsInfo = allPingsInfo.filter(ping => {
> +    let pingDate = new Date(ping.timestampCreated);
> +    return pingDate > thresholdDate;

You should filter out by ping type here already instead of after loading all of them.

This way you are e.g. loading lots of sync pings etc.

::: browser/extensions/e10srollout/bootstrap.js:216
(Diff revision 3)
> +  });
> +
> +  let pingList = [];
> +
> +  for (let pingInfo of recentPingsInfo) {
> +    pingList.push(yield TelemetryArchive.promiseArchivedPingById(pingInfo.id));

This looks like you are loading the full 3 days of history every hour.

Instead you should only need to load the 3 days history once on startup, then load any new pings from the last hour on future calls.

If there are a lot of main pings, this could lead to performance impact.
Maybe it would be less impact to load the pings in packets of N, with added delay between individual steps if needed.
I'm requesting approval to back this out from beta:

There's a potential performance problem that I want to fix, and we also didn't see this being effective on beta (no user got disabled, either because the threshold is too high or the code is not working correctly).

With this in mind, I don't want to let it ride to release in its current state. If we miss the train we can release this a system add-on update afterwards, once I have improved the code.
Flags: needinfo?(rkothari)
Hi Felipe, IIUC, this fix isn't working as expected and we want to remove the logic to disable e10s if the tab spinner goes on for longer than a preset threshold. If that is right, please go ahead with the backout from beta and we will ship 50.0b11 on Friday with this backed out.
Flags: needinfo?(rkothari)
Backed out from beta in order to investigate reliability before riding this code to release:
https://hg.mozilla.org/releases/mozilla-beta/rev/c45f9369c95cb492a2dee21e9c5cefde192f573c
50 is in RC mode, too late to fix it this cycle.
I can say, that I barely see the spinner in Aurora 51. On the 51 cycle begining I saw it very often, but then smth was fixed.
(In reply to Eugene Savitsky from comment #58)
> I can say, that I barely see the spinner in Aurora 51. On the 51 cycle
> begining I saw it very often, but then smth was fixed.

Could be a lot of things, but the most recent big perf win was bug 1308332. Do you have lots of addons?
So the original patch here was backed out pending more investigation. However, in the meantime, lots of improvements landed (like bug 1308332, bug 1279086, bug 1308039) that reduced the spinners to users. And there are reports (like comment 58) and data that there are fewer people affected by them.

The remaining users are probably helpful to be kept on e10s so we can continue to analyze the data and see what and if more improvements are necessary.

So I'm gonna call this bug wontfix for now. Happy to talk more about it if someone disagrees.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.