Closed Bug 1338522 Opened 3 years ago Closed 3 years ago

Add telemetry for UI responsiveness during migration of profile data from another browser

Categories

(Firefox :: Migration, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: dao, Assigned: dao)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 5 obsolete files)

No description provided.
Attached patch patch (obsolete) — Splinter Review
Some notes:

- I've tested ResponsivenessMonitor.jsm, but the code in AutoMigrate.jsm is entirely untested at the moment, so only asking for feedback.

- This doesn't break down by bookmark/history/password import since AutoMigrate.jsm doesn't seem to know enough to make that distinction. Could probably move this to MigrationUtils.jsm instead, although perhaps we can just correlate the data with that from the FX_MIGRATION_*_IMPORT_MS probes?

- performance.now doesn't seem to work in JSMs, so I just used Date.now. Shouldn't matter for our purpose.
Attachment #8836009 - Flags: feedback?(gijskruitbosch+bugs)
Comment on attachment 8836009 [details] [diff] [review]
patch

Review of attachment 8836009 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know that this makes sense. It's basically a busy loop every 16ms. It'll make performance worse. I don't know how the add-on responsiveness stuff works, but maybe that can help? Yoric would know more.

::: toolkit/modules/ResponsivenessMonitor.jsm
@@ +31,5 @@
> +      return;
> +    }
> +
> +    let now = Date.now();
> +    this._accumulatedDelay += (now - this._prevTimestamp) - this.intervalMS;

we should ensure we don't decrease the delay if the timer fires early by our measurement.
Attachment #8836009 - Flags: feedback?(gijskruitbosch+bugs)
(In reply to :Gijs from comment #2)
> I don't know that this makes sense. It's basically a busy loop every 16ms.
> It'll make performance worse.

It's not a loop and not really busy since the callback doesn't do anything remotely expensive. I don't see how this would make performance worse in a meaningful way, but I can change the interval to 100ms to ease that concern.

> we should ensure we don't decrease the delay if the timer fires early by our
> measurement.

We're interested in multi-second hangs here, so a timer firing a few milliseconds early wouldn't matter in the grand scheme of things, but I can handle that case for sanity.
Attached patch patch v2 (obsolete) — Splinter Review
Attachment #8836009 - Attachment is obsolete: true
(In reply to Dão Gottwald [:dao] from comment #1)
> - This doesn't break down by bookmark/history/password import since
> AutoMigrate.jsm doesn't seem to know enough to make that distinction. Could
> probably move this to MigrationUtils.jsm instead, although perhaps we can
> just correlate the data with that from the FX_MIGRATION_*_IMPORT_MS probes?

Do you have thoughts on this?
Flags: needinfo?(gijskruitbosch+bugs)
Attached patch patch v3 (obsolete) — Splinter Review
(In reply to Dão Gottwald [:dao] from comment #3)
> (In reply to :Gijs from comment #2)
> > I don't know that this makes sense. It's basically a busy loop every 16ms.
> > It'll make performance worse.
> 
> [...] I can change the interval to 100ms to ease that concern.

In fact a one second interval should still get us the data we need, given the severity of the problem that we're trying to identify.
Attachment #8836343 - Attachment is obsolete: true
(In reply to Dão Gottwald [:dao] from comment #6)
> Created attachment 8836371 [details] [diff] [review]
> patch v3
> 
> (In reply to Dão Gottwald [:dao] from comment #3)
> > (In reply to :Gijs from comment #2)
> > > I don't know that this makes sense. It's basically a busy loop every 16ms.
> > > It'll make performance worse.
> > 
> > [...] I can change the interval to 100ms to ease that concern.
> 
> In fact a one second interval should still get us the data we need, given
> the severity of the problem that we're trying to identify.

I'm not sure that's true... we don't know anything about this hanging since we can't reproduce it, but the videos don't show the typical "not responding" stuff from Windows (with a native title-bar being drawn despite our asking Windows not to - or has Win10 stopped doing that?) so I'm not sure exactly what's happening. Really, to be sure whether whatever reporting we add is effective we'd need to be able to reproduce, or just land it and then know based on whether we get results back that show non-zero-hanginess.

(In reply to Dão Gottwald [:dao] from comment #5)
> (In reply to Dão Gottwald [:dao] from comment #1)
> > - This doesn't break down by bookmark/history/password import since
> > AutoMigrate.jsm doesn't seem to know enough to make that distinction. Could
> > probably move this to MigrationUtils.jsm instead, although perhaps we can
> > just correlate the data with that from the FX_MIGRATION_*_IMPORT_MS probes?
> 
> Do you have thoughts on this?

Correlating telemetry after collection is almost impossible, IME. I would add 3 probes similar to the _IMPORT_MS ones and collect hangs for each item. This has the added bonus that we'll gather data for the non-automigration case, too.
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to :Gijs from comment #7)
> > In fact a one second interval should still get us the data we need, given
> > the severity of the problem that we're trying to identify.
> 
> I'm not sure that's true... we don't know anything about this hanging since
> we can't reproduce it, but the videos don't show the typical "not
> responding" stuff from Windows (with a native title-bar being drawn despite
> our asking Windows not to - or has Win10 stopped doing that?) so I'm not
> sure exactly what's happening.

The videos seem to show Firefox being super slow but processing events every now and then rather than being completely hung, so this would explain why Windows doesn't consider Firefox hung, but a timer with a one second interval should still detect that case.

> Really, to be sure whether whatever reporting
> we add is effective we'd need to be able to reproduce, or just land it and
> then know based on whether we get results back that show non-zero-hanginess.

Right.

> Correlating telemetry after collection is almost impossible, IME. I would
> add 3 probes similar to the _IMPORT_MS ones and collect hangs for each item.
> This has the added bonus that we'll gather data for the non-automigration
> case, too.

Okay.
Attached patch patch v4 (obsolete) — Splinter Review
Attachment #8836371 - Attachment is obsolete: true
Attachment #8836384 - Attachment description: responsivenessMonitor.diff → patch v4
Attached patch patch v5 (obsolete) — Splinter Review
Somehow getKeyedHistogramById refuses to find my histograms. Maybe you know or see what I did wrong... Probably something trivial or terribly stupid.
Attachment #8836384 - Attachment is obsolete: true
Attachment #8836449 - Flags: feedback?(gijskruitbosch+bugs)
(In reply to Dão Gottwald [:dao] from comment #10)
> Created attachment 8836449 [details] [diff] [review]
> patch v5

This is on top of bug 1338812's patch which has landed but isn't on mozilla-central yet.
Comment on attachment 8836449 [details] [diff] [review]
patch v5

Review of attachment 8836449 [details] [diff] [review]:
-----------------------------------------------------------------

The histograms are compiled into a cpp file, so you need a non-artifact build (so also not ./mach build faster even on a non-artifact build) of toolkit/components in order to get the changes. Yes, it's annoying. I don't know why telemetry works that way and why it doesn't read the definitions out of an external file (presumably an optimization of sorts).

Note that this will also need data-review from e.g. bsmedberg.

::: browser/components/migration/MigrationUtils.jsm
@@ +265,5 @@
> +    let maybeStartResponsivenessMonitor = resourceType => {
> +      let responsivenessMonitor;
> +      let histogramId = getHistogramIdForResourceType(resourceType, "FX_MIGRATION_*_JANK_MS");
> +      if (histogramId) {
> +        responsivenessMonitor = new ResponsivenessMonitor;

Nit: add () to the new ...() call.

@@ +268,5 @@
> +      if (histogramId) {
> +        responsivenessMonitor = new ResponsivenessMonitor;
> +        responsivenessMonitor.start();
> +      }
> +      return [responsivenessMonitor, histogramId];

We always deconstruct this, so this could return object shorthand ({responsivenessMonitor, histogramId}) that we deconstruct that way, avoiding callsites potentially mis-assigning these by changing the order, which can happen in the array case (and meaning you might not have to update callsites if we add/remove return values).

::: toolkit/modules/ResponsivenessMonitor.jsm
@@ +26,5 @@
> +    this._timer.initWithCallback(this, this.intervalMS, Ci.nsITimer.TYPE_REPEATING_SLACK);
> +  },
> +
> +  notify(timer) {
> +    if (timer != this._timer) {

Can this actually happen? Looks to me like `this.abort()` would kill a pre-existing timer even if we call start() multiple times on the same object. I guess in theory the timer's cancel() call could race with the notification having already been queued on our thread that the timer has 'hit', or something?

Perhaps it makes more sense to move the start() code into the constructor instead? That would make it impossible to reuse the object, which might be easier to understand.
Attachment #8836449 - Flags: feedback?(gijskruitbosch+bugs) → feedback+
Attachment #8836449 - Attachment is obsolete: true
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

https://reviewboard.mozilla.org/r/112090/#review113366
Attachment #8836743 - Flags: review?(gijskruitbosch+bugs) → review+
FWIW, I can confirm that I can see a 7-10-fold difference (~1.5s vs. 14-30s) in the values of this histogram if I remove all the frecency notifications (which get posted from the sql update queries via sql functions to the sql thread and then to the main thread and then to various JS consumers) as well as various other callbacks that all post their own notification to the main thread from the sql thread (as runnables). So it seems to measure something useful, at least.
Attachment #8836743 - Flags: review?(benjamin)
Attachment #8836743 - Flags: feedback?(benjamin)
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

Also trying data collection peers who might be closer to my time zone.
Attachment #8836743 - Flags: feedback?(a.m.naaktgeboren)
Attachment #8836743 - Flags: feedback?(rweiss)
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

https://reviewboard.mozilla.org/r/112090/#review113782

::: toolkit/components/telemetry/Histograms.json:5116
(Diff revision 2)
> +    "kind": "exponential",
> +    "n_buckets": 20,
> +    "high": 60000,
> +    "releaseChannelCollection": "opt-out",
> +    "keyed": true,
> +    "description": "Accumulated timer delay as an indicator for decreased responsiveness while auto-importing bookmarks from another browser, keyed by the name of the browser."

Reading both this description and the related code, I still don't quite understand what this measures. You have a 1s timer and then measure the variance between when you got the timer and when you expected to get it? Doesn't that scale by the number of seconds the whole process takes? How is that valuable to you?

In addition, this whole system is using JS date, which is non-monotonic and therefore very vulnerable to clock skew, which is especially a problem for short-duration measurements such as this.

We have much more fine-grained event loop telemetry available in general which measures runnable delays, and it's probably very easy to subset that for specific tasks like this.

That's not a data-review-, exactly, but I do think you should at least improve the description because it's very confusing.
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

I'd like to see the description again.
Attachment #8836743 - Flags: feedback?(rweiss)
Attachment #8836743 - Flags: feedback?(benjamin)
Attachment #8836743 - Flags: feedback?(a.m.naaktgeboren)
Attachment #8836743 - Flags: feedback-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #18)
> Comment on attachment 8836743 [details]
> Bug 1338522 - Add telemetry for responsiveness during auto migration of
> profile data from another browser.
> 
> https://reviewboard.mozilla.org/r/112090/#review113782
> 
> ::: toolkit/components/telemetry/Histograms.json:5116
> (Diff revision 2)
> > +    "kind": "exponential",
> > +    "n_buckets": 20,
> > +    "high": 60000,
> > +    "releaseChannelCollection": "opt-out",
> > +    "keyed": true,
> > +    "description": "Accumulated timer delay as an indicator for decreased responsiveness while auto-importing bookmarks from another browser, keyed by the name of the browser."
> 
> Reading both this description and the related code, I still don't quite
> understand what this measures. You have a 1s timer and then measure the
> variance between when you got the timer and when you expected to get it?

Yes.

> Doesn't that scale by the number of seconds the whole process takes?

Yes.

> How is that valuable to you?

It exposes a problem regardless of how long the whole process takes. Most of the import is happening in a background thread and not expected to block the UI.

> In addition, this whole system is using JS date, which is non-monotonic and
> therefore very vulnerable to clock skew, which is especially a problem for
> short-duration measurements such as this.

'performance' doesn't seem to be available in JSMs and Components.utils.importGlobalProperties doesn't seem to help. The measurement is not that short-duration -- people have seen Firefox become unresponsive for more than a minute.

> We have much more fine-grained event loop telemetry available in general
> which measures runnable delays, and it's probably very easy to subset that
> for specific tasks like this.

Unfortunately I know nothing at all about that and we're on a tight schedule.

> That's not a data-review-, exactly, but I do think you should at least
> improve the description because it's very confusing.

I'm not sure where it's lacking since you seem to have figured out what it means, but I'll see what I can do.
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

reviewboard
Attachment #8836743 - Flags: feedback?(benjamin)
Blocks: 1249008
Summary: Add telemetry for responsiveness during auto migration of profile data from another browser → Add telemetry for responsiveness during migration of profile data from another browser
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

Redirecting to Rebecca / François as suggested by bsmedberg since he's on PTO.
Attachment #8836743 - Flags: feedback?(rweiss)
Attachment #8836743 - Flags: feedback?(francois)
Attachment #8836743 - Flags: feedback?(benjamin)
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

https://reviewboard.mozilla.org/r/112090/#review115136

The new descriptions look much clearer to me. I would suggest two small tweaks.

Linking to the list of possible keys in the description of each probe:

    keyed by the name of the browser (see gAvailableMigratorKeys in browser/components/migration/MigrationUtils.jsm)

and including units in the description:

    Accumulated timer delay in ms

datareview+
Attachment #8836743 - Flags: review+
A few quick questions:

1) Will this be added to opt-out telemetry on release? (Yes/No)
2) Will this be left on forever, or is it only bound to a specific version of Firefox? (yes/version(s))
3) How will you be collecting the resulting data and analyzing it?  (t.m.o, s.t.m.o, a.t.m.o, don't know).

If the answer to (3) is don't know, we should probably discuss further.  If the answer to (2) is "forever," who is responsible for monitoring this histogram?
Flags: needinfo?(dao+bmo)
(In reply to Rebecca Weiss from comment #25)
> A few quick questions:
> 
> 1) Will this be added to opt-out telemetry on release? (Yes/No)

Yes

> 2) Will this be left on forever, or is it only bound to a specific version
> of Firefox? (yes/version(s))

57 is the version auto-migration might ship with. So I marked this as expiring in version 58, but ultimately I expect it will depend on the data we find. If at some point we consider the problem under control, I suppose we can remove this, although an argument could be made for keeping it longer. I don't currently plan on keeping this forever though.

> 3) How will you be collecting the resulting data and analyzing it?  (t.m.o,
> s.t.m.o, a.t.m.o, don't know).

telemetry.mozilla.org is what I would start with, I expect it will give us good round-up insight.
Flags: needinfo?(dao+bmo)
(In reply to François Marier [:francois] from comment #24)
> The new descriptions look much clearer to me. I would suggest two small
> tweaks.

Done. Thanks for the feedback.

I also changed the timer interval back to 100ms after the recent discussion in bug 1332225.
Pushed by dgottwald@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23391448faad
Add telemetry for UI responsiveness during import of profile data from another browser. r=francois,Gijs
Summary: Add telemetry for responsiveness during migration of profile data from another browser → Add telemetry for UI responsiveness during migration of profile data from another browser
https://hg.mozilla.org/mozilla-central/rev/23391448faad
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 54
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

Approval Request Comment
[Feature/Bug causing the regression]: bug 1332225 / bug 1332318
[User impact if declined]: /
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: no
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: /
[Is the change risky?]: no
[Why is the change risky/not risky?]: only adding telemetry, no behavior change
[String changes made/needed]: /
Attachment #8836743 - Flags: approval-mozilla-aurora?
Attached patch patch for betaSplinter Review
Attachment #8839170 - Flags: approval-mozilla-beta?
Is this critical enough that it really needs to get into 52 less than a week from RC?
Flags: needinfo?(dao+bmo)
(In reply to Julien Cristau [:jcristau] from comment #33)
> Is this critical enough that it really needs to get into 52 less than a week
> from RC?

Yes, it's crucial for bug 1332318.
Flags: needinfo?(dao+bmo)
Comment on attachment 8836743 [details]
Bug 1338522 - Add telemetry for UI responsiveness during import of profile data from another browser.

Collect telemetry data to investigate the hang issue. Aurora53+.
Attachment #8836743 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8839170 [details] [diff] [review]
patch for beta

telemetry for migration code, beta52+

OK, let's get this in b9.
Attachment #8839170 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.