Closed Bug 1254099 Opened 4 years ago Closed 2 years ago

Add telemetry for failed Kinto collection sync

Categories

(Cloud Services :: Firefox: Common, defect)

defect
Not set

Tracking

(firefox55 fixed)

RESOLVED FIXED
Tracking Status
firefox55 --- fixed

People

(Reporter: mgoodwin, Assigned: leplatrem)

References

Details

(Whiteboard: [psm-assigned])

Attachments

(1 file, 3 obsolete files)

We need to know when collection sync is failing (and why). Adding telemetry will allow us to get some information on this.
See Also: → 1257565
Depends on: 1265008
Whiteboard: [psm-assigned]
Component: Security: PSM → Firefox: Common
Product: Core → Cloud Services
Comment on attachment 8775199 [details]
Bug 1254099 - Add telemetry for failed Kinto collection sync. r = MattN

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/67434/diff/1-2/
Matt, I just noticed the commit message is wrong (should be r?MattN) - just letting you know in case that has messed mozreview up.

Cheers.
Flags: needinfo?(MattN+bmo)
That did seem to mess mozreview up as you can see that it didn't add the flag on BMO.
Flags: needinfo?(MattN+bmo)
https://reviewboard.mozilla.org/r/67434/#review65180

Here are some initial comments. You will need to request data review[1] eventually too.

[1] https://wiki.mozilla.org/Firefox/Data_Collection#Requesting_Approval

::: services/common/blocklist-clients.js:26
(Diff revision 2)
> +const SERVICES_SETTINGS_SYNC_SIG_FAIL        = 3;
> +const SERVICES_SETTINGS_SYNC_RETRY_SIG_FAIL  = 4;
> +const SERVICES_SETTINGS_SYNC_ONECRL_FAIL     = 5;
> +const SERVICES_SETTINGS_SYNC_ONECRL_ENTRY    = 6;
> +const SERVICES_SETTINGS_SYNC_ADDONS_FAIL     = 7;
> +const SERVICES_SETTINGS_SYNC_GFX_FAIL        = 8;
> +const SERVICES_SETTINGS_SYNC_PLUGINS_FAIL    = 9;

I still find the interchanging of "settings" and "blocklist" really confusing and wish we didn't do that. This file is about blocklists but it's recording higher level settings telemetry which feels wrong to me.

::: services/common/blocklist-clients.js:29
(Diff revision 2)
>  const { CanonicalJSON } = Components.utils.import("resource://gre/modules/CanonicalJSON.jsm");
>  
> +const SERVICES_SETTINGS_SYNC_SIG_FAIL        = 3;
> +const SERVICES_SETTINGS_SYNC_RETRY_SIG_FAIL  = 4;
> +const SERVICES_SETTINGS_SYNC_ONECRL_FAIL     = 5;
> +const SERVICES_SETTINGS_SYNC_ONECRL_ENTRY    = 6;

This looks unused at first glance

::: services/common/blocklist-clients.js:117
(Diff revision 2)
> -  constructor(collectionName, lastCheckTimePref, processCallback, signerName) {
> +  constructor(collectionName, lastCheckTimePref, processCallback, signerName,
> +              failureBucket) {
>      this.collectionName = collectionName;
>      this.lastCheckTimePref = lastCheckTimePref;
>      this.processCallback = processCallback;
>      this.signerName = signerName;
> +    this.failureBucket = failureBucket;

Instead of adding another argument, we could use a keyed histogram with buckets for success and failure and use the collectionName as the key?

::: toolkit/components/telemetry/Histograms.json:7994
(Diff revision 2)
> +  "SERVICES_SETTINGS_SUCCESS": {
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 20,
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1254099],
> +    "description": "Services settings (kinto) sync success information (0=Sync completed, 1=Sync failed, 2='Changes' check failed, 3=Signature verification failed, 4=Retry signature verification failed, 5=OneCRL sync failed, 6=OneCRL entry failed, 7=Addons sync failed, 8=Gfx sync failed, 9=Plugins sync failed"
> +  },

The "SERVICES_SETTINGS_" prefix isn't descriptive enough to me. It should somehow hint at being related to syncing/downloading Firefox settings. e.g. "SETTINGS_DOWNLOAD_".

The "_SUCCESS" suffix seems misleading since some buckets are for success and some are for failure. I think "_RESULT" would address that issue.

::: toolkit/components/telemetry/Histograms.json:8000
(Diff revision 2)
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 20,
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1254099],
> +    "description": "Services settings (kinto) sync success information (0=Sync completed, 1=Sync failed, 2='Changes' check failed, 3=Signature verification failed, 4=Retry signature verification failed, 5=OneCRL sync failed, 6=OneCRL entry failed, 7=Addons sync failed, 8=Gfx sync failed, 9=Plugins sync failed"

It seems like we shouldn't be mixing counts that have different denomintors (e.g. 1=Sync failed, 5=OneCRL sync failed, and 6=OneCRL entry failed) in one histogram. Is there a specific reason you want this? I think my suggestion of a separate histogram of results keyed by collection would be easier to analyze.
Blocks: 1332286
Depends on: 1332766
Assignee: mgoodwin → mathieu
I rebased and revamped Mark's patch here: https://reviewboard.mozilla.org/r/109178/

There's still some polishing to perform, but I would really appreciate your feedback on the approach I took. There is one histogram per collection of settings and each type error during synchronization leads to a specific reported value.

Thank you in advance for your time ;)
Flags: needinfo?(rhelmer)
Flags: needinfo?(mgoodwin)
Flags: needinfo?(MattN+bmo)
Attachment #8833323 - Attachment is obsolete: true
Attachment #8833324 - Attachment is obsolete: true
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review112902

Is it enough for you to collect this data from our "opt-in" population?
(I.e. all users onBeta, Aurora, Nightly users and those users who opted in to extended Telemetry on Release)

Do you need to collect these probes forever? Or is it enough to collect them for a few Firefox releases?

Note that adding new probes still requires data collection review [1] after (technical) code review.

1: https://wiki.mozilla.org/Firefox/Data_Collection

::: toolkit/components/telemetry/Histograms.json:8969
(Diff revision 2)
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 5,
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1254099],
> +    "description": "Services settings polling result information (0=Success, 1=Network error, 2=Server response error)"

We can use categorical histograms for these kind of probes and provide labels.
That will show up like [this](https://mzl.la/2kbQWLy).

See here:
- [docs](https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type)
- [example probe](https://dxr.mozilla.org/mozilla-central/search?q="FX_PREFERENCES_CATEGORY_OPENED")
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> [docs](https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> Adding_a_new_Telemetry_probe#Choosing_a_Histogram_Type)

We just migrated and updated this documentation:
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/histograms.html#categorical
Flags: needinfo?(rhelmer)
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review117128

This looks OK to me.
Attachment #8833322 - Flags: review+
Flags: needinfo?(mgoodwin)
Attachment #8775199 - Attachment is obsolete: true
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

Hi Benjamin!

We want to monitor the uptake rate of remote settings, and the distribution of the possible errors. 

We would gladly receive your feedback, since this is our first contact with Telemetry.

Thanks a lot in advance for your insights!

> Note: :gfritzsche was not horrified by the approach taken here but should also get back to us for further feedback.
Attachment #8833322 - Flags: feedback?(benjamin)
Depends on: 1343855
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review119262

::: toolkit/components/telemetry/Histograms.json:9159
(Diff revision 4)
>    "SERVICE_WORKER_REQUEST_PASSTHROUGH": {
>      "expires_in_version": "50",
>      "kind": "boolean",
>      "description": "Intercepted fetch sending back same Request object. File bugs in Core::DOM in case of a Telemetry regression."
>    },
> +  "SERVICE_SETTINGS_POLLING_RESULT": {

This is not marked as opt-out, but I expect that this sort of error telemetry really should be opt-out and monitored in release. Was that an intentional choice or oversight?

::: toolkit/components/telemetry/Histograms.json:9165
(Diff revision 4)
> +    "expires_in_version": "never",
> +    "kind": "categorical",
> +    "labels": ["up_to_date", "success", "backoff", "network_error", "server_error"],
> +    "alert_emails": ["seceng-telemetry@mozilla.com"],
> +    "bug_numbers": [1254099],
> +    "description": "Services settings polling result information."

I don't recognize the phrase "services settings". Is this something new? This is worthy of a link or more explanation.
The only other "requirement" for expires: never telemetry is that there is a commitment to actual monitor this data in perpetuity. Do you have a plan for how you're going to do that yet?

This is exactly the kind of error reporting signal that we want to include in the new "war room" effort, so please see https://docs.google.com/a/mozilla.com/document/d/1SgpqUXnOLCakx5HdDoLkv83Pt4iNUxeQMfV0iQ3fqBo/edit?usp=sharing and talk to Mauro Doglio and Ryan VanderMeulen about including this metric as part of that reporting effort.
Flags: needinfo?(mathieu)
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

Cancelling data-r for now and Mathieu's going to get back to me.
Attachment #8833322 - Flags: feedback?(benjamin)
Depends on: 1347216
Flags: needinfo?(mathieu)
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review126508

data-r=me (I did not review the code)
Attachment #8833322 - Flags: review?(benjamin) → review+
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review134906
Attachment #8833322 - Flags: review+
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review146146
Attachment #8833322 - Flags: review?(rhelmer) → review+
Awesome, thanks Rob!

What do you think if I add these additional statuses:

- PREF_DISABLED: Update disabled in user prefs
- NETWORK_OFFLINE_ERROR: Network not available
- DOWNLOAD_ERROR: Failed to fully retrieve the data (eg. download interrupted, failed to write in local folder, ...)
- CLEANUP_ERROR : Failed to clean-up temporary files (eg. failed to remove folder)

I found them reading the balrog, shield and addons update clients. I didn't want to make the status list longer than it is, but at the same time I believe they make sense.

This is the last bit before shipping \o/
Flags: needinfo?(rhelmer)
(In reply to Mathieu Leplatre (:leplatrem) from comment #29)
> Awesome, thanks Rob!
> 
> What do you think if I add these additional statuses:
> 
> - PREF_DISABLED: Update disabled in user prefs
> - NETWORK_OFFLINE_ERROR: Network not available
> - DOWNLOAD_ERROR: Failed to fully retrieve the data (eg. download
> interrupted, failed to write in local folder, ...)
> - CLEANUP_ERROR : Failed to clean-up temporary files (eg. failed to remove
> folder)
> 
> I found them reading the balrog, shield and addons update clients. I didn't
> want to make the status list longer than it is, but at the same time I
> believe they make sense.
> 
> This is the last bit before shipping \o/

These all make sense to me. I think having a superset of what balrog/shield/addons require is a good approach!
Flags: needinfo?(rhelmer)
Comment on attachment 8833322 [details]
Bug 1254099 - Add Telemetry to settings update

https://reviewboard.mozilla.org/r/109180/#review147702

Just a couple grammar nits.

::: toolkit/components/telemetry/docs/collection/uptake.rst:10
(Diff revision 10)
> +Firefox continuously pulls data from different remote sources (eg. settings, system add-ons, …). In order to have consistent insights about the *uptake rate* of these *update sources*, our clients can use a unified Telemetry helper to report their *update status*.
> +
> +The helper — described below — reports predefined update status, which eventually gives a unified way to obtain:
> +
> +* the proportion of success among clients;
> +* its evolution in the course of time;

I would just say `over time` instead of `in the course of time` (although `in the course of time` doesn't sound incorrect)

::: toolkit/components/telemetry/docs/collection/uptake.rst:11
(Diff revision 10)
> +
> +The helper — described below — reports predefined update status, which eventually gives a unified way to obtain:
> +
> +* the proportion of success among clients;
> +* its evolution in the course of time;
> +* the distribution of errors causes.

error causes

::: toolkit/components/telemetry/docs/collection/uptake.rst:31
(Diff revision 10)
> +
> +   UptakeTelemetry.report(source, status);
> +
> +- ``source`` - a ``string`` that is an identifier for the update source (eg. ``addons-blocklist``)
> +- ``status`` - one of the following status constants:
> +  - ``UptakeTelemetry.STATUS.UP_TO_DATE``: Remote content is up-to-date with local content.

How about `Local content was already up-to-date with remote content`?
Attachment #8833322 - Flags: review+
This is now ready to land \o/

Benjamin, Georg, since you gave your approval a while, and since I modified the Telemetry docs to add a dedicated section about Uptake telemetry, could you please confirm you're still OK with the changes?

Thanks a lot everyone!
Flags: needinfo?(gfritzsche)
Flags: needinfo?(benjamin)
Without diving into the details of the doc addition, this looks well documented to me.
Flags: needinfo?(gfritzsche)
This looks fine. Thank you.
Flags: needinfo?(benjamin)
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/744a91bd4921
Add Telemetry to settings update r=bsmedberg,glasserc,mgoodwin,rhelmer
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/744a91bd4921
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Flags: needinfo?(MattN+bmo)
You need to log in before you can comment on or make changes to this bug.