Telemetry/FHR for FxA-related authentication issues

RESOLVED FIXED in Firefox 43

Status

()

defect
P1
normal
Rank:
15
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: markh, Assigned: lina)

Tracking

(Blocks 1 bug)

unspecified
Firefox 43
Points:
8
Dependency tree / graph
Bug Flags:
firefox-backlog +
qe-verify -

Firefox Tracking Flags

(firefox43 fixed)

Details

(Whiteboard: [fxsync])

Attachments

(1 attachment, 6 obsolete attachments)

We have a higher than expected rate of "unexplained authentication errors" - see bug 1119667.  We should use telemetry/fhr to see how widespread the problem is and to try and see if there is a common (but elusive) denominator.

As a first cut, we should look at all the logs in the bugs hanging off bug 1119667 and record entries for anything unexplained.  We should also record things we can identify as "expected but somewhat exceptional" - eg, password changed on another device, and ensure we can record the success of our handling of this situation (eg, when a password was changed on another device, see how often the reauth infobar is shown and how often the user successfully (or not) reconnects.

Note that we probably need/want to use FHR, and this may imply we need to write a custom dashboard.
Flags: qe-verify-
Flags: firefox-backlog+
Assignee: nobody → mhammond
Status: NEW → ASSIGNED
Iteration: --- → 38.2 - 9 Feb
Iteration: 38.2 - 9 Feb → 38.3 - 23 Feb
Assignee: mhammond → nobody
Status: ASSIGNED → NEW
Iteration: 38.3 - 23 Feb → ---
In particular, the issue reported in bug 1116119.
Duplicate of this bug: 1175343
Priority: -- → P1
Assignee: nobody → kcambridge
Hi Mark! Here's a first cut at collecting some metrics around authentication issues. I added a separate `telemetry.js` module that stores the histogram names, and maps error classes to enumerated fields (I hope I'm using those correctly).
Attachment #8629129 - Flags: feedback?(markh)
Attachment #8629129 - Flags: feedback?(ckarlof)
Added `alert_emails` and a new `ERROR_NO_TOKEN` enum value for when `_canFetchKeys()` returns `false`; fixed `expires_in_version`.
Attachment #8629129 - Attachment is obsolete: true
Attachment #8629129 - Flags: feedback?(markh)
Attachment #8629129 - Flags: feedback?(ckarlof)
Attachment #8629159 - Flags: feedback?(markh)
Attachment #8629159 - Flags: feedback?(ckarlof)
Comment on attachment 8629159 [details] [diff] [review]
0001-WIP-First-batch-of-Sync-auth-error-metrics.patch

Review of attachment 8629159 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/sync/modules/browserid_identity.js
@@ +211,5 @@
>        // this and the rest of initialization off in the background (ie, we
>        // don't return the promise)
>        this._log.info("Waiting for user to be verified.");
> +      if (isInitialSync) {
> +        Weave.Telemetry.verifiedState(accountData.verified);

I thought this might be helpful for tracking how many times we try to Sync when the user hasn't verified their account—maybe indicating she mistyped her email during signup, and can't go back to fix it?

@@ +466,5 @@
>      // so no keys - ensure MP unlocked.
>      if (!Utils.ensureMPUnlocked()) {
>        // user declined to unlock, so we don't know if they are stored there.
>        log.debug("unlockAndVerifyAuthState: user declined to unlock master-password");
> +      Weave.Telemetry.authError("unlock", MASTER_PASSWORD_LOCKED);

The key name could probably just be the name of the function. I'm not sure how keys for histograms are surfaced in the telemetry dashboards.

@@ +481,5 @@
> +        let result;
> +        if (this._canFetchKeys()) {
> +          result = STATUS_OK;
> +        } else {
> +          result = LOGIN_FAILED_LOGIN_REJECTED;

I think this catches the case where the login manager is unlocked, but we can't fetch keys. Just in case, I added `ERROR_NO_TOKEN` for when this function returns `null`.

@@ +649,5 @@
>      this._token = null;
>      return this._fetchTokenForUser().then(
>        token => {
> +        if (!token) {
> +          Weave.Telemetry.authError("ensureValidToken", Weave.Telemetry.ERROR_NO_TOKEN);

Not sure if this is an error condition, since we resolve the promise with `null` instead of rejecting. On the other hand, it looks like the only way the token *can* be `null` is if we can't immediately fetch keys (line 537).

::: services/sync/modules/telemetry.js
@@ +29,5 @@
> +  [INFO_COLLECTIONS_DOWNLOAD_FAIL]: 5,
> +  [LOGIN_FAILED_SERVER_ERROR]: 6,
> +  [ERROR_SIGN_IN_INVALID]: 7,
> +  [ERROR_SIGN_IN_UNKNOWN]: 8,
> +  [fxAccountsCommon.NO_ACCOUNT]: 9,

The more I think about this, the more I think reusing Sync and FxA constants for recording different error types is a bad idea. We already have a few cases that aren't covered by either module (`ERROR_NO_TOKEN`, `ERROR_SIGN_IN_INVALID`, and so on); might as well expand that list and use `Weave.Telemetry.*` constants everywhere.

::: toolkit/components/telemetry/Histograms.json
@@ +8310,5 @@
> +    "expires_in_version": "50",
> +    "kind": "count",
> +    "keyed": true,
> +    "releaseChannelCollection": "opt-out",
> +    "description": "Tracks Sync engine failures"

This is tracking `count.failed` (from `modules/engines.js`); the one above is tracking `count.newFailed`. It would be good to clarify the difference between them.
Comment on attachment 8629159 [details] [diff] [review]
0001-WIP-First-batch-of-Sync-auth-error-metrics.patch

Review of attachment 8629159 [details] [diff] [review]:
-----------------------------------------------------------------

Let's discuss this some more - my comments below are where I'm leaning, bug I'm open to having my mind changed. The tl;dr of my comments are roughly:

* I think some telemetry should be collected closer to the source of the problem - eg, inside hawkclient and the FxA modules.
* We need to ensure we aren't capturing network errors as "problems", and may eventaally want to differentiate different http status codes (which ties into the above) - eg, eventually I'd expect we don't want to count well-formed 50X errors with an appropriate retry-after header as that implies "expected" server maintenance that is presumably already covered by server-side metrics and shouldn't skew the client-side problems we are trying to find here.
* I don't think we should conflate sync telemetry with fxa telemetry.

--

I'm thinking we should probably treat telemetry for Sync and FxA separately - eg, 2 modules if necessary, with one focused on FxA and the other on Sync. browserid_identity would probably use FxA, even though it is technically part of Sync.

The reporting for Sync should also subsume the existing probes where possible - ie, we might as well have *all* telemetry reported by Sync use the same technique. There are only a few existing ones, in service.js and Weave.js IIRC

Another probe that might be worthwhile: _fetchTokenForUser()'s call to _canFetchKeys is now theoretically redundant - the (newer) unlockAndVerifyAuthState() function should mean we never fail as we abort the login. However, this may not be actually true, so a probe to verify that might be nice.

We also need to think about tests here - some places may be hard to test, but some should be fairly easily testable.

::: browser/base/content/aboutaccounts/aboutaccounts.js
@@ +186,5 @@
>        // won't notify the user). If the email has already been verified,
>        // the jelly will say "Welcome! You are successfully signed in as
>        // EMAIL", but it won't then say "syncing started".
> +    }, (err) => {
> +      let errClass;

I'm not sure this is the correct place for this, and ISTM the only thing we are catching here is failure inside setSignedInUser - but that doesn't hit the servers (ie, it's not clear to me how this would actually get hit).

I think we should catch these errors inside FxAccounts.jsm

::: services/sync/modules/browserid_identity.js
@@ +211,5 @@
>        // this and the rest of initialization off in the background (ie, we
>        // don't return the promise)
>        this._log.info("Waiting for user to be verified.");
> +      if (isInitialSync) {
> +        Weave.Telemetry.verifiedState(accountData.verified);

I don't think this is the correct place for this check. Given what you describe and assuming we can correlate telemetry with a profile, I'd think recording a single counter when we find a logged in user (either verified or not) and another when we find the user *is* verified directly in FxAccounts might be enough - then we are looking for the the first flag existing but not the second.

Also, that first flag *almost* already exists in Weave.js, but that probe doesn't tell us if they are actually using FxA or not (but see also the comment about keeping Sync and FxA telemetry distinct)

@@ +466,5 @@
>      // so no keys - ensure MP unlocked.
>      if (!Utils.ensureMPUnlocked()) {
>        // user declined to unlock, so we don't know if they are stored there.
>        log.debug("unlockAndVerifyAuthState: user declined to unlock master-password");
> +      Weave.Telemetry.authError("unlock", MASTER_PASSWORD_LOCKED);

I'm not sure what this probe tells us - the user declining to unlock isn't a problem as such - we are far more interested in them unlocking and then finding stuff missing (ie, the probe below).

IOW, I'm not sure what question this probe answers.

@@ +481,5 @@
> +        let result;
> +        if (this._canFetchKeys()) {
> +          result = STATUS_OK;
> +        } else {
> +          result = LOGIN_FAILED_LOGIN_REJECTED;

I think we are going to want new constants for auth failure reasons. LOGIN_FAILED_LOGIN_REJECTED is a somewhat magic value used by Sync and is really just a flag saying "user must reauth". eg, here we'd probably need a flag indicating "user has no keys available". I'm also not sure we need to bother reporting the "success" case here (eg, the initial branch is just as successy and that's not counted)

@@ +595,5 @@
>        .then(null, err => {
>          // TODO: unify these errors - we need to handle errors thrown by
>          // both tokenserverclient and hawkclient.
>          // A tokenserver error thrown based on a bad response.
> +        let errType;

I'm in 2 minds about the probes here. On one hand it probably makes sense to have some of these probes directly in hawk and fxAccountClient.

But on the other hand, there are things that could go wrong that aren't related to remote requests - eg, when bid_identity calls getKeys(), that could fail due to the hawk etc, but could also fail due to FxAccounts.jsm doing:

   currentState.whenKeysReadyDeferred.reject('No keyFetchToken');

which I believe shouldn't happen, but if it did, would be reported as a "network error" (which it clearly isn't).

So I'm thinking we might be able to do both:

* Have hawkclient.js and/or fxAccountsClient.jsm record telemetry for all "expected" errors (ie, particularly when hitting the network), but ensure we filter out network errors.

* Have _fetchTokenForUser() record coarse telemetry (ie, a simple counter for what it considers an auth error).

We should then be able to see these match up - but if we see far more auth errors reported by _fetchTokenForUser() but no corresponding counters for the servers returning 401s, then we know we have other errors we need to dig further for (eg, one of the other .reject() calls that should be impossible)

@@ +649,5 @@
>      this._token = null;
>      return this._fetchTokenForUser().then(
>        token => {
> +        if (!token) {
> +          Weave.Telemetry.authError("ensureValidToken", Weave.Telemetry.ERROR_NO_TOKEN);

I think we are largely handling fetchTokenForUser failing in the blocks above. We probably want a new probe down in the "Failed to fetch the cluster URL" block, but keeping in mind that a 401 later can just mean a node-reassignment, which is theoretically rare but also somewhat expected (and can also be measured by the server)

::: services/sync/modules/service.js
@@ +577,5 @@
>  
>      if (!this.identity.syncKey) {
>        this.status.login = LOGIN_FAILED_NO_PASSPHRASE;
>        this.status.sync = CREDENTIALS_CHANGED;
> +      Weave.Telemetry.setupError("keys", LOGIN_FAILED_NO_PASSPHRASE);

this is never going to be hit for FxA sync (and I doubt we care about legacy here) and ditto for the next few entries.  For FxA, if this.identity.syncKey returns true, syncKeyBundle will too, and they will always return true if browserid_identity has successfully logged the user in (and yeah, it would be great when we can kill the old identity stuff - hopefully later this year when the 1.1 servers are decomissioned).

So in general I don't think there's much in this function that's valuable.

@@ +599,5 @@
>        // This only applies when the server is already at version 4.
>        if (infoResponse.status != 200) {
>          this._log.warn("info/collections returned non-200 response. Failing key fetch.");
>          this.status.login = LOGIN_FAILED_SERVER_ERROR;
>          this.errorHandler.checkServerError(infoResponse);

these next couple might be valuable, although ideally we'd be able to differentiate between network errors and actual auth errors

@@ +764,5 @@
>            this._log.warn("401: login failed.");
> +          if (!allow40XRecovery) {
> +            Weave.Telemetry.authError("verify", LOGIN_FAILED_LOGIN_REJECTED);
> +          }
> +          return this._handleWrongCluster(allow40XRecovery);

I'd be inclined to not add the new function and just do something like:
  let telemValue = test.status == 401 ? LOGIN_FAILED_LOGIN_REJECTED : INFO_COLLECTIONS_DOWNLOAD_FAIL;

@@ +1118,5 @@
>        // should be able to get the existing meta after we get a new node.
>        if (this.recordManager.response.status == 401) {
>          this._log.debug("Fetching meta/global record on the server returned 401.");
>          this.errorHandler.checkServerError(this.recordManager.response);
> +        Weave.Telemetry.setupError("oldMeta", METARECORD_DOWNLOAD_FAIL);

This is a node reassignment, which I don't think we need to record here.

::: services/sync/modules/telemetry.js
@@ +30,5 @@
> +  [LOGIN_FAILED_SERVER_ERROR]: 6,
> +  [ERROR_SIGN_IN_INVALID]: 7,
> +  [ERROR_SIGN_IN_UNKNOWN]: 8,
> +  [fxAccountsCommon.NO_ACCOUNT]: 9,
> +  [fxAccountsCommon.ERROR_AUTH_ERROR]: 10,

I agree. But as I'm implying in other comments, I'm not sure trying to unify, say, all auth problems with a single probe makes much sense - eg, maybe we want 1 probe for hawk failures and a different one for tokenserver errors, so we probably want different constants for each.

I'm also still in 2 minds about the need for this module - I can see it makes sense for keyed histograms (in which case we'd also expect some local validation of the key), but don't see much value for the simple counters (it's just another layer of indirection that makes grepping for telemetry that little bit less greppable)

I think we can defer this decision until we see how it ends up looking though, but if we do keep the module, we should be clear and consistent (eg, if we decide to also keep all simple counters here, then we should ensure all existing counters also use it)

@@ +80,5 @@
> +    histogram.add(engine, delta);
> +  },
> +
> +  syncError(engine) {
> +    let histogram = Services.telemetry.getKeyedHistogramById("WEAVE_ENGINE_ERRORS");

I don't think these are keyed histograms?
Attachment #8629159 - Flags: feedback?(markh)
Blocks: 1182288
Rank: 15
(In reply to Mark Hammond [:markh] from comment #6)
> * I think some telemetry should be collected closer to the source of the
> problem - eg, inside hawkclient and the FxA modules.

+1. I removed `telemetry.js` in the latest patch, since we no longer need enumerated histograms, or the extra layer of indirection. It looks like `FxAccountsClient` is the only direct consumer of `hawkclient`, so I just added Hawk error checking to the former.

> * We need to ensure we aren't capturing network errors as "problems", and
> may eventaally want to differentiate different http status codes (which ties
> into the above) - eg, eventually I'd expect we don't want to count
> well-formed 50X errors with an appropriate retry-after header as that
> implies "expected" server maintenance that is presumably already covered by
> server-side metrics and shouldn't skew the client-side problems we are
> trying to find here.

Makes sense. The latest patch should only record auth errors, with the exception of the catch-all "can't fetch keys" and "failed to fetch the cluster URL" cases.

> Another probe that might be worthwhile: _fetchTokenForUser()'s call to
> _canFetchKeys is now theoretically redundant - the (newer)
> unlockAndVerifyAuthState() function should mean we never fail as we abort
> the login. However, this may not be actually true, so a probe to verify that
> might be nice.

`WEAVE_CAN_FETCH_KEYS` now tracks this.

> I'm not sure this is the correct place for this, and ISTM the only thing we
> are catching here is failure inside setSignedInUser - but that doesn't hit
> the servers (ie, it's not clear to me how this would actually get hit).

D'oh, thanks for catching that! Not sure what I was thinking there...

> I don't think this is the correct place for this check. Given what you
> describe and assuming we can correlate telemetry with a profile, I'd think
> recording a single counter when we find a logged in user (either verified or
> not) and another when we find the user *is* verified directly in FxAccounts
> might be enough - then we are looking for the the first flag existing but
> not the second.

The question I wanted to answer was, how many users try to set up Sync, but never actually sync because their account isn't verified? :vladikoff confirmed on IRC that the content server tracks whether a user is signing up for Sync, so we can correlate those numbers against unverified accounts in Datadog. So this can be removed entirely. :-)

> I'm not sure what this probe tells us - the user declining to unlock isn't a
> problem as such - we are far more interested in them unlocking and then
> finding stuff missing (ie, the probe below).

Removed for now. I was wondering if some folks decline to unlock, and never sync as a result. But—assuming that's a question we're interested in answering—maybe we should only record repeated unlock failures.

> So I'm thinking we might be able to do both:
> 
> * Have hawkclient.js and/or fxAccountsClient.jsm record telemetry for all
> "expected" errors (ie, particularly when hitting the network), but ensure we
> filter out network errors.
> 
> * Have _fetchTokenForUser() record coarse telemetry (ie, a simple counter
> for what it considers an auth error).

Sounds good. `WEAVE_FXA_KEY_FETCH_ERRORS` is the coarser counter; `FXA_UNVERIFIED_ACCOUNT_ERRORS`, `FXA_HAWK_ERRORS`, and `TOKENSERVER_AUTH_ERRORS` are the specific error counters.

> I think we are largely handling fetchTokenForUser failing in the blocks
> above. We probably want a new probe down in the "Failed to fetch the cluster
> URL" block, but keeping in mind that a 401 later can just mean a
> node-reassignment, which is theoretically rare but also somewhat expected
> (and can also be measured by the server)

Solid. Added `WEAVE_FXA_NODE_ASSIGNMENT_ERRORS` as another coarse counter for this.
Attachment #8629159 - Attachment is obsolete: true
Attachment #8629159 - Flags: feedback?(ckarlof)
Attachment #8635055 - Flags: feedback?(markh)
Comment on attachment 8635055 [details] [diff] [review]
0001-WIP-First-batch-of-Sync-auth-error-metrics.patch

Review of attachment 8635055 [details] [diff] [review]:
-----------------------------------------------------------------

A very quick scan at the end of my day, but I think this is looking great!

::: services/sync/modules/browserid_identity.js
@@ +478,5 @@
> +        if (this._canFetchKeys()) {
> +          result = STATUS_OK;
> +        } else {
> +          result = LOGIN_FAILED_LOGIN_REJECTED;
> +          Services.telemetry.getHistogramById("WEAVE_CAN_FETCH_KEYS").add(false);

It looks like we are never adding a true for this probe, which sounds bad. I think it will be OK to add a "true" result at the top of the function where we early return and not bother with the probe below (and don't record anything if the user declines to unlock) - this means the probe should always get a true or false value once per sync login while the MP is unlocked.

@@ +762,5 @@
>        log.info("Failed to fetch the cluster URL", err);
> +      // Record all node assignment errors, including network errors. The FxA
> +      // client already records authentication errors.
> +      Services.telemetry.getHistogramById(
> +        "WEAVE_FXA_NODE_ASSIGNMENT_ERRORS").add();

We need to dig a little deeper here to ensure we aren't capturing node reassignments

::: services/sync/modules/resource.js
@@ +323,5 @@
>        this._log.debug(CommonUtils.stackTrace(ex));
>      }
>  
>      let ret     = new String(data);
> +    ret.path    = channel.URI.path;

This seems to be used only in service.js recording a 401, so I wonder if we should just get it directly from |this.infoURL|. However, I think that URL includes the uid, meaning there are a huge number of possible values. I suspect just hard-coding, say, "info" as the histogram key may give us what we want.

::: services/sync/modules/service.js
@@ +756,5 @@
>  
>          case 401:
>            this._log.warn("401: login failed.");
> +          Services.telemetry.getKeyedHistogramById(
> +            "WEAVE_STORAGE_AUTH_ERRORS").add(test.path);

see comments in resource.js

::: toolkit/components/telemetry/Histograms.json
@@ +8476,5 @@
>      "n_values": 20,
>      "releaseChannelCollection": "opt-out",
>      "description": "Reports why a graphics sanity test was run. 0=First Run, 1=App Updated, 2=Device Change, 3=Driver Change."
> +  },
> +  "FXA_UNVERIFIED_ACCOUNT_ERRORS": {

it looks like this patch isn't actually recording all these histograms? I haven't actually looked at the definitions yet though.
Attachment #8635055 - Flags: feedback?(markh) → feedback+
Whiteboard: [fxsync]
Iteration: --- → 42.3 - Aug 10
Kit, any update here? We'd like to hit 42 with this, which moves to Aurora next week. Let me know if I can help in any way.
Flags: needinfo?(kcambridge)
Thanks, Mark! I'm very sorry for the delay on my end...wrapping up a few things for Push. :-/ I have an incoming patch that adds tests and incorporates your initial round of feedback.

(In reply to Mark Hammond [:markh] from comment #9)
> We need to dig a little deeper here to ensure we aren't capturing node
> reassignments

Maybe we can drop this probe. I was originally thinking it'd be a good sanity check (if we see a lot of `WEAVE_FXA_NODE_ASSIGNMENT_ERRORS` compared to, say, `TOKENSERVER_AUTH_ERRORS` or `FXA_HAWK_ERRORS`, we'll know where to focus).

But including network errors and reassignments doesn't tell us much—apart from the fact that a lot of folks are on unreliable network, or the client state changes.

> This seems to be used only in service.js recording a 401, so I wonder if we
> should just get it directly from |this.infoURL|. However, I think that URL
> includes the uid, meaning there are a huge number of possible values. I
> suspect just hard-coding, say, "info" as the histogram key may give us what
> we want.

Ah, good point. The URL is used by `checkServerError`, too, which has quite a few callers. I added a `stripUserPrefix` helper that removes the user-specific portion from the URL...but, if we don't need to be so picky, I could just hard-code the collection name (and add a `cause` param to `checkServerError`).
Attachment #8635055 - Attachment is obsolete: true
Flags: needinfo?(kcambridge)
Attachment #8643429 - Flags: review?(markh)
Comment on attachment 8643429 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8643429 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good to me.

Vladan, I'm hoping you can give a privacy review of this, and also tell us how the dashboards for count and keyed histograms are going? https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe says "spring 2015" so maybe they are already there? If not, is there an updated ETA, or a technique we could use to reasonably extract the data in the short-ish term?

::: services/common/tokenserverclient.js
@@ +391,5 @@
>        }
>  
> +      if (response.status == 401 || response.status == 403) {
> +        Services.telemetry.getKeyedHistogramById(
> +          "TOKENSERVER_AUTH_ERRORS").add(error.cause);

given the block above it seems error.cause may be undefined. I wonder if this should be .add(error.cause || "unknown") or similar?

::: services/fxaccounts/FxAccountsClient.jsm
@@ +376,5 @@
> +            "FXA_UNVERIFIED_ACCOUNT_ERRORS").add(path);
> +        } else if (isInvalidTokenError(error)) {
> +          Services.telemetry.getKeyedHistogramById(
> +            "FXA_HAWK_ERRORS").add(path);
> +        }

I think another FXA_OTHER_ERRORS histogram might also make sense. Maybe also another for any 50X responses too?

::: services/sync/modules/browserid_identity.js
@@ +474,5 @@
>          this._updateSignedInUser(accountData);
>          // If we still can't get keys it probably means the user authenticated
>          // without unlocking the MP or cleared the saved logins, so we've now
>          // lost them - the user will need to reauth before continuing.
> +        let canFetchKeys = !!this._canFetchKeys();

this looks a little non-obvious to me - how about just |let canFetchKeys = this._canFetchKeys() ? 1 : 0;|?

::: services/sync/modules/engines.js
@@ +1201,5 @@
>  
> +    count.newFailed = this.previousFailed.reduce((count, engine) => {
> +      if (failedInPreviousSync.indexOf(engine) == -1) {
> +        count++;
> +        this._noteApplyError();

I'm not clear on this block - it looks to me like |this.previousFailed| now has the contents of |failed| - and we've already reported failures for everything in |failed|.

It's also not clear how an "apply error" is different than an "apply failure"

::: services/sync/modules/policies.js
@@ +875,5 @@
>          }
>          break;
>  
>        case 401:
> +        {

I don't think these braces are needed, but if they are, please put it on the same line as the case.

::: toolkit/components/telemetry/Histograms.json
@@ +8656,5 @@
> +  },
> +  "FXA_UNVERIFIED_ACCOUNT_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "count",

*sob* - https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe tells me that there are no dashboards for "count" or "keyed" probes but that they should appear soon.
Attachment #8643429 - Flags: review?(markh) → feedback?(vdjeric)
Sorry guys, I'll have to take a look at this tomorrow

>*sob* - https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe tells
> me that there are no dashboards for "count" or "keyed" probes but that they should appear soon.

Good news, keyed histogram & count histogram dashes are ready! You should be able to see proper visualization of count histograms now and the keyed histograms can be viewed from the "Histogram Dashboard V4" linked from the "For power users" section of telemetry.mozilla.org.

For example:

Count histograms: http://bit.ly/1W2Wx0J
Keyed histograms: http://bit.ly/1W2Wv96
Comment on attachment 8643429 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8643429 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +8655,5 @@
>      "description": "Was there an error while performing the v7 permissions DB migration?"
> +  },
> +  "FXA_UNVERIFIED_ACCOUNT_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",

this is a year in the future, let's set it ~6 months (4 versions) in the future and extend the expiry as necessary

@@ +8656,5 @@
> +  },
> +  "FXA_UNVERIFIED_ACCOUNT_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "count",

Keep in mind that count histograms currently don't report 0 values and therefore the 0 bucket is always empty in the dashboard. Do you have another histogram that tells you the number of Firefox sessions with an FxA account? You need it as the denominator to calculate % of sessions affected.

@@ +8657,5 @@
> +  "FXA_UNVERIFIED_ACCOUNT_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "count",
> +    "keyed": true,

what are the keys going to be in these histograms? could they contain sensitive or user-identifying strings?

@@ +8659,5 @@
> +    "expires_in_version": "50",
> +    "kind": "count",
> +    "keyed": true,
> +    "releaseChannelCollection": "opt-out",
> +    "description": "FxA key fetch and certificate signing errors caused by an unverified account"

document the meaning of the keys here and elsewhere

@@ +8665,5 @@
> +  "FXA_HAWK_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "count",
> +    "keyed": true,

how many unique keys are you expecting for these histograms?

@@ +8704,5 @@
> +  },
> +  "WEAVE_CAN_FETCH_KEYS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "boolean",

if you're only going to store one value in this histogram per session, consider using a flag histogram instead

@@ +8718,5 @@
> +  },
> +  "WEAVE_FXA_NODE_ASSIGNMENT_ERRORS": {
> +    "alert_emails": ["kcambridge@mozilla.com"],
> +    "expires_in_version": "50",
> +    "kind": "count",

how useful are per-session error counts? especially since you'd need to do a custom analysis to control for session length
Attachment #8643429 - Flags: feedback?(vdjeric)
Iteration: 42.3 - Aug 10 → 43.1 - Aug 24
Attachment #8643429 - Attachment is obsolete: true
Attachment #8647621 - Flags: review?(vdjeric)
Attachment #8647621 - Flags: review?(markh)
(In reply to Mark Hammond [:markh] from comment #13)
> given the block above it seems error.cause may be undefined. I wonder if
> this should be .add(error.cause || "unknown") or similar?

I added a check for `result.status` before setting `error.cause`, but I like your suggestion better.

> I think another FXA_OTHER_ERRORS histogram might also make sense. Maybe also
> another for any 50X responses too?

Good call. Added `FXA_OTHER_ERRORS` and `FXA_SERVER_ERRORS`, respectively.

> I'm not clear on this block - it looks to me like |this.previousFailed| now
> has the contents of |failed| - and we've already reported failures for
> everything in |failed|.
> 
> It's also not clear how an "apply error" is different than an "apply failure"

I'm not clear on the difference, either. It looks like `count.newFailed` is the number of incoming records that failed to apply, and `count.failed` counts repeated failures from previous syncs? The block replaces `Utils.arraySub(this.previousFailed, failedInPreviousSync)`; I inlined it because `histogram.add()` can only increment counters by one. Maybe this is a better approach, though:

    count.newFailed = Utils.arraySub(this.previousFailed, failedInPreviousSync).length;
    for (let i = 0; i < count.newFailed; i++) {
      this._noteApplyError();
    }

Or maybe we should just track new failures, and ignore repeats?
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #15)
> Keep in mind that count histograms currently don't report 0 values and
> therefore the 0 bucket is always empty in the dashboard. Do you have another
> histogram that tells you the number of Firefox sessions with an FxA account?
> You need it as the denominator to calculate % of sessions affected.

I think not recording zero values is OK. In this case, the keys are the endpoints that returned the error (for example, `account/keys`, or `certificate/sign`). I'm interested in which endpoints return this error code, and how the counts compare to one other: if `certificate/sign` returns most of the errors, we'll want to take a closer look at the signing code.

Am I misusing the count histogram for this?

> how many unique keys are you expecting for these histograms?

Updated the patch with documentation; reproducing here for posterity:

* FXA_UNVERIFIED_ACCOUNT_ERRORS: Keyed on the FxA auth server endpoint. Expecting less than five keys, but I don't have a fixed number, and this can change if FxA adds a new endpoint requiring a verified account.
* FXA_HAWK_ERRORS: Ditto; maximum 10 unique keys for now.
* FXA_OTHER_ERRORS, FXA_SERVER_ERRORS: Ditto.
* TOKENSERVER_AUTH_ERRORS: Keyed on the error cause. 4 fixed keys, plus different error types returned by the server...less than 10 total.
* WEAVE_ENGINE_APPLY_NEW_FAILURES: Keyed on the Sync engine name. There are 8 built-in engines, but Mark mentioned add-ons might provide their own Sync engines, too...maybe 15-20 keys is a reasonable estimate?
* WEAVE_ENGINE_APPLY_FAILURES, WEAVE_ENGINE_SYNC_ERRORS: Ditto.
* WEAVE_STORAGE_AUTH_ERRORS: Keyed on the Sync record name. 5 fixed keys, plus any Sync engines that return an error. 20-25 keys?

I'm wondering if some of those would be more appropriate as enums...

> if you're only going to store one value in this histogram per session,
> consider using a flag histogram instead

I think it's possible for multiple values to be stored in a single session, if we successfully fetch keys for one sync, but fail on the next. Mark, is that accurate, or do we only fetch keys once per session?

> how useful are per-session error counts? especially since you'd need to do a
> custom analysis to control for session length

Hmm, good question. I guess I'm just interested in seeing the frequency of errors for each key—if we encounter errors for a particular Sync collection more than others, or we fail to fetch keys because we don't have the necessary token.

This is how I've used statsd on the server, but those counters are persistent, not per-session. What's the right approach for telemetry?
Status: NEW → ASSIGNED
Comment on attachment 8647621 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8647621 [details] [diff] [review]:
-----------------------------------------------------------------

I don't recall the checkServerError() changes in my previous look - apologies if I missed it :( I'll leave the review of the probe definitions to vladan.

::: services/fxaccounts/FxAccountsClient.jsm
@@ +17,5 @@
>  Cu.import("resource://gre/modules/Credentials.jsm");
>  
>  const HOST = Services.prefs.getCharPref("identity.fxaccounts.auth.uri");
>  
> +const STATUS_CODE_TO_OTHER_ERRORS_LABEL = {

this seems more natural as an array used with indexOf(), but I'm not too bothered if you prefer this.

@@ +388,5 @@
> +          Services.telemetry.getKeyedHistogramById(
> +            "FXA_HAWK_ERRORS").add(path);
> +        } else if (error.code in STATUS_CODE_TO_OTHER_ERRORS_LABEL) {
> +          Services.telemetry.getKeyedHistogramById(
> +            "FXA_OTHER_ERRORS").add(path,

I think renaming this to FXA_OTHER_4XX_ERRORS or something would be clearer.

::: services/sync/modules/policies.js
@@ +595,5 @@
>        case "weave:engine:sync:error":
>          let exception = subject;  // exception thrown by engine's sync() method
>          let engine_name = data;   // engine name that threw the exception
>  
> +        this.checkServerError(exception, "engines/" + engine_name);

sadly I don't think checkServerError() is going to be very useful here - it's simply broken. It expects a http response, but here we are passing it an exception

::: services/sync/modules/service.js
@@ +557,5 @@
>      let info;
>      try {
>        info = this.resource(infoURL).get();
>      } catch (ex) {
> +      this.errorHandler.checkServerError(ex, "info/collections");

ditto here - it is being passed an exception :(

@@ +562,5 @@
>        throw ex;
>      }
>  
>      // Always check for errors; this is also where we look for X-Weave-Alert.
> +    this.errorHandler.checkServerError(info, "info/collections");

this one is actually correct - the response is being passed

@@ +600,5 @@
>        // This only applies when the server is already at version 4.
>        if (infoResponse.status != 200) {
>          this._log.warn("info/collections returned non-200 response. Failing key fetch.");
>          this.status.login = LOGIN_FAILED_SERVER_ERROR;
> +        this.errorHandler.checkServerError(infoResponse, "info/collections");

while this one is also correctly passing a response, that response has already been passed to checkServerError() above - so any 401s will be double counted.

I think we should either (a) hoist the telemetry out of checkServerError() into the callers, or (b) allow "cause" to be undefined and only record telemetry when it has a value - then, eg, this call would not pass the 2nd arg but the one above would.

Probably the former given checkServerError is already so broken - at least we will be able to rationalize about the behaviour of the telemetry
Attachment #8647621 - Flags: review?(markh) → feedback+
Comment on attachment 8647621 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8647621 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Kit Cambridge [:kitcambridge] from comment #18)
> I think not recording zero values is OK. In this case, the keys are the
> endpoints that returned the error (for example, `account/keys`, or
> `certificate/sign`). I'm interested in which endpoints return this error
> code, and how the counts compare to one other: if `certificate/sign` returns
> most of the errors, we'll want to take a closer look at the signing code.
> 
> Am I misusing the count histogram for this?

This will work. It won't be very convenient to do this via the dash just because the dash is still raw, but you'll be able to answer these questions. 

> * WEAVE_STORAGE_AUTH_ERRORS: Keyed on the Sync record name. 5 fixed keys,
> plus any Sync engines that return an error. 20-25 keys?

The backend takes a long time to respond if a user requests a keyed-histogram in the dash and as a result issues requests for *all* 100+ keys in the histogram and all the data. 20-25 is ok.

> I'm wondering if some of those would be more appropriate as enums...

If the set of possible keys is known ahead of time or if an "OTHER" bucket makes sense, it's preferable to use enums.

> This is how I've used statsd on the server, but those counters are
> persistent, not per-session. What's the right approach for telemetry?

I'm not 100% sure of your requirements, so I'll answer with a question :)

If all you're after is "how often do error conditions A,B,C happen to service X" or "relatively how common is error condition A across services X,Y,Z", then you can answer these questions through dashboards and you should avoid collecting data in such a way that you'll need to do custom analyses. Basically, avoid custom analyses if you can. They're more work.

So to figure out how to represent the data (enums vs keyed-histograms vs keyed on A vs B), just think about what the resulting dashboard will look like.

Here are some examples of dashes which will give you an idea of what kinds of questions they can answer:

enum dash: http://mzl.la/1N8wLVY
Shows absolute & relative error code distributions

Keyed Enum dash: http://mzl.la/1fczS0g
This dash shows absolute error counts & relative error code codes within each key. Comparing error distributions between keys is possible (click on key name to select different key). Note the "Advanced Settings" panel exposes sorting options which decide which keys are shown (it's always max 4 keys at a time).

Keyd Count dash: http://mzl.la/1N8wXEF
Keyed count dashes are less powerful cousins of keyed enum dashes.

----

Clearing needinfo until some of these measurement decisions are decided
Attachment #8647621 - Flags: review?(vdjeric)
Iteration: 43.1 - Aug 24 → 43.2 - Sep 7
Assignee: kcambridge → markh
Iteration: 43.2 - Sep 7 → 43.3 - Sep 21
OK, I talked to MattN, and it sounds like the telemetry dashboard additions from two weeks ago are sufficient for what we'd like to collect.

As I understood it, correlations for particular users would require custom analysis...but we could compare the frequency of the errors by looking at the two sums. I also added an `FXA_CONFIGURED` probe as the denominator.
Attachment #8647621 - Attachment is obsolete: true
Attachment #8656838 - Flags: feedback?(vdjeric)
Attachment #8656838 - Flags: feedback?(markh)
Comment on attachment 8656838 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8656838 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +9157,5 @@
> +    "expires_in_version": "45",
> +    "kind": "count",
> +    "keyed": true,
> +    "releaseChannelCollection": "opt-out",
> +    "description": "Token server errors caused by invalid BrowserID assertions. Keyed on the token server error cause."

i'm guessing all these keyed count histograms aren't enums because enums wouldn't allow you to see the distribution of "errors with cause X per session"? e.g. 80% of sessions never see cause X, 10% of sessions see cause X once, 5% see cause X twice, etc
Attachment #8656838 - Flags: feedback?(vdjeric) → feedback+
Comment on attachment 8656838 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8656838 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM - you have far more context than I do about what the dashboards can offer, so if you think my comment below is not a problem and understand what comment 22 means (I don't fully!) and think this is also OK, I think we should just go ahead and land this ASAP (but please push to try first, just in case...)

::: services/fxaccounts/FxAccounts.jsm
@@ +899,5 @@
>      let currentState = this.currentAccountState;
>      return currentState.getUserAccountData()
>        .then(data => {
> +        if (data) {
> +          Services.telemetry.getHistogramById("FXA_CONFIGURED").add(true);

this is a boolean probe but we never set it to false - so I'm wondering if it should be a flag instead?

also, this is only called once at startup - it's not counted for users who log in to this session. I'm not sure how much of a problem is, but I think we could solve it by adding the same probe in setSignedInUser() - that would end up double-counting users who sign out and sign back in, but I expect that would be rare.
Attachment #8656838 - Flags: review+
Attachment #8656838 - Flags: feedback?(markh)
Attachment #8656838 - Flags: feedback+
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #22)
> i'm guessing all these keyed count histograms aren't enums because enums
> wouldn't allow you to see the distribution of "errors with cause X per
> session"? e.g. 80% of sessions never see cause X, 10% of sessions see cause
> X once, 5% see cause X twice, etc

I *think* so. AIUI, for a keyed histogram (e.g., FxA server errors), I can divide the sum for each cause by the denominator to determine how many sessions see an error with that cause. For an enum, I can only see the percentage of sessions that encounter any kind of server error.
Changed `FXA_CONFIGURED` to a flag per comment #23. Also fixed `WEAVE_CAN_FETCH_KEYS`; looks like recording a zero value causes `FlagHistogram::Accumulate` to crash with an assertion error.

xpcshell tests pass locally. try is closed; will push once it reopens.
Attachment #8656838 - Attachment is obsolete: true
Attachment #8659605 - Flags: review?(vladan.bugzilla)
Comment on attachment 8659605 [details] [diff] [review]
0001-Bug-1124428-Add-Sync-telemetry-probes-for-FxA-relate.patch

Review of attachment 8659605 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Kit Cambridge [:kitcambridge] from comment #24)
> I *think* so. AIUI, for a keyed histogram (e.g., FxA server errors), I can
> divide the sum for each cause by the denominator to determine how many
> sessions see an error with that cause. For an enum, I can only see the
> percentage of sessions that encounter any kind of server error.

Ok, that's fine
Attachment #8659605 - Flags: review?(vladan.bugzilla) → review+
Assignee: markh → kcambridge
https://hg.mozilla.org/integration/mozilla-inbound/rev/902358f27dd2f8d8ffabfeba7a088390c13813b8
Bug 1124428 - Add telemetry probes for FxA-related authentication issues. r=markh,vladan
https://hg.mozilla.org/mozilla-central/rev/902358f27dd2
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 43
Depends on: 1233298
You need to log in before you can comment on or make changes to this bug.