Closed Bug 1375635 Opened 7 years ago Closed 7 years ago

Add telemetry to learn which users are in yellow states

Categories

(Firefox :: Sync, enhancement, P1)

55 Branch
enhancement

RESOLVED FIXED
Firefox 57
Tracking Status
firefox57 --- fixed

People

(Reporter: rfeeley, Assigned: markh)

Attachments

(2 files)

We currently lack an understanding of how many users are in the "Verify your account" and "Reconnect to Sync" states. These are bad states for users to be in for a long time, as they are the Firefox version of car alarms going off in the distance. No one likes that. We should know more about this. Thoughts?
We probably can't reasonably add this to the sync ping, but we could add it to the main ping (eg, a histogram) without too much trouble. I guess an obvious question though is what we will do with this information?
I think that with this information, we would know how many people are in that state so we can ensure we're doing a better job of getting users out of that state.

rfk: Is this something we could detect on the FxA side?
Flags: needinfo?(rfkelly)
(In reply to Alex Davis [:adavis] [PM FxA+Sync] from comment #2)
> I think that with this information, we would know how many people are in
> that state so we can ensure we're doing a better job of getting users out of
> that state.

Do you mean we might be putting users into this state when they didn't do anything that would require it (ie, a bug)? Or is the concern that people (say) change a password, then don't notice the prompt to sign in on the hamburger menu, so we should make the UI more intrusive?

I guess I'm still wondering what a plain number will give you, and whether we want something subtly different/additional (eg, average time in that state or similar). I believe that some idea of what mitigations are planned will help us define what data we want to collect.
>Or is the concern that people (say) change a password, then don't notice the prompt to sign in on the hamburger menu, so we should make the UI more intrusive?

Yes, this one. 

I'd like to know how many users remain in limbo. While I can estimate the new volume on a daily basis by looking at our login and registration funnels, I don't know the running total of users who remain unconfirmed or who are stuck in this state after a password reset.

If this number happens to be pretty high, it could indicate that our UI is not obvious enough to users. If it's pretty low, then there might not be any additional work required (or it would certainly be lower priority).

Tying it to FxA metrics might be interesting too since we could *potentially* know what actions brought them to this state. But to do that... it might not be immediately possible depending on what device_id we need to use. Good news is that it might change in the near future but let's not wait for that.

While I'd love to have all of this, I don't think we need any of that data in the near future to fix the user states Ryan is talking about. We already know there are problems in the product around handling user states.

To go back to Feeley's initial question in comment #1
> We currently lack an understanding of how many users are in the "Verify your account" and "Reconnect to Sync" states. These are bad states for users to be in for a long time as they are Firefox version of car alarms going off in the distance. No one likes that. We should know more about this. Thoughts?

I agree that these are bad states to be in. We should continue to improve covering all the user states across all major touch points and make sure that users are told appropriate next steps to confirm account/login. We started documenting problems during Sync Fest, let's continue that work and then get to fixing them.
> rfk: Is this something we could detect on the FxA side?

Not well - we could try to guess it based on server-side token access patterns but it'd be better to measure it directly in the client.

> I guess an obvious question though is what we will do with this information?

Broadly, I'd like to know how many users are in the yellow "reconnect" state as a percentage of users in the successfully-syncing state, and I'd like to know how long they stay in that state.  We have some ideas for helping users get out of this state by evolving things on both server and client, but we need telemetry to help us know whether and how to prioritize that work.

As one quick example, we could explicitly track "disconnected" devices on the server side, and send the user an email if they have a device that's been disconnected for a week, asking them to either reconnect or delete it.
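As a rough illustration of that idea, a server-side check could look something like the sketch below. The device record shape (`id`, `disconnectedSinceMs`) and the function name are hypothetical and not taken from any FxA API; this only shows the "disconnected for a week" filter.

```javascript
// Hypothetical sketch: pick the devices that have been disconnected for at
// least a week, so the user could be emailed about them.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function devicesNeedingReminder(devices, nowMs) {
  return devices.filter(
    // Devices without a numeric timestamp are treated as still connected.
    d =>
      typeof d.disconnectedSinceMs === "number" &&
      nowMs - d.disconnectedSinceMs >= WEEK_MS
  );
}
```

The one-week threshold comes from the comment above; everything else would depend on how the server actually tracks device state.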
Flags: needinfo?(rfkelly)
Priority: -- → P1
Assignee: nobody → markh
Alessio and Georg, I'm hoping you can offer some advice here.

For background about what we are trying to measure:

* When a user creates a Firefox Account, it initially starts as "unverified". Users must follow a link in an email to transition to "successful".
* Once successful, there are some conditions that cause the account to enter a "needs reauthentication" state - the most common is the user changing their password on a different device. The user still needs to take explicit action to transition back to "successful".

IOW, there are 3 states - "ok", plus 2 error states - "unverified" and "needs reauth"
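To make the three states and their transitions concrete, here is a minimal sketch of them as a small state table. The state strings follow the description above, but the event names (`emailLinkFollowed`, `credentialsInvalidated`, `userSignedInAgain`) and the helper functions are made up for illustration and are not the names used in the patch.

```javascript
// Hypothetical names throughout; the bug only names the three states.
const LoginState = Object.freeze({
  OK: "ok",
  UNVERIFIED: "unverified",
  NEEDS_REAUTH: "needs reauth",
});

// Events that move an account between states, as described above.
const TRANSITIONS = Object.freeze({
  [LoginState.UNVERIFIED]: { emailLinkFollowed: LoginState.OK },
  [LoginState.OK]: { credentialsInvalidated: LoginState.NEEDS_REAUTH },
  [LoginState.NEEDS_REAUTH]: { userSignedInAgain: LoginState.OK },
});

function nextState(state, event) {
  const row = TRANSITIONS[state];
  if (!row) {
    throw new Error(`invalid state ${state}`);
  }
  // Events that do not apply to the current state leave it unchanged.
  return Object.prototype.hasOwnProperty.call(row, event) ? row[event] : state;
}

function isErrorState(state) {
  return state !== LoginState.OK;
}
```

Both error states can only be left through an explicit user action, which is why the time spent in them is worth measuring.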

We want to measure:

* How many users are in these bad states over time? For example, an uptick in users in an "unverified" state might mean email delivery problems, a client bug that fails to detect verification or loses their credentials, etc.

* How long users typically remain in these states? An uptick here may point at the same problems. We also want to know how many users "never" resolve the bad state (and for the purposes of this, we define "never" as "more than one month") so we can see if we need additional mitigations (eg, further email followups, etc) and if these mitigations have the expected effect.

The artifacts we want:

* Percentage of sessions that see any of the 3 states over time - think "evolution view"

* Histograms for how long users remain in the 2 error states (ie, how long it takes for users to resolve the state) - think normal histograms. We expect to see the majority of users resolve the state "soon" but with a long-tail of users who "never" resolve it.
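As a sketch of what those two artifacts boil down to once the data exists, the functions below compute them from already-collected samples. The input shapes (`statesSeen` per session, a flat list of durations in minutes) are assumptions made for this example, and "one month" is assumed to mean 30 days.

```javascript
// Assumption: "one month" is treated as 30 days, expressed in minutes.
const MINUTES_PER_MONTH = 30 * 24 * 60;

// Percentage of sessions that saw `state` at least once.
function percentSessionsSeeing(sessions, state) {
  if (sessions.length === 0) {
    return 0;
  }
  const hits = sessions.filter(s => s.statesSeen.includes(state)).length;
  return (100 * hits) / sessions.length;
}

// Percentage of bad-state durations (in minutes) that exceed one month,
// ie, the users who "never" resolve the state.
function percentNeverResolved(minutesInBadState) {
  if (minutesInBadState.length === 0) {
    return 0;
  }
  const never = minutesInBadState.filter(m => m > MINUTES_PER_MONTH).length;
  return (100 * never) / minutesInBadState.length;
}
```

Computing the first function per day or per build would give the "evolution view"; the second summarizes the long tail of the duration histogram.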

Re implementation, I see 2 strategies:

* Use 3x "flag" histograms for the states, and an "exponential" histogram for the time period. The key advantage here is that analysis.telemetry.mozilla.org makes visualizing this relatively easy - although bug 1386452 implies this might not actually be true.

* Use 3x "boolean" scalars for the state, and a "uint" scalar for the time period. The key disadvantage I see here is visualizations - sql.telemetry.mozilla.org doesn't seem to make it easy to show histograms, and I'd really like to avoid spending loads of time working out how to actually see the data once it is collected. I'm even somewhat unclear how I'd create an evolution view here, but that sounds easier than the histogram.

IOW, I'm struggling to see how to visualize this data in either scenario. How do you suggest I proceed?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
(In reply to Mark Hammond [:markh] from comment #6)
> Alessio and Georg, I'm hoping you can offer some advice here.

Sorry for the delay, I was on PTO (reminder to self: set the state on Bugzilla!) :)

> We want to measure:
> 
> * How many users are in these bad states over time? For example, an uptick
> in users in an "unverified" state might mean email delivery problems, a
> client bug that fails to detect verification or loses their credentials, etc.
> 
> * How long users typically remain in these states? An uptick here may point
> at the same problems. We also want to know how many users "never" resolve
> the bad state (and for the purposes of this, we define "never" as "more than
> one month") so we can see if we need additional mitigations (eg, further
> email followups, etc) and if these mitigations have the expected effect.
> 
> The artifacts we want:
> 
> * Percentage of sessions that see any of the 3 states over time - think
> "evolution view"
> 
> * Histograms for how long users remain in the 2 error states (ie, how long
> it takes for users to resolve the state) - think normal histograms. We
> expect to see the majority of users resolve the state "soon" but with a
> long-tail of users who "never" resolve it.

Nice and clear explanation!

> Re implementation, I see 2 strategies:
> 
> * Use 3x "flag" histograms for the states, and an "exponential" histogram
> for the time period. The key advantage here is that
> analysis.telemetry.mozilla.org makes visualizing this relatively easy -
> although bug 1386452 implies this might not actually be true.

Flag histograms are deprecated [1] and should not be used anymore.
However, using an exponential histogram for the time seems like a good idea.
Bug 1386452 was closed, I think it was a data-problem more than a visualization problem.

> * Use 3x "boolean" scalars for the state, and a "uint" scalar for the time
> period. The key disadvantage I see here is visualizations -
> sql.telemetry.mozilla.org doesn't seem to make it easy to show histograms,
> and I'd really like to avoid spending loads of time working out how to
> actually see the data once it is collected. I'm even somewhat unclear how
> I'd create an evolution view here, but that sounds easier than the histogram.

Let's just say we use an exponential histogram for the "time", and think of the rest.

Using 3 boolean scalars (or one boolean, keyed scalar) seems like a good idea. However, you will need to take care of the visualization on STMO/Custom Notebook if you need something more than the number of *pings* that reported the True value for each key.

Scalars are not sent in a ping unless they are *set*: if you're just setting them to True, you will not know how many times it was "not set to True" (False).

The solution here can be something like this: use 2 boolean scalars (or a keyed boolean) for the error states. If the user is not in an error state (= "ok" state), just set the boolean scalar(s) to False instead of True.
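A minimal sketch of that suggestion follows. `FakeScalarStore` is a stand-in written for this example so the snippet runs outside Firefox, and the scalar name `example.login_error_state` is hypothetical. In Firefox code the equivalent call would be `Services.telemetry.keyedScalarSet(name, key, value)`.

```javascript
// Stand-in for the Telemetry scalar storage: only scalars that were
// explicitly set end up in the payload, mirroring the behaviour above.
class FakeScalarStore {
  constructor() {
    this.values = new Map();
  }
  keyedScalarSet(name, key, value) {
    if (!this.values.has(name)) {
      this.values.set(name, {});
    }
    this.values.get(name)[key] = value;
  }
  buildPingPayload() {
    return Object.fromEntries(this.values);
  }
}

const ERROR_STATES = ["unverified", "needs reauth"];
const SCALAR_NAME = "example.login_error_state"; // hypothetical name

// Always set every key, so "ok" sessions report an explicit false rather
// than being indistinguishable from sessions that reported nothing.
function recordLoginState(store, currentState) {
  for (const errorState of ERROR_STATES) {
    store.keyedScalarSet(SCALAR_NAME, errorState, currentState === errorState);
  }
}
```

With this shape, the share of pings reporting `true` for a key can be compared against the pings reporting `false`, which is what makes a percentage meaningful.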

> IOW, I'm struggling to see how to visualize this data in either scenario.
> How do you suggest I proceed?

I'd suggest going with a boolean scalar and making sure to mark the error scalar(s) as "False" when there is no error. Then invest time in some SQL code/custom analysis to get a better visualization in case the one provided by STMO is not enough.

Hope this helps, please let me know if you have further questions/concerns.

[1] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/histograms.html#flag
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

I, for one, welcome our new data steward overlord :)

Rebecca, this patch adds 4 new telemetry probes which will help us measure how often users are "kicked out" of sync, and for how long they remain kicked out. 3 of the probes are for the user's login state (one for success, 2 for errors) and the last is the number of minutes they were not in a "success" state. The sync team will monitor these probes.
Attachment #8899683 - Flags: review?(rweiss)
(In reply to Mark Hammond [:markh] from comment #9)
> Comment on attachment 8899683 [details]
> Bug 1375635 - Add telemetry for how often and long users are in bad
> authentication states.
> 
> I, for one, welcome our new data steward overlord :)
> 
> Rebecca, this patch adds 4 new telemetry probes which will help us measure
> how often users are "kicked out" of sync, and for how long they remain
> kicked out. 3 of the probes are for the user's login state (one for success,
> 2 for errors) and the last is the number of minutes they were not in a
> "success" state. The sync team will monitor these probes.

Any reason why histograms were used instead of boolean scalars?
(In reply to Alessio Placitelli [:Dexter] from comment #10)
> Any reason why histograms were used instead of boolean scalars?

I figured that since I actually want histograms as the artifacts, it seemed sensible to use them. It wasn't clear from comment 7 that there might be a downside, but I'm happy to change the patch if there is one.
(In reply to Mark Hammond [:markh] from comment #11)
> (In reply to Alessio Placitelli [:Dexter] from comment #10)
> > Any reason why histograms were used instead of boolean scalars?
> 
> I figured that seeing I actually want histograms as the artifacts it seemed
> sensible to use them. It wasn't clear from comment 7 that there might be a
> downside, but I'm happy to change the patch if there are.

No big downside other than the transmission format: boolean histograms are not deprecated, I just wanted to make sure this was an intended change.
Chatting with Dexter on IRC, I'll change the flags to scalars - although I'll leave the review request up; changing to scalars shouldn't affect the data review, and it will not change the code in a material way. I'll get review from dexter for the scalar change.
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

https://reviewboard.mozilla.org/r/170998/#review176332

Looks fine. One nit that can be addressed by a comment, and one concern that I'm comfortable just living with. It does seem a bit weird that all the computation for this is in minutes, but it's fairly well commented and documented in names so I guess it doesn't bother me. The logic behind doing it this way also seems clear enough (finer granularity seems useless).
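For readers without the patch open, the minute-based bookkeeping can be sketched roughly as below. This is not the code from the patch: the class name, the storage key, and the use of a `Map` as a stand-in for storage that survives restarts are all assumptions made for this example.

```javascript
const MS_PER_MINUTE = 60 * 1000;
const STORAGE_KEY = "example.badStateSinceMs"; // hypothetical key

// Tracks when a bad state started in persistent storage, so the elapsed
// time survives browser restarts. `storage` needs get/set/delete.
class BadStateTimer {
  constructor(storage, now = () => Date.now()) {
    this.storage = storage;
    this.now = now;
  }

  // Returns the whole minutes spent in a bad state when it resolves,
  // otherwise null. Moving between the two error states keeps the
  // original start time.
  update(state) {
    const since = this.storage.get(STORAGE_KEY);
    if (state !== "ok") {
      if (since === undefined) {
        this.storage.set(STORAGE_KEY, this.now());
      }
      return null;
    }
    if (since === undefined) {
      return null;
    }
    this.storage.delete(STORAGE_KEY);
    return Math.floor((this.now() - since) / MS_PER_MINUTE);
  }
}
```

The returned value is what would be accumulated into the duration histogram; rounding down to whole minutes matches the point above that finer granularity is not useful here.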

::: services/sync/modules/browserid_identity.js:54
(Diff revision 1)
> +// A telemetry helper that records how long a user was in a "bad" state.
> +// It is recorded in the *main* ping, *not* the Sync ping.
> +// These bad states may persist across browser restarts, and may never change
> +// (eg, users may *never* validate)
> +this.telemetryHelper = {
> +  STATES: {

This seems a little convoluted to me. Each of these is used as a key into the telemetryHelper.HISTOGRAMS. Maybe a comment along the lines of mentioning that would help.

::: services/sync/modules/browserid_identity.js:88
(Diff revision 1)
> +  },
> +
> +  _maybeRecordLoginState(status) {
> +    let histogram = this.HISTOGRAMS[status];
> +    if (!histogram) {
> +      throw new Error(`invalid state ${status}`);

Here's hoping we notice the bad logs should we ever typo the status...
Attachment #8899683 - Flags: review?(tchiovoloni) → review+
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

https://reviewboard.mozilla.org/r/170998/#review176908

The use of Telemetry scalars looks good here, cheers!

::: toolkit/components/telemetry/Histograms.json:10859
(Diff revisions 1 - 2)
> -  },
>    "WEAVE_LOGIN_FAILED_FOR": {
>      "record_in_processes": ["main"],
>      "expires_in_version": "65",
>      "alert_emails": ["sync-dev@mozilla.org"],
>      "kind": "exponential",

Did you check with the [histogram simulator](https://telemetry.mozilla.org/histogram-simulator/#low=1&high=20000&n_buckets=100&kind=exponential&generate=normal) if this choice of parameters makes sense? It seems to be ok.
Attachment #8899683 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

https://reviewboard.mozilla.org/r/170998/#review176920

::: toolkit/components/telemetry/Histograms.json:10855
(Diff revision 2)
>      "kind": "exponential",
>      "high": 1000,
>      "n_buckets": 10,
>      "description": "The number of times a sync successfully completed in this session"
>    },
> +  "WEAVE_LOGIN_FAILED_FOR": {

This looks like it's opt-in, but the `sync_login_state_transitions` scalar is opt-out. These probes seem like they're answering a product question ("how many users think they're syncing, but aren't, and for how long?"), and line up with our OKR to increase user trust in Sync. Should the histogram be opt-out, too?
(In reply to Alessio Placitelli [:Dexter] from comment #16)
> Did you check with the [histogram
> simulator](https://telemetry.mozilla.org/histogram-simulator/
> #low=1&high=20000&n_buckets=100&kind=exponential&generate=normal) if this
> choice of parameters make sense? It seems to be ok.

No - TIL that is a thing! I agree it seems to be OK.
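For anyone who wants to eyeball the bucket layout without the simulator, the sketch below approximates how exponential bucket boundaries are laid out. It is written from memory of the bucketing approach used by the Telemetry histogram tooling and should be treated as an approximation, not a reference implementation. The parameters in the usage line are the ones from the simulator link (`low=1`, `high=20000`, `n_buckets=100`).

```javascript
// Approximate exponential bucket boundaries: bucket 0 starts at 0, bucket 1
// starts at `low`, and later boundaries grow geometrically towards `high`.
function exponentialBuckets(low, high, nBuckets) {
  const logMax = Math.log(high);
  const bounds = new Array(nBuckets).fill(0);
  let current = low;
  bounds[1] = current;
  for (let i = 2; i < nBuckets; i++) {
    const logCurrent = Math.log(current);
    // Spread the remaining log-range evenly over the remaining buckets.
    const logRatio = (logMax - logCurrent) / (nBuckets - i);
    const next = Math.floor(Math.exp(logCurrent + logRatio) + 0.5);
    // Boundaries are integers and must strictly increase.
    current = next > current ? next : current + 1;
    bounds[i] = current;
  }
  return bounds;
}

const bounds = exponentialBuckets(1, 20000, 100);
```

With these parameters the low end gets one bucket per minute, while the top of the range (20000 minutes is roughly two weeks) is covered by much wider buckets.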

(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #17)
> This looks like it's opt-in, but the `sync_login_state_transitions` scalar
> is opt-out. These probes seem like they're answering a product question
> ("how many users think they're syncing, but aren't, and for how long?"), and
> lines up with our OKR to increase user trust in Sync. Should the histogram
> be opt-out, too?

Good catch, thanks - that was an oversight. I'll push a new version with that fixed.
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

Rebecca, re-flagging you for data review. :-)
Attachment #8899683 - Flags: review?(rweiss)
:kitcambridge, can you clone this form (https://docs.google.com/document/d/1SSn5w8DfCSkHWJS8DNTd7ya82diWRxaDUFM5aL4UDDo/edit) into a text file, condense the relevant parts of this bug thread as responses to the form, and then attach it to the bug?  I will perform the review on that attachment.
Flags: needinfo?(kit)
Flags: needinfo?(kit)
Attachment #8902046 - Flags: review?(rweiss)
Comment on attachment 8899683 [details]
Bug 1375635 - Add telemetry for how often and long users are in bad authentication states.

https://reviewboard.mozilla.org/r/170998/#review180214

1) Is there documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes, this will end up in the same Telemetry docs.  More detail is in bug #1375635

2) Is there a control mechanism that allows the user to turn the data collection on and off? (Note, for data collection not needed for security purposes, Mozilla provides such a control mechanism)  
Yes, this will be following the Telemetry mechanisms.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?
Not asked for permanent data collection.

4) Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?  
This seems like it falls between Category 1 and 2, both acceptable for the requested parameters.

5) Is the data collection default-on or default-off? 
Default on for all channels and locales.

6) Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)? 
No.

7) Is the data collection covered by the existing Firefox privacy notice? If unsure: escalate to legal.
Yes

8) Does there need to be a check-in in the future to determine whether to renew the data? (Yes/No) (If yes, set a todo reminder or file a bug if appropriate)
Yes, please confirm with :gfritzsche about the process for renewing a probe.
Attachment #8899683 - Flags: review?(rweiss) → review+
Attachment #8902046 - Flags: review?(rweiss) → review+
Pushed by mhammond@skippinet.com.au:
https://hg.mozilla.org/integration/autoland/rev/bd25a5e1a355
Add telemetry for how often and long users are in bad authentication states. r=Dexter,rweiss+418169,tcsc
https://hg.mozilla.org/mozilla-central/rev/bd25a5e1a355
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
Blocks: 1397610
See Also: → 1488939
Blocks: 1578217