Closed Bug 1397610 Opened 5 years ago Closed 3 years ago

Analyze "users in yellow states" telemetry.

Categories

(Firefox :: Sync, enhancement, P3)

RESOLVED INCOMPLETE

People

(Reporter: markh, Assigned: loines)

Details

In bug 1375635 we added telemetry to record how long users are in "yellow states". Now that we've done that work, we should make sure we look at the telemetry, but we probably need to wait a while for it to ride the trains.
Alex to help Leif get a bugzilla account then swap the ni? to him :)
Flags: needinfo?(adavis)
Assignee: nobody → loines
Flags: needinfo?(adavis)
Priority: -- → P3
I had a quick look at https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-11-02&keys=__none__!__none__!__none__&max_channel_version=beta%252F57&measure=WEAVE_LOGIN_FAILED_FOR&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-09-25&table=0&trim=1&use_submission_date=0

The vast majority of users get from "login failed" to a success state after ~3 minutes. However, 11% appear to never get out of this state. Note, however, that this is only looking at beta (as it landed in 57) - release might be better, or might be worse - but at face value, :rfeeley/:adavis's speculation in bug 1375635 seems correct and implies we should do something here.
> However, 11% appear to never get out of this state.

Wow, that is more than I expected. Thanks Mark for looking into this!
This is looking *terrible* for beta 60 and later - https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2018-05-03&keys=__none__!__none__!__none__&max_channel_version=beta%252F60&measure=WEAVE_LOGIN_FAILED_FOR&min_channel_version=beta%252F59&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2018-03-11&table=0&trim=1&use_submission_date=0

It looks fine in beta 59, so I suspect bug 1435929, although I'm not yet sure if this is due to the telemetry not being recorded properly or something actually going wrong - but I spent some time both staring at the code and trying to reproduce either kind of issue here and failed.

Sadly I can't seem to get the figures for release, and nightly 60 and 61 look reasonable. 60 is now on the release channel and we haven't heard reports of obvious breakage in this area. I guess it is *possible* bug 1480335 is related, but I can't see how it would be (and no flood of reports about that either)

So I'm a bit confused. Leif, are we able to sanity check anything against FxA's server-side metrics? E.g., are we "losing" devices at a greater than expected rate since 60, or are we able to deduce somehow whether there are more devices in a bad auth state?
Something strange with the telemetry.

Beta 58 has:
 Number of days: 19
 Ping Count: 51.47k
 Sample Count: 51.73k

Beta 59 has:
 Number of days: 13
 Ping Count: 29.66k
 Sample Count: 29.85k

Whereas beta 60 has:
 Number of days: 16
 Ping Count: 15.29k
 Sample Count: 15.86k

Beta 61:
 Number of days: 15
 Ping Count: 8.88k
 Sample Count: 9.3k

Note the massive reduction each release in the ping and sample counts. I've no idea what that could mean.
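
Since the four windows cover different numbers of days, a quick normalisation helps check that the drop is not just an artefact of window length. The sketch below only re-uses the figures quoted above; the names `BETA_STATS`, `per_day_rates` and `RATES` are made up for illustration and are not part of any telemetry tooling.

```python
# Figures copied from the comment above; counts are in thousands ("k").
BETA_STATS = {
    58: {"days": 19, "pings_k": 51.47, "samples_k": 51.73},
    59: {"days": 13, "pings_k": 29.66, "samples_k": 29.85},
    60: {"days": 16, "pings_k": 15.29, "samples_k": 15.86},
    61: {"days": 15, "pings_k": 8.88, "samples_k": 9.3},
}


def per_day_rates(stats):
    """Return {version: (pings_per_day, samples_per_ping)}."""
    rates = {}
    for version, s in sorted(stats.items()):
        pings_per_day = s["pings_k"] * 1000 / s["days"]
        samples_per_ping = s["samples_k"] / s["pings_k"]
        rates[version] = (pings_per_day, samples_per_ping)
    return rates


RATES = per_day_rates(BETA_STATS)

if __name__ == "__main__":
    for version, (ppd, spp) in RATES.items():
        print(f"beta {version}: {ppd:7.0f} pings/day, {spp:.3f} samples/ping")
```

On these numbers the reduction survives the normalisation (roughly 2,700 pings per day on beta 58 down to roughly 590 on beta 61), while samples per ping creeps up slightly. That still says nothing about the cause.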
As I understand it, one way that devices can get into this state is after being required to confirm login via email. If we look at (number of successful logins after email sent) / (number of login confirmation emails sent), that ratio has hardly changed from 59 to 61 (71.7%, 71.5%, 71.2% for 59, 60, 61 respectively).

https://analytics.amplitude.com/mozilla-corp/chart/946k8d3

(Mark if you don't have an amplitude account ping me on slack and I'll send you the chart)

If we're concerned about silent disconnections, that might be harder to see. FxA counts daily active users by the certificate signed event, which (I believe?) requires users to be in a good auth state. We certainly haven't seen a disastrous change in those numbers (maybe a small summer slump uncorrelated to fx releases), but that still might not mean there's no problem.

I might be able to count the number of users whose device count went down, and how long on average it stayed down, segmented by browser version. Would that help?
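
A rough sketch of what that count could look like, assuming daily per-user snapshots of the form `(user_id, browser_version, day, device_count)`. That schema, the function name `drop_durations` and the sample rows are all hypothetical; nothing here reflects the real FxA or Amplitude data model.

```python
from collections import defaultdict
from statistics import mean


def drop_durations(snapshots):
    """Summarise device-count drops per browser version.

    snapshots: iterable of (user_id, version, day, device_count), day as int.
    Returns (durations, still_down):
      durations  - {version: [days until the count recovered, ...]}
      still_down - {version: number of users whose count never recovered}
    A drop is attributed to the version seen on the day it started.
    """
    by_user = defaultdict(list)
    for user, version, day, count in snapshots:
        by_user[user].append((day, version, count))

    durations = defaultdict(list)
    still_down = defaultdict(int)
    for rows in by_user.values():
        rows.sort()
        baseline = drop_start = drop_version = prev_count = None
        for day, version, count in rows:
            if drop_start is None:
                if prev_count is not None and count < prev_count:
                    # Count fell: remember the level it fell from.
                    baseline, drop_start, drop_version = prev_count, day, version
            elif count >= baseline:
                # Back at (or above) the pre-drop level.
                durations[drop_version].append(day - drop_start)
                drop_start = None
            prev_count = count
        if drop_start is not None:
            still_down[drop_version] += 1
    return dict(durations), dict(still_down)


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("a", 60, 1, 2), ("a", 60, 2, 1), ("a", 60, 3, 1), ("a", 60, 5, 2),
    ("b", 59, 1, 3), ("b", 59, 2, 2), ("b", 59, 3, 3),
    ("c", 60, 1, 2), ("c", 60, 2, 1),
    ("d", 60, 1, 1), ("d", 60, 2, 1),
]
DURATIONS, STILL_DOWN = drop_durations(SAMPLE)

if __name__ == "__main__":
    for version, days in sorted(DURATIONS.items()):
        print(f"version {version}: mean recovery {mean(days):.1f} days")
    print("never recovered:", STILL_DOWN)
```

Users who never recover are kept separate from the averages, since they are the population the "never get out of this state" question is about.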
loines: 
Given that a disconnected device doesn't make them an inactive user, I think we would have to look at it from Re:Dash because Amplitude doesn't do a good job at tracking individual device stats.
See Also: → 1488939
See Also: → 1523644

We killed this telemetry.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE