Open Bug 1700188 Opened 3 years ago Updated 1 year ago

Spurious client_id generation analysis (aka "late client ids")

Categories

(Toolkit :: Telemetry, task, P3)

task

Tracking

()

People

(Reporter: chutten, Unassigned)

References

(Blocks 1 open bug)

Details

In bug 1689438 we instrumented ClientID.jsm for some potential failure points, and in analysis noticed that there's a problem causing extra client id generation "late" (ie, after profile_subsession_counter == 1) for some proportion of profiles.

In bug 1697045 we tried fixing a problem where client ids couldn't save to disk, but presumably could save to Preferences. In that case we use the value in Preferences, avoiding regenerating the client_id "late".

Starting today the instrumentation reaches Firefox Release (with 87), and the remediation reaches Firefox Beta (with 88).

We need to write some SQL and plot some plots to see:

Summary: Spurious client_id generation analysis → Spurious client_id generation analysis (aka "late client ids")

See also meeting notes

Preliminary information

  • Instrumentation shipped in Nightly 87 20210202095107 and rode the trains with 87 (which hit Beta on 2021-02-22 and Release on 2021-03-22)
  • Remediation shipped in Nightly 88 20210312033752 and rides the trains with 88 (hitting Beta on 2021-03-22 and Release on 2021-04-19)
  • Some added Instrumentation shipped in Nightly 89 20210329214331. I've requested uplift to Beta 88.

Remediation effects

The effect of the remediation on Beta 88 appears minimal. Over the week+ since the remediation hit beta, we've seen some reduction of "late" client_ids from 50-80/day to 20-40/day, but it's unclear if that's due to the population still remaining on Beta 87 not yet reaching 88. Looking at it as a whole, the proportion of client_ids generated that happened late is about 4.2% which is down from the 5-6% we previously saw in Beta 87.

The bigger change might be how the number of ids generated per day is lower. Again, it's possible that the extra client_id-generating profiles haven't yet switched to 88, but even adding 87's generated client_ids into the mix shows an 11-20% drop per day.

I think this is evidence that the remediation is reducing the number of client_ids being generated (which is to be expected). What is weird is how it's having a smaller effect on the number of client_ids being generated "late". Hypothesis: the remediation is helping mostly profiles that would never have stuck around to generate a new client_id later, and somewhat fewer longer-lived profiles. Given what retention curves look like, I guess I shouldn't be surprised.

I'm disappointed that this didn't fix everything. If we're generating a client_id "late" and there's no value in Preferences... how does that even happen?

Impact on Release

Release appears to be more like Nightly than Beta, with an effect size of 1-2% of generated client_ids. Unfortunately that means some thousands (3-5k) of extra client_ids each day.

Strangely, the number of save errors is higher than (much higher than. Like 4x) the number of ids coming in late. Perhaps that reflects an overall greater difficulty saving the state.json file from single-session clients we don't hear from again? Cloned profiles again?

Note: The size of the "late" client_id problem in release means I can finally use main_1pct for analysis.

Impact of opt-in events

If someone opts back into data collection, we generate for them a new client_id. This can happen on any subsession, so our "late" client_id numbers are inflated some amount by the presence of legitimate reasons for client_ids to be generated "late".

I've finally gone through and looked into the effect of opt-ins on the numbers. They matter! On Beta, they reduce the number of "late" client_id generations per day from (Beta 87) 48-82 to 36-70, or (Beta 88) 31-46 to 24-41. You'll notice that they don't change any of the analysis or conclusions in the previous work, though. "late" client_ids still cause problems on the order of dozens per day on Beta. Thousands per day on Release.

No numbers published before this section take into account the opt-in case. Numbers from now on will.

Depends on: 1700835

Now that the remediation has made it to release, how's it look?

"late" client ids seem to have the same partial improvement as on other channels

As we saw on Nightly and Beta, the number of "late" client_ids diminished by the Remediation is... not absolute. My eyeballs suggest that the decrease is significant (statistically if in no other way), but it clearly hasn't had the effect I desired.

New Profiles measured by clients sending us pings for the first time appear inflated by around 20%

The larger-in-magnitude effect appears to be how many client_id creations the Remediation prevents. That's on the order of 20%, and does indeed result in a decrease in new profile counts as measured by "main" pings with profile_subsession_counter = 1. As measured by "new-profile" pings, new profile ping counts and client counts remain unchanged.

This appears to be visible on the New Profiles dashboard as a progressive decline in new profiles starting April 20: https://mozilla.cloud.looker.com/dashboards-next/53?Date=after%202021%2F01%2F01&Channel=&Activity%20Segment=&OS=&Attributed%20(Yes%20%2F%20No)=&Country%20Name=

So how do we fix the rest of the "late" client_ids, hm?

I'm really not sure. I haven't yet looked at the added instrumentation which reached release on 88 to see if there's anything obvious hidden in there, but I'm not hopeful it holds the answer. It appears as though the answer is "Sometimes we're able to save the session state file but not the Prefs, for no obvious reason", which doesn't lend itself to particularly strong ideas for fixes.

I think the next step for me is to go hunting for ideas in the Release data. Maybe there's something that only becomes obvious at release volumes.

If nothing comes up, the task will be to as exhaustively as possible document the behaviour of these in the literature and continue wishing for a speedy migration of Firefox Desktop collections to Glean where this problem appears to be minimized.

I compared data from the first 14 days of release 87 and 88.

  1. The ratio of generated client_ids to new profile pings was 22% lower in 88 than 87 (this was statistically significant, obviously). This confirms what chutten says above - the remediation does appear to markedly reduced the number of client_ids generated.

  2. While the remediation seems to have reduced the number of generated IDs overall, it does not seem to have reduced the number of clients reporting state_file_save_errors, in fact the absolute number of clients reporting these errors was modestly higher in 88 and even more so as a ratio to new profile pings.

  3. Although the baserates of loaded_client_id_doesnt_match_pref are low, the number of clients reporting this does seem to be significantly lower in 88, both as a ratio to the number of new profile pings and to the number of generated client_ids.

I'll continue to poke around here and post more if I find anything else interesting.

For my part, the added instrumentation fails to uncover any fundamental flaws around what's going on inside the Client ID module. It just seems as though sometimes, for reasons that aren't immediately clear, we are unable to persist the client_id after generation to either state.json or prefs. I think we "just" get to live with some small but tenacious portion of extra client_id generation.

Put another way, I can think of no client-side change to improve this nor any further instrumentation to help discover new nuances about it. Maybe in the future with a clear brain I'll think of something, but for now I'm going to unassign myself from the client-side portion of this and leave the bug to track the work around the "Somewhere on the order of 20% of new profiles were illusory" aspect of this change. (follow-up analyses/communication/reporting etc)

Assignee: chutten → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
See Also: → 1736383
See Also: → 1748813
You need to log in before you can comment on or make changes to this bug.