Validate FOG vs Firefox Telemetry client_ids
Categories
(Toolkit :: Telemetry, task, P1)
Tracking
()
People
(Reporter: chutten, Assigned: chutten)
References
Details
(Whiteboard: [telemetry:fog:m?])
We've instrumented Firefox using Glean, including sending the Telemetry client_id (see bug 1674233). Now time to take a look and see what this suggests about the integrity of FOG (and Firefox Telemetry itself).
| Assignee | ||
Comment 1•5 years ago
|
||
Things to look into include:
- Are the values for the Telemetry client_id being reported in the "fog-validation" pings correct?
- Are they present as expected?
- Are there any Glean SDK Metrics API errors reported?
- How many Telemetry client_ids report multiple Glean client_ids?
- Can this be explained by opting out of Telemetry? (This would reset both the Telemetry client_id and the Glean client_id, but is there evidence of a timing issue here?)
- What are these Telemetry client_ids that have multiple Glean client_ids? Are they the canary id?
- What is the distribution? (Is it one client_id that's managed to make it to thousands of machines, or is it thousands of machines with two cloned profiles apiece?)
- Does this value change over time?
| Assignee | ||
Comment 2•5 years ago
|
||
Well, whoops. We don't record any value if the UUID is invalid because glean-core's API is type-safe (only accepts a uuid::Uuid type which must be valid) but I throw away any parsing failure here. This would manifest as a client that never reports (in all of its pings) a Telemetry client_id. I'd better file a bug about that.
| Assignee | ||
Comment 3•5 years ago
|
||
Some references to aid analysis:
- The first build instrumented with the client id was
- Nightly 20201109215349 (bug 1674233)
- Beta 20201116183642 (the first Beta of Firefox 84)
- The Telemetry client_id is in the payload at
metrics.uuid.fog_validation_legacy_telemetry_client_id
Comment 4•5 years ago
|
||
Chris, given the differences between the current Glean-like FOG implementation and RLB, should this wait a bit after RLB are being used in m-c?
| Assignee | ||
Comment 5•5 years ago
|
||
I'm not sure that the RLB will have any effect on Firefox's ability to record UUIDs and put them into custom pings which are then sent. Is there a specific concern you have in mind?
Comment 6•5 years ago
|
||
(In reply to Chris H-C :chutten from comment #5)
I'm not sure that the RLB will have any effect on Firefox's ability to record UUIDs and put them into custom pings which are then sent. Is there a specific concern you have in mind?
Yes and no. The init sequence is completely different, for example. There were also some trickeries around toggling upload enabled and sending pings.
I'd recommend waiting for RLB to be in m-c to have a reasonable validation: they are different enough in many places to warrant a new validation after a bit anyway.
| Assignee | ||
Comment 7•5 years ago
|
||
Alrighty, some new references to aid analysis:
- The first build with RLB in it was
- Nightly 20201204094005 (bug 1662868)
- Beta will get it when we 85 reaches Beta next week.
- The Telemetry client_id is still in the payload at
metrics.uuid.fog_validation_legacy_telemetry_client_id
To get a full week's data without an uptake curve in it, we're looking at Dec 14 or something to look at Nightly (2 days uptake, 7 days duration, plus a day because the 13th is a Sunday).
| Assignee | ||
Comment 8•5 years ago
|
||
Sample
- Nightly
- Builds at least as new as 20201204094005
- Pings received between [2020-12-07, 2020-12-14)
- About 43k clients
- Beta
- Builds at least as new as 20201214 (Beta 85)
- Pings received between [2020-12-21, 2020-12-28)
- About 880k clients
client_ids reporting more than one of the other kind of client_id
Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry client_id?
- About five thousandths of a percent of clients on Nightly.
- About three thousandths of a percent of clients on Beta.
Taking the Telemetry client_id as authoritative, how many clients report more than one Glean SDK client_id?
- About seven tenths of a percent of clients on Nightly.
- About one percent of clients on Beta.
Things I haven't looked at yet
c0ffeecounts- errors reported
- pings without any value at all
- counts of multiple client_ids per day (is there a pattern)
| Assignee | ||
Comment 9•5 years ago
|
||
c0ffee Counts
When upload is disabled, both the Firefox Telemetry client_id and Glean SDK client_id values become c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0. We shouldn't ever ingest a ping from either system with this as its own client_id value, but it'll be interesting to see if there's a timing window where we sometimes get it in metrics.uuid.fog_validation_legacy_telemetry_client_id with a non-c0ffee client_info.client_id.
- About two thousandths of a percent of pings in the nightly sample have a
c0ffeevalue in the Telemetry client_id - About one hundredth of a percent of pings in the beta sample have a
c0ffeevalue in the Telemetry client_id
And in neither sample do we receive a ping with a c0ffee client_info.client_id, which is a good sign that the whole Glean system is working as desired.
Error Counts
Interestingly, we were looking at "fog-validation" ping validation errors for other reasons, and found that we had a high number of ping_info.seq errors. Strangely, they were all coming from User Agents pretending they were ESR builds of Firefox (which is odd given that the UA String that the Glean SDK uses for upload starts with the word Glean). So it looks like fuzzers are having a field day, as per usual.
There were also about 80 failures on client_info.client_id in a day, which is well below the threshold of caring. So I'd say that, as far as validation errors go, "fog-validation" is pretty clean.
Pings without a Telemetry client_id at all
If it's possible to find situations where we report a c0ffee client_id, it's also possible to find situations where we fail to report any Telemetry client_id at all. In practice, though, this almost never happens:
- In the nightly sample this happens in seven hundred thousandths of a percent of the pings
- In the beta sample this happens in one ten thousandth of a percent of the pings.
Counts of mulitple client_ids per day (is there a pattern)
We might see a seasonality or increasing/decreasing trend in the number of times the same Telemetry client_id pops up attached to different Glean SDK client_ids. I've expanded the sample to look at all pings received since a week after Beta 85 was released to get a better time series look:
- There is no evidence of a pattern in the number of extra client_ids reported per day. Since this is reported as absolute figures there is volume seasonality around holidays and weekends, but that seems to line up with observed seasonality in client volumes over the same period.
This doesn't show anything anomalous in FOG validation. But what this does do is show the undercounting of DAU by Telemetry in Beta due to profile inflation to be about 8-18k over this holiday season. Considering Beta's MAU is 400-700k over the same period, an undercount of 2-3% may warrant further examination.
Conclusion
Looks like the FOG validation is going just fine, thank you very much. We can look forward to repeating the analyses on Release no earlier than February the 8th (two weeks after release, to give time for uptake and a week's data).
ni?travis to check my work
Comment 10•5 years ago
|
||
Bit of a drive-by check
(In reply to Chris H-C :chutten from comment #9)
c0ffeeCountsWhen upload is disabled, both the Firefox Telemetry client_id and Glean SDK client_id values become
c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0. We shouldn't ever ingest a ping from either system with this as its own client_id value, but it'll be interesting to see if there's a timing window where we sometimes get it inmetrics.uuid.fog_validation_legacy_telemetry_client_idwith a non-c0ffeeclient_info.client_id.
- About two thousandths of a percent of pings in the nightly sample have a
c0ffeevalue in the Telemetry client_id- About one hundredth of a percent of pings in the beta sample have a
c0ffeevalue in the Telemetry client_idAnd in neither sample do we receive a ping with a
c0ffeeclient_info.client_id, which is a good sign that the whole Glean system is working as desired.
Are we sure we're not filtering these out before they hit the tables? After all the message scrubber has code for that.
Then again I don't see that error in the ingestion errors for Desktop.
| Assignee | ||
Comment 11•5 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #10)
And in neither sample do we receive a ping with a
c0ffeeclient_info.client_id, which is a good sign that the whole Glean system is working as desired.Are we sure we're not filtering these out before they hit the tables? After all the message scrubber has code for that.
Then again I don't see that error in the ingestion errors for Desktop.
Yeah, that's what I was getting at by saying "the whole Glean system is working". Either we weren't sending them or they weren't making it into the stable tables. Either way, it's a win.
I should've looked into the error stream to figure out which it was, that was sloppy of me. Thank you for catching that and confirming that it's indeed the "we don't send it" half of that : )
Comment 12•5 years ago
|
||
This is really fascinating! I think that your validation holds up and I'm really interested about some of these outliers with 1k + clones of the telemetry-id. As a side note, knowing that Tim Smith is interested in dupes as it comes to experimentation, it might be worth sharing these results with him as well.
r+ from me!
| Assignee | ||
Comment 13•5 years ago
|
||
That sounds like not the worst reason to set a needinfo? -- Hey Tim, here's some analysis on client_id inflation as measured from FOG. Any thoughts?
| Assignee | ||
Updated•5 years ago
|
Comment 14•5 years ago
|
||
Thanks for the ni?; no special thoughts with respect to experimentation. This sounds generally consistent with the duplication we've been able to observe using other methods and it doesn't comment on the normandy_id issue that keeps me up at night.
Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry
client_id?
About five thousandths of a percent of clients on Nightly.
About three thousandths of a percent of clients on Beta.
Doesn't 30-50ppm seem a little high here? We don't expect much duplication of the Glean client_id yet. How do we imagine that clients that share a Glean client_id would acquire divergent telemetry client_id's? I think most clients won't be resetting their client_id during the observation window so the number of clients that could be affected by a reset bug is small. I wonder if this is a sign that an important fraction of them are running into problems. How many deletion request pings did we receive from that cohort?
| Assignee | ||
Comment 15•5 years ago
|
||
(In reply to Tim Smith 👨🔬 [:tdsmith] from comment #14)
Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry
client_id?
About five thousandths of a percent of clients on Nightly.
About three thousandths of a percent of clients on Beta.Doesn't 30-50ppm seem a little high here? We don't expect much duplication of the Glean client_id yet. How do we imagine that clients that share a Glean client_id would acquire divergent telemetry client_id's? I think most clients won't be resetting their client_id during the observation window so the number of clients that could be affected by a reset bug is small. I wonder if this is a sign that an important fraction of them are running into problems. How many deletion request pings did we receive from that cohort?
We received 0 "deletion-request" pings from the misbehaving cohort. Mind you, there were barely more than two dozen of them all told. Maybe a pattern will be easier to spot once 85 expands across release this week.
Now that you've drawn my attention to it, this does seem like an impossible thing that shouldn't ever happen. Not even to tens of clients out of millions. Looking at the array aggregates of the telemetry client ids, we're seeing solid chunks of pings with each of the different ids. It doesn't appear to be some off-by-one error anyplace.
It's almost as though they're getting a new client id each session. I wonder if there's an addon or something that rotates your client id every day...
If you have no client_id in the file at ProfD/datareporting/state.json (ProfD is the current profile dir) on startup, we'll generate a new one and try to save it to disk. The Glean db is saved at ProfD/datareporting/glean/ so it must be possible to save to disk under this dir otherwise the Glean client_id wouldn't be stable.
Hm. The state file with the client ids in it. We only ever try to save it once. And that's near startup which is a particularly contentious period for IO resources. ...maybe the write to disk fails?
It'd be trivial to instrument this. In the event that we can't load state.json, we can check the prefs to see if toolkit.telemetry.cachedClientID has a value and report it. And even if we do load state.json we could report any mismatch with the cached value. We could be finding a brand new reliability problem with Firefox Telemetry thanks to Glean.
Description
•