Closed Bug 1678146 Opened 5 years ago Closed 5 years ago

Validate FOG vs Firefox Telemetry client_ids

Categories

(Toolkit :: Telemetry, task, P1)

task

Tracking

()

RESOLVED FIXED

People

(Reporter: chutten, Assigned: chutten)

References

Details

(Whiteboard: [telemetry:fog:m?])

We've instrumented Firefox using Glean, including sending the Telemetry client_id (see bug 1674233). Now time to take a look and see what this suggests about the integrity of FOG (and Firefox Telemetry itself).

Things to look into include:

  • Are the values for the Telemetry client_id being reported in the "fog-validation" pings correct?
    • Are they present as expected?
    • Are there any Glean SDK Metrics API errors reported?
  • How many Telemetry client_ids report multiple Glean client_ids?
    • Can this be explained by opting out of Telemetry? (This would reset both the Telemetry client_id and the Glean client_id, but is there evidence of a timing issue here?)
    • What are these Telemetry client_ids that have multiple Glean client_ids? Are they the canary id?
    • What is the distribution? (Is it one client_id that's managed to make it to thousands of machines, or is it thousands of machines with two cloned profiles apiece?)
    • Does this value change over time?

Well, whoops. We don't record any value if the UUID is invalid because glean-core's API is type-safe (only accepts a uuid::Uuid type which must be valid) but I throw away any parsing failure here. This would manifest as a client that never reports (in all of its pings) a Telemetry client_id. I'd better file a bug about that.

See Also: → 1678151

Some references to aid analysis:

  • The first build instrumented with the client id was
    • Nightly 20201109215349 (bug 1674233)
    • Beta 20201116183642 (the first Beta of Firefox 84)
  • The Telemetry client_id is in the payload at metrics.uuid.fog_validation_legacy_telemetry_client_id

Chris, given the differences between the current Glean-like FOG implementation and RLB, should this wait a bit after RLB are being used in m-c?

Flags: needinfo?(chutten)

I'm not sure that the RLB will have any effect on Firefox's ability to record UUIDs and put them into custom pings which are then sent. Is there a specific concern you have in mind?

Flags: needinfo?(chutten) → needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #5)

I'm not sure that the RLB will have any effect on Firefox's ability to record UUIDs and put them into custom pings which are then sent. Is there a specific concern you have in mind?

Yes and no. The init sequence is completely different, for example. There were also some trickeries around toggling upload enabled and sending pings.

I'd recommend waiting for RLB to be in m-c to have a reasonable validation: they are different enough in many places to warrant a new validation after a bit anyway.

Flags: needinfo?(alessio.placitelli)

Alrighty, some new references to aid analysis:

  • The first build with RLB in it was
    • Nightly 20201204094005 (bug 1662868)
    • Beta will get it when we 85 reaches Beta next week.
  • The Telemetry client_id is still in the payload at metrics.uuid.fog_validation_legacy_telemetry_client_id

To get a full week's data without an uptake curve in it, we're looking at Dec 14 or something to look at Nightly (2 days uptake, 7 days duration, plus a day because the 13th is a Sunday).

Sample

  • Nightly
    • Builds at least as new as 20201204094005
    • Pings received between [2020-12-07, 2020-12-14)
    • About 43k clients
  • Beta
    • Builds at least as new as 20201214 (Beta 85)
    • Pings received between [2020-12-21, 2020-12-28)
    • About 880k clients

client_ids reporting more than one of the other kind of client_id

Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry client_id?

Taking the Telemetry client_id as authoritative, how many clients report more than one Glean SDK client_id?

Things I haven't looked at yet

  • c0ffee counts
  • errors reported
  • pings without any value at all
  • counts of multiple client_ids per day (is there a pattern)
Status: NEW → ASSIGNED

c0ffee Counts

When upload is disabled, both the Firefox Telemetry client_id and Glean SDK client_id values become c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0. We shouldn't ever ingest a ping from either system with this as its own client_id value, but it'll be interesting to see if there's a timing window where we sometimes get it in metrics.uuid.fog_validation_legacy_telemetry_client_id with a non-c0ffee client_info.client_id.

And in neither sample do we receive a ping with a c0ffee client_info.client_id, which is a good sign that the whole Glean system is working as desired.

Error Counts

Interestingly, we were looking at "fog-validation" ping validation errors for other reasons, and found that we had a high number of ping_info.seq errors. Strangely, they were all coming from User Agents pretending they were ESR builds of Firefox (which is odd given that the UA String that the Glean SDK uses for upload starts with the word Glean). So it looks like fuzzers are having a field day, as per usual.

There were also about 80 failures on client_info.client_id in a day, which is well below the threshold of caring. So I'd say that, as far as validation errors go, "fog-validation" is pretty clean.

Pings without a Telemetry client_id at all

If it's possible to find situations where we report a c0ffee client_id, it's also possible to find situations where we fail to report any Telemetry client_id at all. In practice, though, this almost never happens:

Counts of mulitple client_ids per day (is there a pattern)

We might see a seasonality or increasing/decreasing trend in the number of times the same Telemetry client_id pops up attached to different Glean SDK client_ids. I've expanded the sample to look at all pings received since a week after Beta 85 was released to get a better time series look:

This doesn't show anything anomalous in FOG validation. But what this does do is show the undercounting of DAU by Telemetry in Beta due to profile inflation to be about 8-18k over this holiday season. Considering Beta's MAU is 400-700k over the same period, an undercount of 2-3% may warrant further examination.

Conclusion

Looks like the FOG validation is going just fine, thank you very much. We can look forward to repeating the analyses on Release no earlier than February the 8th (two weeks after release, to give time for uptake and a week's data).

ni?travis to check my work

Flags: needinfo?(tlong)
See Also: → 1684980

Bit of a drive-by check

(In reply to Chris H-C :chutten from comment #9)

c0ffee Counts

When upload is disabled, both the Firefox Telemetry client_id and Glean SDK client_id values become c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0. We shouldn't ever ingest a ping from either system with this as its own client_id value, but it'll be interesting to see if there's a timing window where we sometimes get it in metrics.uuid.fog_validation_legacy_telemetry_client_id with a non-c0ffee client_info.client_id.

And in neither sample do we receive a ping with a c0ffee client_info.client_id, which is a good sign that the whole Glean system is working as desired.

Are we sure we're not filtering these out before they hit the tables? After all the message scrubber has code for that.
Then again I don't see that error in the ingestion errors for Desktop.

(In reply to Jan-Erik Rediger [:janerik] from comment #10)

And in neither sample do we receive a ping with a c0ffee client_info.client_id, which is a good sign that the whole Glean system is working as desired.

Are we sure we're not filtering these out before they hit the tables? After all the message scrubber has code for that.
Then again I don't see that error in the ingestion errors for Desktop.

Yeah, that's what I was getting at by saying "the whole Glean system is working". Either we weren't sending them or they weren't making it into the stable tables. Either way, it's a win.

I should've looked into the error stream to figure out which it was, that was sloppy of me. Thank you for catching that and confirming that it's indeed the "we don't send it" half of that : )

This is really fascinating! I think that your validation holds up and I'm really interested about some of these outliers with 1k + clones of the telemetry-id. As a side note, knowing that Tim Smith is interested in dupes as it comes to experimentation, it might be worth sharing these results with him as well.

r+ from me!

Flags: needinfo?(tlong)

That sounds like not the worst reason to set a needinfo? -- Hey Tim, here's some analysis on client_id inflation as measured from FOG. Any thoughts?

Flags: needinfo?(tdsmith)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Thanks for the ni?; no special thoughts with respect to experimentation. This sounds generally consistent with the duplication we've been able to observe using other methods and it doesn't comment on the normandy_id issue that keeps me up at night.

Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry client_id?
About five thousandths of a percent of clients on Nightly.
About three thousandths of a percent of clients on Beta.

Doesn't 30-50ppm seem a little high here? We don't expect much duplication of the Glean client_id yet. How do we imagine that clients that share a Glean client_id would acquire divergent telemetry client_id's? I think most clients won't be resetting their client_id during the observation window so the number of clients that could be affected by a reset bug is small. I wonder if this is a sign that an important fraction of them are running into problems. How many deletion request pings did we receive from that cohort?

Flags: needinfo?(tdsmith)

(In reply to Tim Smith 👨‍🔬 [:tdsmith] from comment #14)

Taking the FOG Glean SDK client_id as authoritative, how many clients report more than one Telemetry client_id?
About five thousandths of a percent of clients on Nightly.
About three thousandths of a percent of clients on Beta.

Doesn't 30-50ppm seem a little high here? We don't expect much duplication of the Glean client_id yet. How do we imagine that clients that share a Glean client_id would acquire divergent telemetry client_id's? I think most clients won't be resetting their client_id during the observation window so the number of clients that could be affected by a reset bug is small. I wonder if this is a sign that an important fraction of them are running into problems. How many deletion request pings did we receive from that cohort?

We received 0 "deletion-request" pings from the misbehaving cohort. Mind you, there were barely more than two dozen of them all told. Maybe a pattern will be easier to spot once 85 expands across release this week.

Now that you've drawn my attention to it, this does seem like an impossible thing that shouldn't ever happen. Not even to tens of clients out of millions. Looking at the array aggregates of the telemetry client ids, we're seeing solid chunks of pings with each of the different ids. It doesn't appear to be some off-by-one error anyplace.

It's almost as though they're getting a new client id each session. I wonder if there's an addon or something that rotates your client id every day...

If you have no client_id in the file at ProfD/datareporting/state.json (ProfD is the current profile dir) on startup, we'll generate a new one and try to save it to disk. The Glean db is saved at ProfD/datareporting/glean/ so it must be possible to save to disk under this dir otherwise the Glean client_id wouldn't be stable.

Hm. The state file with the client ids in it. We only ever try to save it once. And that's near startup which is a particularly contentious period for IO resources. ...maybe the write to disk fails?

It'd be trivial to instrument this. In the event that we can't load state.json, we can check the prefs to see if toolkit.telemetry.cachedClientID has a value and report it. And even if we do load state.json we could report any mismatch with the cached value. We could be finding a brand new reliability problem with Firefox Telemetry thanks to Glean.

See Also: → 1689438
You need to log in before you can comment on or make changes to this bug.