Validate that Account Ecosystem pings are received as expected
Categories
(Firefox :: Firefox Accounts, task)
People
(Reporter: rfkelly, Unassigned)
Details
Once AET pings are ready per Bug 1635659, we'll need to validate that they send correctly:
- Are the right identifiers and metrics in there?
- Are they sent on the expected schedule?
See Bug 1529234 for similar validation work done on the pre-account ping (and from which I shamelessly stole the idea).
Reporter
Comment 1 • 4 years ago
Here are the broad strokes of a plan to cross-check AET data with other existing telemetry, inspired by the discussion in Bug 1529234.
Are we seeing roughly the right number of pings?
- Count the number of AET pings per day, by `reason`.
- Count the number of main pings per day where `fxa_configured: true`, by `reason`.
  - This should slightly over-count compared to the AET value, due to e.g. users with unverified accounts and due to sending the main ping under more circumstances.
  - Compare specifically the number of `reason: shutdown` pings, which should be more closely matched.
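The per-day, per-reason counts above could be sketched roughly as follows. This is a minimal illustration over in-memory records, not the actual dashboard query: the `day` key, the sample dates, and the `periodic` and `daily` reason values are assumptions made for the example.

```python
from collections import Counter


def count_pings_by_day_and_reason(pings, predicate=lambda ping: True):
    """Count pings per (day, reason) pair, keeping only pings that match predicate.

    Each ping is a dict with hypothetical "day" and "reason" keys.
    """
    return Counter(
        (ping["day"], ping["reason"]) for ping in pings if predicate(ping)
    )


# Hypothetical sample records, for illustration only.
aet_pings = [
    {"day": "2020-01-01", "reason": "shutdown"},
    {"day": "2020-01-01", "reason": "shutdown"},
    {"day": "2020-01-01", "reason": "periodic"},
]
main_pings = [
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": True},
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": True},
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": False},
    {"day": "2020-01-01", "reason": "daily", "fxa_configured": True},
]

aet_counts = count_pings_by_day_and_reason(aet_pings)
# Only main pings from clients with an account configured are comparable.
main_counts = count_pings_by_day_and_reason(
    main_pings, predicate=lambda ping: ping.get("fxa_configured") is True
)

shutdown_key = ("2020-01-01", "shutdown")
print(aet_counts[shutdown_key], main_counts[shutdown_key])
```

Comparing the two counters on the `shutdown` key alone mirrors the last bullet, since that is where the two pings are expected to match most closely.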
Are we seeing roughly the right number of unique clients?
- Count the number of unique `ecosystemClientId` values seen per day in AET.
- Compare with the number of unique `clientId` values seen per day in the main ping, where `fxa_configured: true`.
  - This should slightly over-count compared to the AET value, due to e.g. users with unverified accounts.
- Compare with the number of unique `deviceId` values seen per day in the sync ping.
  - This should slightly under-count compared to the AET value, due to non-Sync-users not reporting a `deviceId`.
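A rough sketch of the distinct-identifier count, again over in-memory records rather than the real tables. The flat record layout and the `day` key are assumptions for the example; only the identifier field names come from the plan above.

```python
from collections import defaultdict


def unique_ids_per_day(pings, id_field):
    """Return {day: number of distinct id_field values seen that day}.

    Pings that do not report id_field at all are skipped, which is what makes
    the sync-ping count under-count relative to AET.
    """
    seen = defaultdict(set)
    for ping in pings:
        value = ping.get(id_field)
        if value is not None:
            seen[ping["day"]].add(value)
    return {day: len(ids) for day, ids in seen.items()}


# Hypothetical sample records, for illustration only.
aet_pings = [
    {"day": "2020-01-01", "ecosystemClientId": "a"},
    {"day": "2020-01-01", "ecosystemClientId": "a"},
    {"day": "2020-01-01", "ecosystemClientId": "b"},
]
sync_pings = [
    {"day": "2020-01-01", "deviceId": "d1"},
    {"day": "2020-01-01"},  # a client that reports no deviceId
]

print(unique_ids_per_day(aet_pings, "ecosystemClientId"))
print(unique_ids_per_day(sync_pings, "deviceId"))
```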
Are we seeing roughly the right number of unique users?
- Count the number of unique `ecosystemUserId` values seen per day in AET.
- Compare with the number of unique `uid` values seen per day in the sync ping.
  - This should slightly under-count compared to the AET value, due to non-Sync-users not reporting a `uid`.
  - This should under-count by a similar proportion to the client count above.
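The "similar proportion" expectation can be made concrete by computing the under-count ratio for clients and for users and checking that the two are close. The sketch below assumes the daily counts are already available; the sample numbers and the 5% tolerance are arbitrary placeholders, not values from this bug.

```python
def undercount_ratio(aet_count, other_count):
    """Fraction by which other_count falls short of aet_count."""
    if aet_count <= 0:
        raise ValueError("aet_count must be positive")
    return (aet_count - other_count) / aet_count


def ratios_are_similar(ratio_a, ratio_b, tolerance=0.05):
    """True if the two ratios differ by at most tolerance (an assumed threshold)."""
    return abs(ratio_a - ratio_b) <= tolerance


# Hypothetical daily counts, for illustration only.
client_ratio = undercount_ratio(aet_count=1000, other_count=900)  # deviceId vs ecosystemClientId
user_ratio = undercount_ratio(aet_count=800, other_count=712)     # uid vs ecosystemUserId

print(client_ratio, user_ratio, ratios_are_similar(client_ratio, user_ratio))
```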
Are we seeing the right metrics in the pings?
- Count the number of AET pings that are missing `total_uri_count`, as a proportion of all AET pings.
- Count the number of main pings that are missing `total_uri_count`, as a proportion of all main pings.
  - The proportions should be ballpark the same at sufficient volume.
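The missing-metric proportion could be computed along these lines. This sketch treats a metric as missing when the key is absent or its value is `None`; that definition, and the flat record layout, are assumptions for the example.

```python
def missing_proportion(pings, field):
    """Fraction of pings where field is absent or None (0.0 for an empty input)."""
    if not pings:
        return 0.0
    missing = sum(1 for ping in pings if ping.get(field) is None)
    return missing / len(pings)


# Hypothetical sample records, for illustration only.
aet_pings = [
    {"total_uri_count": 12},
    {"total_uri_count": 0},   # zero is a reported value, not a missing one
    {},
    {"total_uri_count": None},
]
main_pings = [{"total_uri_count": 3}, {}]

print(missing_proportion(aet_pings, "total_uri_count"))
print(missing_proportion(main_pings, "total_uri_count"))
```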
Reporter
Comment 2 • 4 years ago
Also we should monitor for decryption or validation errors at ingestion, but I'll need to lean on :klukas for how to do that in production.
Comment 3 • 4 years ago
> Also we should monitor for decryption or validation errors at ingestion, but I'll need to lean on :klukas for how to do that in production.
The Decoder job that runs in Dataflow records Stackdriver metrics surrounding both decryption errors and validation errors, so those would allow for near-realtime monitoring of these errors.
We also can query error tables in BQ to do periodic counts of these different error types, which seems more in line with the type of validation you're discussing above. So I'd propose roughly:
Are we seeing errors in the pipeline?
- Count the number of documents in the `payload_bytes_error` table with AET-related document types, grouped by error type.
  - The error counts should not exceed some static threshold or some percentage of overall AET-related pings.
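As a rough illustration of the periodic check proposed above, the sketch below groups error rows by type and flags any type whose count exceeds a fraction of total AET pings. The `document_type` and `error_type` keys, the `account-ecosystem` document type string, the error type labels, and the 1% threshold are all assumptions for the example and not the actual schema of the error table.

```python
from collections import Counter


def error_counts_by_type(error_rows, aet_document_types):
    """Count error rows per error type, restricted to AET-related document types."""
    return Counter(
        row["error_type"]
        for row in error_rows
        if row["document_type"] in aet_document_types
    )


def error_types_over_threshold(error_counts, total_aet_pings, max_fraction):
    """Return the error types whose count exceeds max_fraction of all AET pings."""
    limit = max_fraction * total_aet_pings
    return {etype: n for etype, n in error_counts.items() if n > limit}


# Hypothetical sample rows, for illustration only.
error_rows = [
    {"document_type": "account-ecosystem", "error_type": "decryption"},
    {"document_type": "account-ecosystem", "error_type": "decryption"},
    {"document_type": "account-ecosystem", "error_type": "validation"},
    {"document_type": "main", "error_type": "validation"},
]

counts = error_counts_by_type(error_rows, {"account-ecosystem"})
flagged = error_types_over_threshold(counts, total_aet_pings=100, max_fraction=0.01)
print(counts, flagged)
```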
Reporter
Comment 4 • 4 years ago
> Count the number of main pings per day where `fxa_configured: true`

Note to self: in latest Firefox this is now `environment.services.account_enabled` rather than `fxa_configured`.
Reporter
Comment 5 • 4 years ago
I started a dashboard based on the above ideas here:
So far the numbers appear to be ballpark the right magnitude, with AET a little lower than expected, but possibly slowly converging to the matching values calculated from other telemetry. Let's see what they look like after a few more days.
Reporter
Comment 6 • 4 years ago
The numbers on the dashboard are still growing closer to convergence, but two quick observations that have got me feeling pretty happy about how the ping is working:
- Less than 2% discrepancy between the volume of AET pings with `reason=shutdown`, and the number of Main telemetry pings with `reason=shutdown` and `account_enabled=true`.
- Less than 2% discrepancy between the number of unique clients sending AET pings with `reason=shutdown`, and the number of unique clients sending Main telemetry pings with `reason=shutdown` and `account_enabled=true`.

Focusing on `reason=shutdown` is reasonable since this is the one event that we know should reliably trigger both an AET ping and a Main ping. Looking at all ping reasons shows a larger discrepancy (currently around 10% or so), but maybe that can be at least partially explained by scheduling differences between pings, and by users who just haven't restarted their browser in a while.
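For reference, a discrepancy figure like the ones quoted above could be computed as a relative difference against the Main telemetry value. Using the Main count as the denominator is an assumption here, since the comment does not say how the percentage was derived, and the sample counts are made up.

```python
def relative_discrepancy(aet_value, main_value):
    """Absolute difference between the two values, as a fraction of main_value."""
    if main_value <= 0:
        raise ValueError("main_value must be positive")
    return abs(main_value - aet_value) / main_value


# Hypothetical counts of reason=shutdown pings, for illustration only.
discrepancy = relative_discrepancy(aet_value=9850, main_value=10000)
print(f"{discrepancy:.1%}", discrepancy < 0.02)
```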
Reporter
Comment 7 • 4 years ago
The last few days have shown an uptick in main-telemetry pings with `reason=shutdown`, which has not corresponded to an uptick in AET pings with `reason=shutdown`. I need to dig into this a little more, but all the other metrics still seem to be tracking in the right direction.
Reporter
Comment 8 • 4 years ago
As of Bug 1661631 this code is no longer active in Firefox, so further validation work will need to wait until we pref it back on.
Reporter
Comment 9 • 4 years ago
We're no longer pursuing this approach for AET, closing out the remaining bugs.