Closed Bug 1635667 Opened 5 years ago Closed 4 years ago

Validate that Account Ecosystem pings are received as expected

Categories

(Firefox :: Firefox Accounts, task)

RESOLVED WONTFIX

People

(Reporter: rfkelly, Unassigned)

Details

Once AET pings are ready per Bug 1635659, we'll need to validate that they send correctly:

  • Are the right identifiers and metrics in there?
  • Are they sent on the expected schedule?

See Bug 1529234 for similar validation work done on the pre-account ping (and from which I shamelessly stole the idea).

Here are the broad strokes of a plan to cross-check AET data with other existing telemetry, inspired by the discussion in Bug 1529234.

Are we seeing roughly the right number of pings?

  • Count the number of AET pings per day, by reason.
  • Count the number of main pings per day where fxa_configured: true, by reason.
    • This should slightly over-count compared to the AET value, e.g. due to users with unverified accounts and because the main ping is sent under more circumstances.
    • Compare specifically the number of reason: shutdown pings, which should be more closely matched.
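
As a rough illustration of the ping-count comparison above, here is a minimal Python sketch. It runs on made-up in-memory rows rather than real telemetry; the row layout, the dates, and the "periodic" AET reason are assumptions for illustration, not the actual schema.

```python
from collections import Counter

# Made-up rows standing in for query results. The field names and the
# "periodic" reason are assumptions for illustration only.
aet_pings = [
    {"day": "2020-01-01", "reason": "shutdown"},
    {"day": "2020-01-01", "reason": "shutdown"},
    {"day": "2020-01-01", "reason": "periodic"},
]
main_pings = [
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": True},
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": True},
    {"day": "2020-01-01", "reason": "shutdown", "fxa_configured": False},
    {"day": "2020-01-01", "reason": "daily", "fxa_configured": True},
    {"day": "2020-01-01", "reason": "environment-change", "fxa_configured": True},
]


def count_by_day_and_reason(pings):
    """Return a Counter keyed by (day, reason)."""
    return Counter((p["day"], p["reason"]) for p in pings)


aet_counts = count_by_day_and_reason(aet_pings)
# Only main pings from clients with an account configured are comparable.
main_counts = count_by_day_and_reason(
    p for p in main_pings if p["fxa_configured"]
)

# Print the two counts side by side; a missing key counts as zero.
for key in sorted(set(aet_counts) | set(main_counts)):
    print(key, "AET:", aet_counts[key], "main:", main_counts[key])
```

In this toy data the main ping over-counts overall, while the shutdown counts match, which is the pattern the plan expects.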

Are we seeing roughly the right number of unique clients?

  • Count the number of unique ecosystemClientId values seen per day in AET.
  • Compare with the number of unique clientId values seen per day in the main ping, where fxa_configured: true.
    • This should slightly over-count compared to the AET value, due to e.g. users with unverified accounts.
  • Compare with the number of unique deviceId values seen per day in the sync ping.
    • This should slightly under-count compared to the AET value, due to non-Sync-users not reporting a deviceId.
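
The unique-client comparison could be sketched along these lines. Again this is only an illustration on made-up rows; the flat row layout and the helper name `unique_ids_per_day` are hypothetical, and in practice the counting would happen in a query against the real tables.

```python
from collections import defaultdict


def unique_ids_per_day(rows, id_field):
    """Count distinct values of id_field per day, skipping rows without one."""
    seen = defaultdict(set)
    for row in rows:
        value = row.get(id_field)
        if value:  # rows that do not report this identifier are ignored
            seen[row["day"]].add(value)
    return {day: len(ids) for day, ids in seen.items()}


# Made-up rows; identifiers and layout are assumptions for illustration.
aet_rows = [
    {"day": "2020-01-01", "ecosystemClientId": "a"},
    {"day": "2020-01-01", "ecosystemClientId": "a"},
    {"day": "2020-01-01", "ecosystemClientId": "b"},
    {"day": "2020-01-01", "ecosystemClientId": "c"},
]
main_rows = [
    {"day": "2020-01-01", "clientId": "m1", "fxa_configured": True},
    {"day": "2020-01-01", "clientId": "m2", "fxa_configured": True},
    {"day": "2020-01-01", "clientId": "m3", "fxa_configured": True},
    {"day": "2020-01-01", "clientId": "m4", "fxa_configured": True},
    {"day": "2020-01-01", "clientId": "m5", "fxa_configured": False},
]
sync_rows = [
    {"day": "2020-01-01", "deviceId": "d1"},
    {"day": "2020-01-01", "deviceId": "d2"},
    {"day": "2020-01-01", "deviceId": "d1"},
]

aet_clients = unique_ids_per_day(aet_rows, "ecosystemClientId")
main_clients = unique_ids_per_day(
    [r for r in main_rows if r["fxa_configured"]], "clientId"
)
sync_clients = unique_ids_per_day(sync_rows, "deviceId")

day = "2020-01-01"
print(sync_clients[day], "<=", aet_clients[day], "<=", main_clients[day])
```

The expected ordering is sync below AET below main, since the sync ping under-counts and the main ping over-counts.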

Are we seeing roughly the right number of unique users?

  • Count the number of unique ecosystemUserId values seen per day in AET.
  • Compare with the number of unique uid values seen per day in the sync ping.
    • This should slightly under-count compared to the AET value, due to non-Sync-users not reporting a uid.
    • This should under-count by a similar proportion to the client count above.
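
The "similar proportion" check above amounts to comparing two ratios. A minimal sketch, with made-up daily counts and an arbitrary tolerance (both are assumptions, not measured values):

```python
import math


def undercount_ratio(sync_count, aet_count):
    """Fraction of the AET count that is also visible in the sync ping."""
    if aet_count <= 0:
        raise ValueError("aet_count must be positive")
    return sync_count / aet_count


# Made-up daily counts for illustration only.
client_ratio = undercount_ratio(sync_count=800, aet_count=1000)  # deviceId vs ecosystemClientId
user_ratio = undercount_ratio(sync_count=640, aet_count=800)     # uid vs ecosystemUserId

# The 0.05 tolerance is an arbitrary choice for this sketch.
proportions_similar = math.isclose(client_ratio, user_ratio, abs_tol=0.05)
print(client_ratio, user_ratio, proportions_similar)
```

If the two ratios drift apart, that would suggest the client and user identifiers are not being under-counted for the same reason.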

Are we seeing the right metrics in the pings?

  • Count the number of AET pings that are missing total_uri_count, as a proportion of all AET pings.
  • Count the number of main pings that are missing total_uri_count, as a proportion of all main pings.
    • The proportions should be ballpark the same at sufficient volume.
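
The missing-metric comparison can be sketched as below. The flat ping dictionaries are a simplification I'm assuming for illustration; real pings nest their metrics more deeply.

```python
def missing_proportion(pings, field):
    """Return the fraction of pings where field is absent or None."""
    if not pings:
        raise ValueError("need at least one ping")
    missing = sum(1 for ping in pings if ping.get(field) is None)
    return missing / len(pings)


# Made-up pings; only the presence of total_uri_count matters here.
aet_pings = [
    {"total_uri_count": 12},
    {"total_uri_count": 3},
    {"total_uri_count": 0},  # zero is a reported value, not a missing one
    {},
]
main_pings = [{"total_uri_count": 5}] * 6 + [{}, {"total_uri_count": None}]

aet_missing = missing_proportion(aet_pings, "total_uri_count")
main_missing = missing_proportion(main_pings, "total_uri_count")
print(f"AET missing: {aet_missing:.2%}, main missing: {main_missing:.2%}")
```

Note that a value of zero is treated as present, so only absent or null values count towards the proportion.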

Also we should monitor for decryption or validation errors at ingestion, but I'll need to lean on :klukas for how to do that in production.

The Decoder job that runs in Dataflow records Stackdriver metrics surrounding both decryption errors and validation errors, so those would allow for near-realtime monitoring of these errors.

We also can query error tables in BQ to do periodic counts of these different error types, which seems more in line with the type of validation you're discussing above. So I'd propose roughly:

Are we seeing errors in the pipeline?

  • Count the number of documents in the payload_bytes_error table with AET-related document types, grouped by error type
    • The error counts should not exceed some static threshold or some percentage of overall AET-related pings
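
A periodic check along those lines might look like the following sketch. It works on made-up rows rather than the real payload_bytes_error table; the `error_type` field name, the error labels, and both limits are assumptions for illustration. Here an error type is flagged if it exceeds either limit, which is one possible reading of the threshold rule.

```python
from collections import Counter


def flag_error_types(error_rows, total_pings, max_count, max_fraction):
    """Return {error_type: count} for types exceeding either limit."""
    counts = Counter(row["error_type"] for row in error_rows)
    flagged = {}
    for error_type, count in counts.items():
        # With no successful pings at all, any error is treated as over the limit.
        fraction = count / total_pings if total_pings else float("inf")
        if count > max_count or fraction > max_fraction:
            flagged[error_type] = count
    return flagged


# Made-up error rows; the labels are hypothetical.
error_rows = [
    {"error_type": "decryption"},
    {"error_type": "decryption"},
    {"error_type": "decryption"},
    {"error_type": "validation"},
]

flagged = flag_error_types(
    error_rows, total_pings=1000, max_count=2, max_fraction=0.01
)
print(flagged)
```
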
Depends on: 1658242

> Count the number of main pings per day where fxa_configured: true

Note to self: in the latest Firefox this is now environment.services.account_enabled rather than fxa_configured.

I started a dashboard based on the above ideas here:

So far the numbers appear to be ballpark the right magnitude, with AET a little lower than expected, but possibly slowly converging to the matching values calculated from other telemetry. Let's see what they look like after a few more days.

The numbers on the dashboard are still growing closer to convergence, but here are two quick observations that have me feeling pretty happy about how the ping is working:

  • Less than 2% discrepancy between the volume of AET pings with reason=shutdown, and the number of Main telemetry pings with reason=shutdown and account_enabled=true.
  • Less than 2% discrepancy between the number of unique clients sending AET pings with reason=shutdown, and the number of unique clients sending Main telemetry pings with reason=shutdown and account_enabled=true.

Focusing on reason=shutdown is reasonable since this is the one event that we know should reliably trigger both an AET ping and a Main ping. Looking at all ping reasons shows a larger discrepancy (currently around 10% or so), but maybe that can be at least partially explained by scheduling differences between pings, and by users who just haven't restarted their browser in a while.
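
For reference, the discrepancy figures quoted here are relative differences. A minimal sketch of one way to compute them, using made-up counts and taking the main-ping count as the baseline (an assumption; the dashboard may define it differently):

```python
def relative_discrepancy(aet_count, main_count):
    """Absolute difference between the counts, as a fraction of main_count."""
    if main_count <= 0:
        raise ValueError("main_count must be positive")
    return abs(main_count - aet_count) / main_count


# Made-up counts for illustration only.
shutdown_gap = relative_discrepancy(aet_count=9850, main_count=10000)
all_reasons_gap = relative_discrepancy(aet_count=9000, main_count=10000)

print(f"shutdown: {shutdown_gap:.1%}, all reasons: {all_reasons_gap:.1%}")
```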

Blocks: 1659895

The last few days have shown an uptick in main-telemetry pings with reason=shutdown, which has not corresponded to an uptick in AET pings with reason=shutdown. I need to dig into this a little more, but all the other metrics still seem to be tracking in the right direction.

As of Bug 1661631 this code is no longer active in Firefox, so further validation work will need to wait until we pref it back on.

We're no longer pursuing this approach for AET, closing out the remaining bugs.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX