Closed Bug 1654891 Opened 4 years ago Closed 3 years ago

Update agent - telemetry and metrics collection

Categories

(Toolkit :: Application Update, task)


Tracking


RESOLVED FIXED

People

(Reporter: nalexander, Unassigned)

References

(Blocks 2 open bugs)

Details

This ticket tracks working through the details of telemetry and metrics collection for the background update agent.

From discussions with chutten, sguha, and mdroettboom, there are 3 main update-related telemetry points:

  1. The "update internals" of speed, successes, and failures -- send it from wherever, whenever. This is strictly internal to the project/product and the team.

  2. "update" ping with reason "ready" -- might not be all that important; there appear to be few analyses that care about this. Could be an "update internal" like 1).

  3. "update" ping with reason "success" -- happens automagically on Firefox startup when it notices that the version's changed. The BUA shouldn't matter here. This is what signals the versions that the profile updated from and to, which makes it very interesting.

From chutten: having the correct information in the "main" ping isn't something the BUA can get in the way of. Firefox reads straight from AppConstants, so unless you're lying to everyone, it has the correct information.

Now, for User Journey, we care about knowing for each client ID:

  1. when did their Firefox know of an update? (I.e., when could it potentially have updated?)

  2. when did their Firefox actually start and finish downloading the update?

  3. when did it actually update?

We want the times of these life events for each relevant client ID, since that is the information required to investigate why some population of clients is not updating, or has update cycles that end in errors.
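
A minimal sketch of what recording those life events could look like from privileged JS with Glean. The metric names and the backgroundUpdate category are hypothetical, invented for illustration; the real names would come out of data review for this work:

    // Illustrative only: hypothetical datetime metrics under a made-up
    // "backgroundUpdate" category. FOG exposes metrics to chrome JS as
    // Glean.<category>.<metric>; a datetime metric's set() with no
    // argument records "now".
    function recordUpdateKnown() {
      Glean.backgroundUpdate.updateKnown.set();        // 1. knew of an update
    }
    function recordDownloadStarted() {
      Glean.backgroundUpdate.downloadStarted.set();    // 2. started downloading
    }
    function recordDownloadFinished() {
      Glean.backgroundUpdate.downloadFinished.set();   // 2. finished downloading
    }
    // 3. "actually updated" is already covered by the existing "update" ping
    // with reason "success" on the post-update startup.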

Depends on: 1694505
Depends on: 1702052

chutten: OK, it's time for me to do this work. This will be scoped down to just the "update internals" ping that we've talked about at various times. I have a few questions; let's start with:

  1. I would like to submit detailed error reports when things go wrong. These would look like stacks, lightly massaged in the manner of the BackgroundHangMonitor. These are much larger than Glean strings currently allow -- 100 bytes.

    I could manually split my stacks into a StringList metric type, giving me 20 strings at 50 bytes each, which is still not likely to suffice in the wild. (A rough chunking sketch follows this list.)

    Doubtless there are other encodings that might be better suited, or perhaps Glean allows some free-form JSON with fewer restrictions?

  2. I would like to send via Glean existing metrics currently captured in UpdateTelemetry.jsm as various histograms, keyed histograms, and keyed scalars. Fine, I can access them via getSnapshotForHistograms and friends, but the mapping is non-obvious to say the least. Is the best approach here to hand-roll some mapping, exploiting the fact that these aren't really histograms in my use case and will all be 0/1 values?

    The only other approach I can think of is to replace the AUSTLMY layer with a Glean-aware layer. I'm not eager to do this, since it's finicky work.
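
A rough sketch of the manual chunking idea from point 1, assuming a hypothetical string_list metric capped at 20 entries of 50 bytes each, and plain-ASCII stacks so that code units and bytes coincide; anything past the cap is simply dropped:

    // Hypothetical: split an exception stack into <= 20 chunks of <= 50
    // bytes each so it fits a string_list metric. slice() counts UTF-16
    // code units, which equals bytes only for ASCII stacks.
    function recordStackChunks(stack) {
      const MAX_CHUNKS = 20;
      const CHUNK_BYTES = 50;
      for (let i = 0; i < MAX_CHUNKS; i++) {
        let chunk = stack.slice(i * CHUNK_BYTES, (i + 1) * CHUNK_BYTES);
        if (!chunk) {
          break;
        }
        // Made-up metric name; string_list metrics support add().
        Glean.backgroundUpdate.exceptionStack.add(chunk);
      }
    }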

Thoughts?

Flags: needinfo?(chutten)

(In reply to Nick Alexander :nalexander [he/him] from comment #1)

chutten: OK, it's time for me to do this work. This will be scoped down to just the "update internals" ping that we've talked about at various times. I have a few questions; let's start with:

  1. I would like to submit detailed error reports when things go wrong. These would look like stacks, lightly massaged in the manner of the BackgroundHangMonitor. These are much larger than Glean strings currently allow -- 100 bytes.

    I could manually split my stacks into a StringList metric type, giving me 20 strings at 50 bytes each, which is still not likely to suffice in the wild.

    Doubtless there are other encodings that might be better suited, or perhaps Glean allows some free-form JSON with fewer restrictions?

No free-form JSON, that's against the rules : )

Instead please follow the process to request a new metric type for your use case. ("The Process" is "click a link to prefill a bug with a template, then fill it and submit it").

  2. I would like to send via Glean existing metrics currently captured in UpdateTelemetry.jsm as various histograms, keyed histograms, and keyed scalars. Fine, I can access them via getSnapshotForHistograms and friends, but the mapping is non-obvious to say the least. Is the best approach here to hand-roll some mapping, exploiting the fact that these aren't really histograms in my use case and will all be 0/1 values?

    The only other approach I can think of is to replace the AUSTLMY layer with a Glean-aware layer. I'm not eager to do this, since it's finicky work.

Funny you should ask. I did just land the Glean Interface For Firefox Telemetry (GIFFT), which will allow you to mirror data to Telemetry for data collections that have been migrated to Glean. (Essentially, I tee the samples to both Glean and Telemetry before they reach Rust.) It does mean you'll get to migrate some existing collections to Glean after all!
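
As a hedged illustration with made-up names: a collection migrated to Glean and given a telemetry_mirror in its metrics.yaml entry only has to be recorded once from JS, and GIFFT forwards the same sample to the legacy Telemetry probe:

    // Hypothetical migrated counter whose metrics.yaml definition carries
    // telemetry_mirror pointing at the existing legacy scalar/histogram.
    // One call records to Glean and, via GIFFT, to Telemetry as well.
    Glean.backgroundUpdate.downloadComplete.add(1);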

Thoughts?

I am excited to get to work on this with you. Just lemme know what the timelines for these specific parts of your collection are, so I can be sure to apply suitable encouragement to the New Metric Type Process. I'm available to help out immediately.

Flags: needinfo?(chutten)

:mdboom just suggested an alternative for sending a stack in a ping. Maybe Socorro'd be a better fit, since it's stack-shaped and only sent in exceptional circumstances? ni?willkg for whether this is a weird idea.

Flags: needinfo?(willkg)

If it's a crash report, it could/should get sent to Socorro. If it's not a crash report, but rather some other kind of error, you could send it to Socorro.

I think you should talk with Gabriele about connecting that up or figuring out another option. Adding a needinfo for Gabriele in case he wants to add anything and/or I got something muddled.

Flags: needinfo?(willkg) → needinfo?(gsvelto)

For things that need to be debugged - and thus require context - Socorro is a good idea, because we don't only have crashes there (also hangs, IPC errors, slow shutdowns, etc.). The tricky thing is that you need to be able to capture minidumps if you want to use it - but I can help with that. Alternatively, crash pings are a good idea. Don't be scared by the name: they were born for reporting crashes, but at their core they contain stack traces and error information, so they can be used for any kind of backtrace-based data acquisition.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #5)

For things that need to be debugged - and thus require context - Socorro is a good idea, because we don't only have crashes there (also hangs, IPC errors, slow shutdowns, etc.). The tricky thing is that you need to be able to capture minidumps if you want to use it - but I can help with that. Alternatively, crash pings are a good idea. Don't be scared by the name: they were born for reporting crashes, but at their core they contain stack traces and error information, so they can be used for any kind of backtrace-based data acquisition.

There's no Firefox Telemetry in the Update Agent, though, so "crash" pings are out of reach. ...Unless we outputted the necessary extra file and had Firefox pick it up for us the next time it loaded. But that'd be weird.

(In reply to Chris H-C :chutten from comment #6)

There's no Firefox Telemetry in the Update Agent, though, so "crash" pings are out of reach. ...Unless we outputted the necessary extra file and had Firefox pick it up for us the next time it loaded. But that'd be weird.

It could use the pingsender like we do in the crashreporter client. If there is interest, the code could be factored out so that we don't duplicate it.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

(In reply to Chris H-C :chutten from comment #6)

There's no Firefox Telemetry in the Update Agent, though, so "crash" pings are out of reach. ...Unless we outputted the necessary extra file and had Firefox pick it up for us the next time it loaded. But that'd be weird.

It could use the pingsender like we do in the crashreporter client. If there is interest, the code could be factored out so that we don't duplicate it.

True. However, with the goal of an all-Glean future, I'm loath to make more accommodations for Telemetry-based solutions unless we really need them.

Depends on: 1703313

(In reply to Gabriele Svelto [:gsvelto] from comment #5)

For things that need to be debugged - and thus require context - Socorro is a good idea, because we don't only have crashes there (also hangs, IPC errors, slow shutdowns, etc.). The tricky thing is that you need to be able to capture minidumps if you want to use it - but I can help with that. Alternatively, crash pings are a good idea. Don't be scared by the name: they were born for reporting crashes, but at their core they contain stack traces and error information, so they can be used for any kind of backtrace-based data acquisition.

Mmm, interesting. Sadly it's not clear to me how to capture a minidump at the right time. That is, I'm going to catch an exception and want to report it to metrics; I don't see how to "reconstruct" a minidump at that point. A quick skim of the background hang monitor shows lots of bespoke stack-mangling code and nothing obviously minidump-y, so it doesn't look like a minidump is actually required. In any case, I'm going to leave this for now, because I'm trying to get something done in days and this is way out of scope for me.
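
For the "catch an exception and report it" shape, a sketch reusing the hypothetical metrics from the discussion in comment #1: record the error, chunk the stack, and submit a made-up "background-update" ping immediately rather than waiting on minidump machinery:

    // Inside some async task in the update agent (names are illustrative).
    try {
      await doBackgroundUpdateWork();
    } catch (e) {
      Glean.backgroundUpdate.exceptionName.set(e.name || "unknown");
      recordStackChunks(e.stack || "");        // from the sketch in comment #1
      GleanPings.backgroundUpdate.submit();    // custom ping, sent right away
    }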

Depends on: 1703318
Depends on: 1704871
No longer depends on: 1702052, 1703313

We landed enough of a Glean ping here to satisfy Milestone 1, so I'll close this out. Improvements can still block update-agent or update-agent-m2.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED