Closed Bug 1602828 Opened 3 years ago Closed 3 years ago

Write a proposal for changing Glean 'application' lifetime

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Currently, metrics with lifetime: application get stored in memory and get cleared when the application is killed or is shut down.

Next time the application starts, unless the product or the components set the metrics again before Glean has the chance to collect the ping, no data gets in the ping.

This is one of the reasons why we're seeing so many NULLs in bug 1601091.

One possible solution to this problem is to persist lifetime: application metrics to disk and clear them right after startup, before new metrics are set but after Glean sends any ping generated at startup. While this would make the NULL problem less critical, it has a bunch of edge cases that need to be carefully investigated:

  • If Glean is updated and a ping with "old data" is generated from the new version, what version of the Glean SDK and of the application should be reported?
  • When exactly should we clear application lifetime metrics?
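For illustration only, here is a minimal sketch of the ordering described above: load what the previous session persisted, assemble any startup ping, and only then clear. `AppLifetimeStore`, `startup`, the JSON file layout and the metric names are all hypothetical and are not the Glean SDK's API or storage format.

```python
import json
from pathlib import Path


class AppLifetimeStore:
    """Hypothetical store for `lifetime: application` metrics, kept as JSON on disk."""

    def __init__(self, path):
        self.path = Path(path)
        # Load whatever the previous session left on disk, if anything.
        self.metrics = json.loads(self.path.read_text()) if self.path.exists() else {}

    def set(self, name, value):
        # Persist on every write so a killed process does not lose the value.
        self.metrics[name] = value
        self.path.write_text(json.dumps(self.metrics))

    def clear(self):
        self.metrics = {}
        self.path.unlink(missing_ok=True)


def startup(store, pending_ping_data):
    """Assemble a startup ping from leftover data (if any), then clear the store."""
    pings = []
    if pending_ping_data:
        # The startup ping still sees the previous session's application metrics.
        pings.append({**store.metrics, **pending_ping_data})
    # Clear only after startup pings are assembled and before new metrics are set.
    store.clear()
    return pings
```

In this sketch the clearing point is fixed at the end of `startup`, which is exactly the choice the second bullet asks about. It does not attempt to answer which version should be reported.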

Mike, I seem to remember that we brainstormed another edge case, but I can't really remember it. Any chance you do?

Flags: needinfo?(mdroettboom)
Assignee: nobody → alessio.placitelli
Priority: -- → P1
Whiteboard: [telemetry:glean-rs:m11]

The other edge case (which I would argue is an order-of-magnitude more "edge" than these other ones, and may be below the "give a damn threshold") is:

If you disable telemetry and turn it back on, we reinstate app lifetime metrics in Glean, but we currently have no way to reinstate them from other consumers of Glean. This means that pings will be missing these metrics until the application restarts.

Flags: needinfo?(mdroettboom)

(In reply to Alessio Placitelli [:Dexter] from comment #0)

  • If Glean is updated and a ping with "old data" is generated from the new version, what version of the Glean SDK and of the application should be reported?

I am of the opinion that this should be the version that did the collecting, as the samples were recorded against the application of that version. If we want to ensure we record which version it was sent with, we should include that separately in the case that it is different.

(idle thought: will a ping with app build id 20191202 (newly updated) contain metrics that were recorded against 20191114 but were set to expire on 20191130?)

(In reply to Chris H-C :chutten from comment #2)

(In reply to Alessio Placitelli [:Dexter] from comment #0)

  • If Glean is updated and a ping with "old data" is generated from the new version, what version of the Glean SDK and of the application should be reported?

I am of the opinion that this should be the version that did the collecting, as the samples were recorded against the application of that version. If we want to ensure we record which version it was sent with, we should include that separately in the case that it is different.

Yes, that's probably correct. But I think the expectations of the "correctness" really depend on the analysis we're doing: while the data is recorded in a certain version, the ping is assembled on the new version.

I guess this could just be clarified by having an "updated_from" field that we send as part of the client_info (we could, for example, store app_* metrics to disk and check if they are changing at startup. If that's the case, then we updated... :) )
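A rough sketch of that idea, assuming the app_* values are kept in a small JSON file between runs. `detect_update`, the file name and the exact keys are made up for illustration; this is not how the Glean SDK stores client_info.

```python
import json
from pathlib import Path


def detect_update(state_file, current_app_info):
    """Return the previous app_* values if they differ from the current ones, else None.

    The file is always rewritten with the current values, so the next startup
    compares against this session.
    """
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else None
    path.write_text(json.dumps(current_app_info))
    if previous is not None and previous != current_app_info:
        # The values changed since the last run: report them as `updated_from`.
        return previous
    return None
```

The first run returns `None` because there is nothing to compare against. A later run with a different build returns the old values, which could then be attached to the ping as the hypothetical "updated_from" field.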

(idle thought: will a ping with app build id 20191202 (newly updated) contain metrics that were recorded against 20191114 but were set to expire on 20191130?)

If they were recorded in the old version and the ping is sent in the new version, yes.

Flags: needinfo?(tlong)
Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(chutten)
Flags: needinfo?(beatriz.rizental)
Blocks: 1601091

I left my initial comments in the doc.

After reading the proposal, I don't have anything more to add that others haven't already mentioned.

Flags: needinfo?(tlong)

I wonder if part of the problem we're presently having with Glean SDK lifetimes is that they're not orderable. By which I mean: user-lifetime is always at least as long as application-lifetime... but with application-lifetime and ping-lifetime either can be longer than the other.

This can be a feature. For custom pings where the ping lifetime is controlled by the component putting metrics in it, a ping-lifetime that's flexible can be really nice. (imagine short-lived pings for onboarding events or long-lived pings aggregating a week's worth of federated learning adjustments)

However, that doesn't work so well for non-custom pings where the lifetime is under the Glean SDK's control. The reason things mostly still work with our setup is that our metric types support composition/aggregation (A complete gfx.composite_time would be across the entire session a compositor is active, but since there's no way to tie that to a ping lifetime we ensure that timing distributions can be combined (stable bucket layouts ftw) and then put the burden on analysis to combine them as appropriate).
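To make the "stable bucket layouts" point concrete, a small sketch of how two partial distributions could be combined at analysis time. It assumes each distribution is a mapping from a bucket's lower bound to a sample count and that every ping uses the same bucket boundaries; `merge_distributions` is a hypothetical helper, not part of any Glean tooling.

```python
from collections import Counter


def merge_distributions(*distributions):
    """Combine bucketed distributions that share one bucket layout.

    Each distribution maps a bucket's lower bound to a sample count. Because
    the layout is stable, merging is just a per-bucket sum.
    """
    merged = Counter()
    for dist in distributions:
        merged.update(dist)  # Counter.update adds counts instead of replacing them
    return dict(merged)
```

For example, merging `{1: 2, 4: 1}` from one ping with `{4: 3, 16: 1}` from the next gives `{1: 2, 4: 4, 16: 1}`. This only works because both pings bucket samples the same way.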

(( In Firefox Desktop we of course fix this by not having lifetimes, but also by ensuring built-in pings' and metrics' lifetimes never exceed the app session's length. (i.e., you could think of Firefox having application-lifetime always being longer than ping-lifetime (because we persist only pings, not metrics)). ))

For the "gfx application-lifetime metrics aren't working as we hoped" problem case the ideal solution could be a custom ping (though how we get that through Project EXTRACT is anyone's guess D: ). For the "pings sent with the current application-lifetime metrics even if the ping-lifetime metrics are from a previous app session" problem case we'd still need some sort of application-lifetime metric persistence, if only for the special cases of things inside *_info.

But if the MPS redesign gives us an orderable sort for lifetimes within Glean-SDK-owned pings, then we might have a chance. If the "metrics" ping is never longer than an application lifetime, then we end up in a pseudo-Firefox-Desktop situation where most of the corner cases are solvable. (the setUploadEnabled case gets a little weird, but we might be able to get around that with cleverness).

What do you think, Alessio?

Flags: needinfo?(chutten) → needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #7)

I wonder if part of the problem we're presently having with Glean SDK lifetimes is that they're not orderable. By which I mean: user-lifetime is always at least as long as application-lifetime... but with application-lifetime and ping-lifetime either can be longer than the other.

Yes, your understanding of the lifetimes is correct.

User is the longest possible one.
Application lasts as long as the app process lives.
Ping lasts until the next ping is sent.
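As a toy model of these three lifetimes (a sketch only: `Lifetime`, `MetricStore` and its methods are invented names, and a single ping is assumed for simplicity):

```python
from enum import Enum


class Lifetime(Enum):
    USER = "user"
    APPLICATION = "application"
    PING = "ping"


class MetricStore:
    """Toy model: one bucket of values per lifetime, cleared on different events."""

    def __init__(self):
        self.values = {lifetime: {} for lifetime in Lifetime}

    def set(self, lifetime, name, value):
        self.values[lifetime][name] = value

    def snapshot(self):
        # What a ping assembled right now would contain.
        merged = {}
        for lifetime in Lifetime:
            merged.update(self.values[lifetime])
        return merged

    def on_ping_sent(self):
        # Ping-lifetime data lasts until the next ping is sent.
        self.values[Lifetime.PING].clear()

    def on_process_restart(self):
        # Application-lifetime data dies with the process; user and ping data survive.
        self.values[Lifetime.APPLICATION].clear()
```

If the process restarts before the ping is sent, `snapshot()` still contains the ping-lifetime value but no application-lifetime value, which is the situation behind the NULLs discussed in this bug.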

This can be a feature. For custom pings where the ping lifetime is controlled by the component putting metrics in it, a ping-lifetime that's flexible can be really nice. (imagine short-lived pings for onboarding events or long-lived pings aggregating a week's worth of federated learning adjustments)

This is indeed a feature :) For pings that are sent at a different frequency than the one given by the application process lifetime (which is usually the case on mobile! Process lifetime can be short!), this is a requirement. Otherwise consumers would need to deal with persistence themselves.

However, that doesn't work so well for non-custom pings where the lifetime is under the Glean SDK's control.

I respectfully disagree with this :-)

The reason things mostly still work with our setup is that our metric types support composition/aggregation (A complete gfx.composite_time would be across the entire session a compositor is active, but since there's no way to tie that to a ping lifetime we ensure that timing distributions can be combined (stable bucket layouts ftw) and then put the burden on analysis to combine them as appropriate).

(( In Firefox Desktop we of course fix this by not having lifetimes, but also by ensuring built-in pings' and metrics' lifetimes never exceed the app session's length. (i.e., you could think of Firefox having application-lifetime always being longer than ping-lifetime (because we persist only pings, not metrics)). ))

On Desktop we have a related/similar problem: we do have the concept of "application lifetime" metrics there and we can get, as far as I understand, pings that are lacking certain sets of metrics. The difference is that the semantics are a bit unclear and hidden under the hood: think, for example, about all the deferred messages we listen to when filling in the environment. Depending on when they get hit, we might get a partial environment in, for example, the new-profile ping or shutdown main-pings (for short sessions).

I'm not saying that's common, just saying that's possible: we've had bugs/questions about fields being missing from pings for this exact reason before!

For the "gfx application-lifetime metrics aren't working as we hoped" problem case the ideal solution could be a custom ping (though how we get that through Project EXTRACT is anyone's guess D: ). For the "pings sent with the current application-lifetime metrics even if the ping-lifetime metrics are from a previous app session" problem case we'd still need some sort of application-lifetime metric persistence, if only for the special cases of things inside *_info.

Custom pings are possible, if needed ;-) But let's keep this discussion focused on the lifetimes: from the original design doc:

application: the metric contains a property that is related to the application, and is reset on application starts. It is not reset after sending it in a ping.

This seems to be a bug, for us: we implemented a behaviour that's different than the one we spec'd around :(

The real changes would only revolve around startup, for pings that get assembled by Glean during its init. If Fenix inits GV in a deferred way/later, then I'm afraid this is really a problem with Fenix/A-C/GV that needs to be solved there, not in the Glean SDK. But I think that's easily solvable as well: Fenix could somehow trigger GV telemetry to always be set when it starts.

But if the MPS redesign gives us an orderable sort for lifetimes within Glean-SDK-owned pings, then we might have a chance. If the "metrics" ping is never longer than an application lifetime, then we end up in a pseudo-Firefox-Desktop situation where most of the corner cases are solvable. (the setUploadEnabled case gets a little weird, but we might be able to get around that with cleverness).

I'm afraid that won't be the case: the MPS won't really change the way lifetimes are defined. It addresses a different set of problems (force-close resilience among them!).

Flags: needinfo?(u566121)
Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
Blocks: 1604862

The proposal was discussed and we agreed on a solution. Implementation will happen as part of bug 1604862.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED