Closed Bug 1536930 Opened 6 years ago Closed 5 years ago

Glean: Make baseline ping crash resistent

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1599877

People

(Reporter: gfritzsche, Unassigned)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Looking at an ingestigation for FFTV of core pings (from the old Telemetry library), the sequence number for core pings seem to not have gaps when crashes happen, but would be missing pings.
That means that seq for the baseline ping does not match the count of observed foreground sessions.
This has impact on critical metrics, where we might see lower numbers for e.g.

Can we make the baseline ping more resilient for crashes or other uncontrollable termination events?
I see that we added ping lifetime metrics persistence in bug 1505732, but does that solve the issue here?

Some possible options that come to mind:

  • On startup, detect previous sessions without a baseline ping from persisted data and send one.
  • Accept missing data and increase seq on each app start to make the gaps visible.
  • Add a separate counter for foreground sessions with ping lifetime.
Priority: -- → P3
Whiteboard: [telemetry:mobilesdk:m?] → [telemetry:mobilesdk:m7]
Assignee: nobody → alessio.placitelli
Priority: P3 → P1

This is a good thought. If we were to crash after a ping was sent, we would have no way to detect this from the sequence numbers, as we'd just load them off the disk and increment them before sending the ping. Let's consider the following scenario:

  1. app starts;
  2. baseline ping gets generated;
    2a. seq=2 is loaded off the disk;
    2b. seq++ so seq=3 now;
    2c. we send the ping
  3. app crashes;
  4. app restarts;
  5. baseline ping gets generated;
    5a. seq=3 is loaded off the disk;
    5b. seq++ so seq=4 now;
    5c. we send the ping

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)

  • Accept missing data and increase seq on each app start to make the gaps visible.

I think I like this option, with a little tweak: we could additionally increment the sequence numbers after. This would give us:

  1. app starts;
    1a. seq=2 is loaded off the disk;
    1b. seq++ so `seq=3;
  2. baseline ping gets generated;
    2a. we send the ping with seq=3
    2b. seq++ so seq=4 now;
  3. app crashes;
  4. app restarts;
    4a. seq=4 is loaded off the disk;
    4b. seq++ so `seq=5;
  5. baseline ping gets generated;
    5a. we send the ping with seq=5
    5b. seq++ so seq=6 now;

@All - what do you think of this approach?

Flags: needinfo?(tlong)
Flags: needinfo?(mdroettboom)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(fbertsch)

I'm good with this approach.

I'm also curious how we could tie this into a crash ping or use it alongside a crash ping (later on when that sort of thing exists). It seems like a good way of knowing we should look for a crash ping, if we see a hole in sequence numbers then we should see a crash ping that corresponds to that missing ping/seq number.

Flags: needinfo?(tlong)

I'm wondering if we should keep these separate, since they have separate concerns:

  • Crashes - if we want to track crashes, we will probably use the crash ping. We should not use sequence numbers to determine if or when an app crashed; since there are other reasons for missing pings. If we want to track number of crashes, we could additionally make it a counter.

  • Foreground Sessions - Since this is an important metric, we should track it separately, as exactly the number of times we went to foreground. I like having this be a ping-lifetime, so that we can look at a period of time and easily SUM the number of foreground sessions.

  • Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?". My assumption here is that if a user is using the application, there should be a baseline ping representing that activity in some way. I think then that we should attempt to send pings that we see were missed for some reason, especially if we separate out Foreground Sessions into a separate metric. Alessio's approach might then look like:

  1. app starts;
    1a. seq=2 is loaded off the disk;
    1b. seq++ so `seq=3;
  2. baseline ping gets generated;
    2a. we send the ping with seq=3
    2b. seq++ so seq=4 now;
  3. app crashes;
  4. app restarts;
    4a. seq=4 is loaded off the disk;
    4b. The previous ping was unsent, send;
    4c. seq++ so `seq=5;
  5. baseline ping gets generated;
    5a. we send the ping with seq=5
    5b. seq++ so seq=6 now;

That gives a more complete history of the user. Then if we see missing sequence numbers they should mean something more serious to us, and hopefully provide a valuable signal.

Thoughts, Alessio?

Flags: needinfo?(fbertsch) → needinfo?(alessio.placitelli)

I don't have a lot of experience with how these things go, but I do feel that missing sequence numbers is a little obscure to use as a signal for something else (crashing), as we should probably aim for trying to keep them gapless (knowing that that's not 100% achievable).

Flags: needinfo?(mdroettboom)

Can we set up a test for this to try and replicate any of the scenarios presented to see exactly what glean does now? I know it's hard to say what sorts of crashes we may run into but we could simulate a few simple ones such as divide by zero, stack overflow, or any exception, just to see if glean's seq number and actual number of pings gets out of sync.

(In reply to Travis Long from comment #6)

Can we set up a test for this to try and replicate any of the scenarios presented to see exactly what glean does now? I know it's hard to say what sorts of crashes we may run into but we could simulate a few simple ones such as divide by zero, stack overflow, or any exception, just to see if glean's seq number and actual number of pings gets out of sync.

Right now glean (as the legacy telemetry library) simply hides this away, as we never see inconsistent sequence numbers due to crashes or the application being force-killed. But yes, things do get out of sync. It would be sane to have a test about this, but this is something that we know has happened given the analysis cited by Georg in comment 0.

(In reply to Frank Bertsch [:frank] from comment #4)

I'm wondering if we should keep these separate, since they have separate concerns:

  • Crashes - if we want to track crashes, we will probably use the crash ping. We should not use sequence numbers to determine if or when an app crashed; since there are other reasons for missing pings. If we want to track number of crashes, we could additionally make it a counter.

Yes, this is definitely covered by the crash ping.

  • Foreground Sessions - Since this is an important metric, we should track it separately, as exactly the number of times we went to foreground. I like having this be a ping-lifetime, so that we can look at a period of time and easily SUM the number of foreground sessions.

This is the number of 'baseline' pings received and, I think, this should overlap with the sequence number for the baseline ping (to have a nice absolute). We can of course have an explicit separate metric that goes in the 'metrics' ping, if we want to be explicit about this.

  • Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?".

Yes, this is the idea. Basically, in the current state, if we loose data we just don't won't know about it. We won't see holes in the sequence numbers due to 'baseline' ping not being sent. Thinking of this again, in the current state, sequence numbers are partially broken. They would only show holes for connection problems, not for other reasons. I think that the point of this bug is to uncover this problem and, if it's a concrete problem, devise a solution :)

My assumption here is that if a user is using the application, there should be a baseline ping representing that activity in some way. I think then that we should attempt to send pings that we see were missed for some reason, especially if we separate out Foreground Sessions into a separate metric. Alessio's approach might then look like:

  1. app starts;
    1a. seq=2 is loaded off the disk;
    1b. seq++ so `seq=3;
  2. baseline ping gets generated;
    2a. we send the ping with seq=3
    2b. seq++ so seq=4 now;
  3. app crashes;
  4. app restarts;
    4a. seq=4 is loaded off the disk;
    4b. The previous ping was unsent, send;
    4c. seq++ so `seq=5;
  5. baseline ping gets generated;
    5a. we send the ping with seq=5
    5b. seq++ so seq=6 now;

That gives a more complete history of the user. Then if we see missing sequence numbers they should mean something more serious to us, and hopefully provide a valuable signal.

Thoughts, Alessio?

Please correct me if I got this wrong, I think you are suggesting to apply both the sequence number "fix to uncover problems" (the tweaked version of Georg's 2nd) and 1 or 3. @Frank, is this correct?

I think I would be ok in implementing both (1) and (3). My concern with (2) is that it would change the behaviour of the baseline ping once again, in addition to making the "previous seq unsent" logic a potential complex surface for errors we should handle.

@Georg, what do you think?

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)

Some possible options that come to mind:

  1. On startup, detect previous sessions without a baseline ping from persisted data and send one.
  2. Accept missing data and increase seq on each app start to make the gaps visible.
  3. Add a separate counter for foreground sessions with ping lifetime.
Flags: needinfo?(alessio.placitelli) → needinfo?(fbertsch)
Whiteboard: [telemetry:mobilesdk:m7] → [telemetry:mobilesdk:m8]

(In reply to Alessio Placitelli [:Dexter] from comment #8)

Please correct me if I got this wrong, I think you are suggesting to apply both the sequence number "fix to uncover problems" (the tweaked version of Georg's 2nd) and 1 or 3. @Frank, is this correct?

I'm getting a bit confused about what 1 and 3 are, but if they are:

1. On startup, detect previous sessions without a baseline ping from persisted data and send one.

and

3. Add a separate counter for foreground sessions with ping lifetime.

I think we should do both.

Flags: needinfo?(fbertsch)

(In reply to Alessio Placitelli [:Dexter] from comment #8)

(In reply to Frank Bertsch [:frank] from comment #4)

  • Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?".

Yes, this is the idea. Basically, in the current state, if we loose data we just don't won't know about it. We won't see holes in the sequence numbers due to 'baseline' ping not being sent. Thinking of this again, in the current state, sequence numbers are partially broken. They would only show holes for connection problems, not for other reasons. I think that the point of this bug is to uncover this problem and, if it's a concrete problem, devise a solution :)

+1, visibility into gaps should be the main point here.
Whatever simple solution gets us this is good for me.

Flags: needinfo?(gfritzsche)
Component: Telemetry → Glean: SDK
Product: Toolkit → Data Platform and Tools

Untaking, as I'm not actively working on this right now. Turns out it's not that trivial anyway: doing what was suggested in the earlier comments with the sequence numbers makes them behave weirdly on some edge cases.

Assignee: alessio.placitelli → nobody
Priority: P1 → P3
Whiteboard: [telemetry:mobilesdk:m8] → [telemetry:glean-rs:m?]
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:backlog]
Whiteboard: [telemetry:glean-rs:backlog] → [telemetry:glean-rs:m11]
See Also: → 1599877

Once bug 1599877 is fixed, the Glean SDK will generate a startup baseline ping with reason "dirty_startup" to signal that the previous session was not cleanly closed.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.