1536930 - Glean: Make baseline ping crash resistent

Reporter

Description

•

6 years ago

•

Looking at an ingestigation for FFTV of core pings (from the old Telemetry library), the sequence number for core pings seem to not have gaps when crashes happen, but would be missing pings.
That means that seq for the baseline ping does not match the count of observed foreground sessions.
This has impact on critical metrics, where we might see lower numbers for e.g.

Can we make the baseline ping more resilient for crashes or other uncontrollable termination events?
I see that we added ping lifetime metrics persistence in bug 1505732, but does that solve the issue here?

Georg Fritzsche [:gfritzsche]

Reporter

Comment 1

•

6 years ago

Some possible options that come to mind:

On startup, detect previous sessions without a baseline ping from persisted data and send one.
Accept missing data and increase seq on each app start to make the gaps visible.
Add a separate counter for foreground sessions with ping lifetime.

Travis Long [:travis_]

Updated

•

6 years ago

Priority: -- → P3

Alessio Placitelli [:Dexter]

Updated

•

6 years ago

Whiteboard: [telemetry:mobilesdk:m?] → [telemetry:mobilesdk:m7]

Alessio Placitelli [:Dexter]

Updated

•

6 years ago

Assignee: nobody → alessio.placitelli

Priority: P3 → P1

Alessio Placitelli [:Dexter]

Comment 2

•

6 years ago

This is a good thought. If we were to crash after a ping was sent, we would have no way to detect this from the sequence numbers, as we'd just load them off the disk and increment them before sending the ping. Let's consider the following scenario:

app starts;
baseline ping gets generated;
2a. seq=2 is loaded off the disk;
2b. seq++ so seq=3 now;
2c. we send the ping
app crashes;
app restarts;
baseline ping gets generated;
5a. seq=3 is loaded off the disk;
5b. seq++ so seq=4 now;
5c. we send the ping

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)

Accept missing data and increase seq on each app start to make the gaps visible.

I think I like this option, with a little tweak: we could additionally increment the sequence numbers after. This would give us:

app starts;
1a. seq=2 is loaded off the disk;
1b. seq++ so `seq=3;
baseline ping gets generated;
2a. we send the ping with seq=3
2b. seq++ so seq=4 now;
app crashes;
app restarts;
4a. seq=4 is loaded off the disk;
4b. seq++ so `seq=5;
baseline ping gets generated;
5a. we send the ping with seq=5
5b. seq++ so seq=6 now;

@All - what do you think of this approach?

Flags: needinfo?(tlong)

Flags: needinfo?(mdroettboom)

Flags: needinfo?(gfritzsche)

Flags: needinfo?(fbertsch)

Travis Long [:travis_]

Comment 3

•

6 years ago

I'm good with this approach.

I'm also curious how we could tie this into a crash ping or use it alongside a crash ping (later on when that sort of thing exists). It seems like a good way of knowing we should look for a crash ping, if we see a hole in sequence numbers then we should see a crash ping that corresponds to that missing ping/seq number.

Flags: needinfo?(tlong)

Frank Bertsch [:frank]

Comment 4

•

6 years ago

•

Edited

I'm wondering if we should keep these separate, since they have separate concerns:

Crashes - if we want to track crashes, we will probably use the crash ping. We should not use sequence numbers to determine if or when an app crashed; since there are other reasons for missing pings. If we want to track number of crashes, we could additionally make it a counter.
Foreground Sessions - Since this is an important metric, we should track it separately, as exactly the number of times we went to foreground. I like having this be a ping-lifetime, so that we can look at a period of time and easily SUM the number of foreground sessions.
Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?". My assumption here is that if a user is using the application, there should be a baseline ping representing that activity in some way. I think then that we should attempt to send pings that we see were missed for some reason, especially if we separate out Foreground Sessions into a separate metric. Alessio's approach might then look like:

app starts;
1a. seq=2 is loaded off the disk;
1b. seq++ so `seq=3;
baseline ping gets generated;
2a. we send the ping with seq=3
2b. seq++ so seq=4 now;
app crashes;
app restarts;
4a. seq=4 is loaded off the disk;
4b. The previous ping was unsent, send;
4c. seq++ so `seq=5;
baseline ping gets generated;
5a. we send the ping with seq=5
5b. seq++ so seq=6 now;

That gives a more complete history of the user. Then if we see missing sequence numbers they should mean something more serious to us, and hopefully provide a valuable signal.

Thoughts, Alessio?

Flags: needinfo?(fbertsch) → needinfo?(alessio.placitelli)

Michael Droettboom [:mdroettboom]

Comment 5

•

6 years ago

I don't have a lot of experience with how these things go, but I do feel that missing sequence numbers is a little obscure to use as a signal for something else (crashing), as we should probably aim for trying to keep them gapless (knowing that that's not 100% achievable).

Flags: needinfo?(mdroettboom)

Travis Long [:travis_]

Comment 6

•

6 years ago

Can we set up a test for this to try and replicate any of the scenarios presented to see exactly what glean does now? I know it's hard to say what sorts of crashes we may run into but we could simulate a few simple ones such as divide by zero, stack overflow, or any exception, just to see if glean's seq number and actual number of pings gets out of sync.

Alessio Placitelli [:Dexter]

Comment 7

•

6 years ago

(In reply to Travis Long from comment #6)

Can we set up a test for this to try and replicate any of the scenarios presented to see exactly what glean does now? I know it's hard to say what sorts of crashes we may run into but we could simulate a few simple ones such as divide by zero, stack overflow, or any exception, just to see if glean's seq number and actual number of pings gets out of sync.

Right now glean (as the legacy telemetry library) simply hides this away, as we never see inconsistent sequence numbers due to crashes or the application being force-killed. But yes, things do get out of sync. It would be sane to have a test about this, but this is something that we know has happened given the analysis cited by Georg in comment 0.

Alessio Placitelli [:Dexter]

Comment 8

•

6 years ago

(In reply to Frank Bertsch [:frank] from comment #4)

I'm wondering if we should keep these separate, since they have separate concerns:

Crashes - if we want to track crashes, we will probably use the crash ping. We should not use sequence numbers to determine if or when an app crashed; since there are other reasons for missing pings. If we want to track number of crashes, we could additionally make it a counter.

Yes, this is definitely covered by the crash ping.

Foreground Sessions - Since this is an important metric, we should track it separately, as exactly the number of times we went to foreground. I like having this be a ping-lifetime, so that we can look at a period of time and easily SUM the number of foreground sessions.

This is the number of 'baseline' pings received and, I think, this should overlap with the sequence number for the baseline ping (to have a nice absolute). We can of course have an explicit separate metric that goes in the 'metrics' ping, if we want to be explicit about this.

Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?".

Yes, this is the idea. Basically, in the current state, if we loose data we just don't won't know about it. We won't see holes in the sequence numbers due to 'baseline' ping not being sent. Thinking of this again, in the current state, sequence numbers are partially broken. They would only show holes for connection problems, not for other reasons. I think that the point of this bug is to uncover this problem and, if it's a concrete problem, devise a solution :)

My assumption here is that if a user is using the application, there should be a baseline ping representing that activity in some way. I think then that we should attempt to send pings that we see were missed for some reason, especially if we separate out Foreground Sessions into a separate metric. Alessio's approach might then look like:

app starts;
1a. seq=2 is loaded off the disk;
1b. seq++ so `seq=3;

baseline ping gets generated;
2a. we send the ping with seq=3
2b. seq++ so seq=4 now;

app crashes;

app restarts;
4a. seq=4 is loaded off the disk;
4b. The previous ping was unsent, send;
4c. seq++ so `seq=5;

baseline ping gets generated;
5a. we send the ping with seq=5
5b. seq++ so seq=6 now;

That gives a more complete history of the user. Then if we see missing sequence numbers they should mean something more serious to us, and hopefully provide a valuable signal.

Thoughts, Alessio?

Please correct me if I got this wrong, I think you are suggesting to apply both the sequence number "fix to uncover problems" (the tweaked version of Georg's 2nd) and 1 or 3. @Frank, is this correct?

I think I would be ok in implementing both (1) and (3). My concern with (2) is that it would change the behaviour of the baseline ping once again, in addition to making the "previous seq unsent" logic a potential complex surface for errors we should handle.

@Georg, what do you think?

(In reply to Georg Fritzsche [:gfritzsche] from comment #1)

Some possible options that come to mind:

On startup, detect previous sessions without a baseline ping from persisted data and send one.

Accept missing data and increase seq on each app start to make the gaps visible.

Add a separate counter for foreground sessions with ping lifetime.

Flags: needinfo?(alessio.placitelli) → needinfo?(fbertsch)

Alessio Placitelli [:Dexter]

Updated

•

6 years ago

Whiteboard: [telemetry:mobilesdk:m7] → [telemetry:mobilesdk:m8]

Frank Bertsch [:frank]

Comment 9

•

6 years ago

(In reply to Alessio Placitelli [:Dexter] from comment #8)

Please correct me if I got this wrong, I think you are suggesting to apply both the sequence number "fix to uncover problems" (the tweaked version of Georg's 2nd) and 1 or 3. @Frank, is this correct?

I'm getting a bit confused about what 1 and 3 are, but if they are:

1. On startup, detect previous sessions without a baseline ping from persisted data and send one.

and

3. Add a separate counter for foreground sessions with ping lifetime.

I think we should do both.

Flags: needinfo?(fbertsch)

Georg Fritzsche [:gfritzsche]

Reporter

Comment 10

•

5 years ago

(In reply to Alessio Placitelli [:Dexter] from comment #8)

(In reply to Frank Bertsch [:frank] from comment #4)

Sequence Numbers - Given we're not tracking crashes or sessions, sequence numbers should be mostly relevant to us in determining "Did we miss a piece of data?".

Yes, this is the idea. Basically, in the current state, if we loose data we just don't won't know about it. We won't see holes in the sequence numbers due to 'baseline' ping not being sent. Thinking of this again, in the current state, sequence numbers are partially broken. They would only show holes for connection problems, not for other reasons. I think that the point of this bug is to uncover this problem and, if it's a concrete problem, devise a solution :)

+1, visibility into gaps should be the main point here.
Whatever simple solution gets us this is good for me.

Flags: needinfo?(gfritzsche)

Alessio Placitelli [:Dexter]

Updated

•

5 years ago

Component: Telemetry → Glean: SDK

Product: Toolkit → Data Platform and Tools

Alessio Placitelli [:Dexter]

Comment 11

•

5 years ago

Untaking, as I'm not actively working on this right now. Turns out it's not that trivial anyway: doing what was suggested in the earlier comments with the sequence numbers makes them behave weirdly on some edge cases.

Assignee: alessio.placitelli → nobody

Priority: P1 → P3

Travis Long [:travis_]

Updated

•

5 years ago

Whiteboard: [telemetry:mobilesdk:m8] → [telemetry:glean-rs:m?]

Jan-Erik Rediger [:janerik]

Updated

•

5 years ago

Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:backlog]

Michael Droettboom [:mdroettboom]

Updated

•

5 years ago

Whiteboard: [telemetry:glean-rs:backlog] → [telemetry:glean-rs:m11]

Michael Droettboom [:mdroettboom]

Updated

•

5 years ago

Depends on: 1602824

Michael Droettboom [:mdroettboom]

Updated

•

5 years ago

Comment 12

•

5 years ago

Once bug 1599877 is fixed, the Glean SDK will generate a startup baseline ping with reason "dirty_startup" to signal that the previous session was not cleanly closed.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → DUPLICATE

Bugzilla

Quick Search

Glean: Make baseline ping crash resistent

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: gfritzsche, Unassigned)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Crash Data

Security

(public)

User Story

Description

Comment 1

Updated

Updated

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Comment 9

Comment 10

Updated

Comment 11

Updated

Updated

Updated

Updated

Updated

Comment 12