Closed Bug 1663573 Opened 4 years ago Closed 3 years ago

Somewhere from 10-30% of sync pings seem to be going missing

Categories

(Data Platform and Tools :: General, defect, P1)

defect
Points:
2

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: markh, Assigned: markh)

References

Details

(Whiteboard: [dataquality])

We have reason to believe that some "sync pings" sent by desktop are going missing. For background, "sync pings" are regular telemetry pings sent by desktop, but only for users with FxA/Sync enabled.

We have no idea where in the pipeline these events are going missing - there's a chance it's in the client itself (it uses the telemetry modules for the actual submission, but not for all scheduling), or anywhere else in the various stages before it makes its way to redash.

We have no reason to believe this is a recent regression - infact, I believe the problem might even predate this data getting into redash, as the analysis issues here (specifically, the first item below, where we see tabs "received" without having been "sent") existed when the only way to look at this data was via jupyter.

So what we have as evidence is:

  • In bug 1639841, Ryan shows that ~15-20% of the "tab received" events we see are for tabs where we can't see the "tab sent" event, which is theoretically impossible.

But this isn't a smoking gun, especially given the complexities in the queries there, so in the absence of other ideas about how to reconcile this, we've also come up with:

  • All sync users who send main pings should also send "sync" pings. To try and avoid "scheduling" differences, I've only considered both pings with reason="shutdown" - this shows there are only ~82% of "sync pings" compared to "main pings for sync users". https://sql.telemetry.mozilla.org/queries/74449/source#186178.

  • When sending a tab, desktop clients record an event in the sync ping, and the server reports events via some other mechanism, but which end up in redash. For any given day, the events seen by desktop are around 30% less than the events recorded by the server. https://sql.telemetry.mozilla.org/queries/74450/source#186179

Whiteboard: [data-quality]
Priority: -- → P2

Hi Mark, a few questions on this.

Are you planning to investigate this further?
What downstream impact does it have?
What is the level of urgency around getting it figured out?

Flags: needinfo?(markh)

(In reply to Mark Reid [:mreid] from comment #1)

Hi Mark, a few questions on this.

Are you planning to investigate this further?

No, we've investigated this to the best of our ability. We probably could investigate further if there was evidence the issue was on the client, but I don't think there's evidence that is the case.

What downstream impact does it have?

It prevents us working out what our "send tab" success rate is, and why.

It also means we struggle to have confidence in telemetry - I imagine we wouldn't accept this amount of telemetry going missing without explanation anywhere else, right?

What is the level of urgency around getting it figured out?

Send-tab is a high value feature and we've invested significant time and effort in recording and analysing telemetry so we can ensure the feature works correctly. This is blocking us from doing that. How "urgent" that is isn't really my call to make - we could escalate up the product chain to ask if you think that would be helpful?

Flags: needinfo?(markh)
See Also: → 1642386, 1373439, 1572744, 1638899

Thanks Mark. Arkadiusz is going to do some investigation here and see if we can track this down on the pipeline side.

Assignee: nobody → akomarzewski
Points: --- → 2
Priority: P2 → P1

From the pipeline point of view it doesn't look like we're discarding any sync pings.
I have checked error tables looking for any hints but there's nothing suspicious related to sync pings there. Also considering how we process them, they are not really different from main pings.

problem might even predate this data getting into redash, as the analysis issues here (specifically, the first item below, where we see tabs "received" without having been "sent") existed when the only way to look at this data was via jupyter.

If the problem was on the ingestion side, it would mean we had reimplemented it while moving from AWS to GCP.

All sync users who send main pings should also send "sync" pings. To try and avoid "scheduling" differences, I've only considered both pings with reason="shutdown" - this shows there are only ~82% of "sync pings" compared to "main pings for sync users". https://sql.telemetry.mozilla.org/queries/74449/source#186178.

It might be a long shot but this gets interesting when grouped by channel: https://sql.telemetry.mozilla.org/queries/75370/source - release still shows ~80% main_clients to sync_devices ratio, while for nightly this is ~100%.
Mark: could this be a clue?

Flags: needinfo?(markh)

(In reply to Arkadiusz Komarzewski [:akomar] from comment #4)

From the pipeline point of view it doesn't look like we're discarding any sync pings.
I have checked error tables looking for any hints but there's nothing suspicious related to sync pings there. Also considering how we process them, they are not really different from main pings.

I thought there was some scary scala code that validates etc - are we confident that any issues in that code would be reported in the error tables? Years ago we had some validation problems but I don't recall what made us notice then.

It might be a long shot but this gets interesting when grouped by channel: https://sql.telemetry.mozilla.org/queries/75370/source - release still shows ~80% main_clients to sync_devices ratio, while for nightly this is ~100%.
Mark: could this be a clue?

The client code doesn't make any distinction for channels and uses the telemetry modules to do the actual submission - so while I agree this might be a clue, I'm afraid I can't think of what it might mean :/ It's certainly true that the problem might be in the client code somewhere, but I'm not sure how to get any insights into that.

Flags: needinfo?(markh)

(In reply to Mark Hammond [:markh] [:mhammond] from comment #5)

I thought there was some scary scala code that validates etc - are we confident that any issues in that code would be reported in the error tables? Years ago we had some validation problems but I don't recall what made us notice then.

That is no longer the case as we have migrated the pipeline to GCP last year and this involved complete rewrite of the ingestion system.

I'm afraid I can't think of what it might mean :/ It's certainly true that the problem might be in the client code somewhere, but I'm not sure how to get any insights into that.

In the bug description you mentioned:

client itself (it uses the telemetry modules for the actual submission, but not for all scheduling)

Does this mean that sync pings use some custom scheduling mechanism, different than the main pings?

(In reply to Arkadiusz Komarzewski [:akomar] from comment #6)

client itself (it uses the telemetry modules for the actual submission, but not for all scheduling)

Does this mean that sync pings use some custom scheduling mechanism, different than the main pings?

Yes.

We'd be fine with having hooks into the telemetry scheduler, but there was pushback on that from the telemetry client team back in the day. As you can expect, we can't reproduce pings going missing (ie, all that scheduling seems to do what we expect according to about:telemetry) so there's nothing glaringly obvious going wrong. But as mentioned, I can't rule out it's still a client problem. I'd obviously be fine with landing some kind of cross-checking telemetry in the main ping to try and help us narrow down where things are going wrong, but I can't think of what that might look like.

Hey :chutten - could you take a look here? Custom scheduling that Mark mentions is something I would be concerned about, but I'm not really familiar with the client code enough to judge if it makes sense. From your perspective, what would be the best approach here?

Flags: needinfo?(chutten)

Theory: shutdown contention

Sending at shutdown is a contentious time... in that there's a lot of resource contention. Telemetry's had problems in the recent past about being the reason for shutdown hangs because the disk IO inherent in sending pings can take Absolutely Forever on machines with older spinning disks.

However, "main" pings are sent during the sendTelemetry Async Shutdown barrier (which is started by the profile-before-change-telemetry topic). This can't start when there's synchronous work being done during profile-before-change, which includes sync's call to submitExternalPing.

We archive and process pings in order, so since the "sync" ping was submitted before the "main" ping was, there should never be a case where we get the latter without the former. This theory doesn't seem likely, but I include it for context and future reference.

Cross-validation

Unfortunately, because we keep links between "sync" pings and "main" pings from forming, we can't easily identify sessions that should send both pings to validate them. We can't count the number of such sessions then ensure the counts of "main" pings and "sync" pings match that number of sessions. Assuming sync being enabled is fairly stable over time, we should always see more shutdown-reason "main" pings than shutdown-reason "sync" pings because the former are always sent and latter are only sent if there's data to send.

If the problem is that we don't have a good handle on the number of "sync" pings that are submitted, we can get a clear picture with some one or more of these changes to instrumentation:

  • Record to a "main"-ping uint scalar the number of times a "sync" ping is submitted. Then we can sum those and compare to the number of "sync" pings received to ensure we're seeing what we expect.
    • Cross-link threat: low. This is the minimal information we can collect to answer our questions. But we should still limit this to just a small window of time in case we find that it's too good at fingerprinting.
  • Send a small "sync-health" ping on every shutdown with the number of "sync" pings submitted since this session. Won't help us if the problem is that "sync" can't send anything at shutdown, but will help us if the problem is getting the correct count of "sync" pings. Allows us to send more diagnostic information as well (like the reasons of the pings that have been submitted) that we wouldn't risk putting in a "main" ping.
    • Cross-link thread: zero.
  • Change the "sync" ping to always submit at shutdown, even if the payload is all zeroes. This will make shutdown-reason "main" ping to shutdown-reason "sync" ping counts easier, but doesn't help if we then worry about other reasons of pings.
    • Cross-link threat: zero.

If our confidence in the count of "sync" pings being submitted is high (ie, that the numbers in :markh's query should already be exactly correct (or near enough modulo people turning sync on/off within a day)), then either "main" pings with reason "shutdown" and claiming environment.services.sync_enabled and with session length > 60 seconds are being overreceived or "sync" pings with reason "shutdown" are being underreceived.

  • "main" pings with reason "shutdown" are some of the more-studied pings in our pipeline, so I really really hope we can trust their numbers aren't inflated. But maybe environment.services.sync_enabled isn't staying true all the way to shutdown in some instances? Maybe the "main" ping's session length is inflated quite a bit sometimes?
    • Idea: Since the "main" ping is submitted later in shutdown than the "sync" ping, the session length reported by the "main" ping will be longer than that observed by the "sync" ping. It seems unlikely that it'd be 40s later (the query starts at 60s given an assumption that we'd have data within 20s), but shutdown hang timeouts are measured in minutes and Telemetry's hit a few of those on slow machines...
      • Unfortunately, if this is a measureable effect we'll have no way to exclude these clients sensibly. There'll always be a session length S for which the number of pings line up (in this case it seems to be around 2min), but that doesn't tell us anything useful.

:markh, I don't know how sync's internals might affect whether the ping decides to be submitted. How do my wild speculations hold up?

Flags: needinfo?(chutten) → needinfo?(markh)

Thanks chutten (and apologies for letting this slip for nearly a couple of weeks!)

(In reply to Chris H-C :chutten from comment #9)

Theory: shutdown contention

Sending at shutdown is a contentious time... in that there's a lot of resource contention. Telemetry's had problems in the recent past about being the reason for shutdown hangs because the disk IO inherent in sending pings can take Absolutely Forever on machines with older spinning disks.

The issues here all sound legitimate but it's difficult to guess how much of a problem they are in practice. We should consider a tighter integration of the sync ping into the core telemetry modules, so any issues around contention are better managed and sync becomes less of a "special case".

If the problem is that we don't have a good handle on the number of "sync" pings that are submitted, we can get a clear picture with some one or more of these changes to instrumentation:

  • Record to a "main"-ping uint scalar the number of times a "sync" ping is submitted. Then we can sum those and compare to the number of "sync" pings received to ensure we're seeing what we expect.

This sounds relatively easy to do. To be clear though, this would help (dis-?)prove sync pings are going missing, but make no steps towards making them appear.

  • Change the "sync" ping to always submit at shutdown, even if the payload is all zeroes. This will make shutdown-reason "main" ping to shutdown-reason "sync" ping counts easier, but doesn't help if we then worry about other reasons of pings.
    • Cross-link threat: zero.

This is also easy and doesn't require any new counters be defined. However, this is the opposite of the above - it gets no closer to proving pings are going missing, but speculatively attempts a solution.

If our confidence in the count of "sync" pings being submitted is high (ie, that the numbers in :markh's query should already be exactly correct (or near enough modulo people turning sync on/off within a day)), then either "main" pings with reason "shutdown" and claiming environment.services.sync_enabled and with session length > 60 seconds are being overreceived or "sync" pings with reason "shutdown" are being underreceived.

As mentioned above, we've come to the conclusion that sync pings are being underreceived due to data being missing (specifically, "receive" events without a matching "send" event). So I find it highly unlikely the problem is actually main pings being overreceived.

  • Idea: Since the "main" ping is submitted later in shutdown than the "sync" ping, the session length reported by the "main" ping will be longer than that observed by the "sync" ping. It seems unlikely that it'd be 40s later (the query starts at 60s given an assumption that we'd have data within 20s), but shutdown hang timeouts are measured in minutes and Telemetry's hit a few of those on slow machines...
    • Unfortunately, if this is a measureable effect we'll have no way to exclude these clients sensibly. There'll always be a session length S for which the number of pings line up (in this case it seems to be around 2min), but that doesn't tell us anything useful.

It seems unlikely to me that the volume we seem to be missing could be accounted for by this.

:markh, I don't know how sync's internals might affect whether the ping decides to be submitted.

It will always submit once a single sync has finished - which typically happens 10 seconds after startup, which my query tries to take into account (and assuming longer period before the first sync doesn't change the numbers in a meaningful way).

How do my wild speculations hold up?

They all sound reasonable enough.

Flags: needinfo?(markh)

(In reply to Mark Hammond [:markh] [:mhammond] from comment #10)

Thanks chutten (and apologies for letting this slip for nearly a couple of weeks!)

(In reply to Chris H-C :chutten from comment #9)

Theory: shutdown contention

Sending at shutdown is a contentious time... in that there's a lot of resource contention. Telemetry's had problems in the recent past about being the reason for shutdown hangs because the disk IO inherent in sending pings can take Absolutely Forever on machines with older spinning disks.

The issues here all sound legitimate but it's difficult to guess how much of a problem they are in practice. We should consider a tighter integration of the sync ping into the core telemetry modules, so any issues around contention are better managed and sync becomes less of a "special case".

About the only difference I can see is the "sync" ping submitting synchronously during profile-before-change instead of synchronously during the shutdownTelemetry AsyncShutdown barrier (which happens immediately afterwards)... so the tightness of the integration shouldn't matter.

Of course "should" is doing some heavy lifting here. Assumptions consistently allow me to make a fool of myself.

If the problem is that we don't have a good handle on the number of "sync" pings that are submitted, we can get a clear picture with some one or more of these changes to instrumentation:

  • Record to a "main"-ping uint scalar the number of times a "sync" ping is submitted. Then we can sum those and compare to the number of "sync" pings received to ensure we're seeing what we expect.

This sounds relatively easy to do. To be clear though, this would help (dis-?)prove sync pings are going missing, but make no steps towards making them appear.

Depends on your definition of "missing", but yes: this would ensure that we have a concrete count of how many were submitted, and if that lines up with how many were received then it would disprove that the pings have gone walkabout someplace between submission and receipt. But it goes a little further than that by, if the counts don't line up, giving us a concrete benchmark of how many are going missing instead of trying to infer the correct number based on external measures ("shutdown"-reason "main" pings with environment.services.sync_enabled being true and session_length being > 60 are probably pretty close, but there might be too many moving parts making this a good estimate of how many "shutdown"-reason "sync" pings we should expect). Counting them directly might get us closer (and will certainly make the analyses easier to write).

  • Change the "sync" ping to always submit at shutdown, even if the payload is all zeroes. This will make shutdown-reason "main" ping to shutdown-reason "sync" ping counts easier, but doesn't help if we then worry about other reasons of pings.
    • Cross-link threat: zero.

This is also easy and doesn't require any new counters be defined. However, this is the opposite of the above - it gets no closer to proving pings are going missing, but speculatively attempts a solution.

Yes and no. Both of these attempt to find a better count for how many "shutdown"-reason "sync" pings we should receive, the first by counting them and this second one by making the approximate ("shutdown"-reason "main" pings with environment.services.sync_enabled being true and session_length being > 60 ) a closer one. I prefer the first suggestion to the second because it's a direct measurement, but both of these are predicated on the idea that we don't already have a good estimate (which might just be me not knowing enough about the sync internals to be a good judge).

And, of course, we could do both.

If our confidence in the count of "sync" pings being submitted is high (ie, that the numbers in :markh's query should already be exactly correct (or near enough modulo people turning sync on/off within a day)), then either "main" pings with reason "shutdown" and claiming environment.services.sync_enabled and with session length > 60 seconds are being overreceived or "sync" pings with reason "shutdown" are being underreceived.

As mentioned above, we've come to the conclusion that sync pings are being underreceived due to data being missing (specifically, "receive" events without a matching "send" event). So I find it highly unlikely the problem is actually main pings being overreceived.

Right, yes, of course. Sorry about that, I was enumerating possibilities and forgot you'd already ruled that out.

Oh here's a cheery thought: I suppose it's possible that we could be underreceiving "sync" pings and overreceiving "main" pings at the same time... : |

  • Idea: Since the "main" ping is submitted later in shutdown than the "sync" ping, the session length reported by the "main" ping will be longer than that observed by the "sync" ping. It seems unlikely that it'd be 40s later (the query starts at 60s given an assumption that we'd have data within 20s), but shutdown hang timeouts are measured in minutes and Telemetry's hit a few of those on slow machines...
    • Unfortunately, if this is a measureable effect we'll have no way to exclude these clients sensibly. There'll always be a session length S for which the number of pings line up (in this case it seems to be around 2min), but that doesn't tell us anything useful.

It seems unlikely to me that the volume we seem to be missing could be accounted for by this.

Good. One hypothesis knocked down.

:markh, I don't know how sync's internals might affect whether the ping decides to be submitted.

It will always submit once a single sync has finished - which typically happens 10 seconds after startup, which my query tries to take into account (and assuming longer period before the first sync doesn't change the numbers in a meaningful way).

How do my wild speculations hold up?

They all sound reasonable enough.

All of my suggestions and ideas and hypotheses are centred around gaining concrete evidence that Telemetry's doing the wrong thing somehow, and then using the data from clients who exhibit the wrongness to identify patterns and suggest fixes. It might be that the mismatch your analysis shows should be concrete enough for me and I'm just being difficult, in which case do tell me and I'll start working the problem from that angle.

I have taken a look at the send failures by ping type probe and there are under 3M failed attempts to send "sync" pings (compared to 25M+ failures to send "main" pings). If there was a permanent failure (repeated temporary failure) to send, I'd expect it to show up there as an outrageously-large spike in volume. There's a higher-than-normal proportion of eTooLateToSend (the ping couldn't be sent by non-pingsender because it was too late in shutdown and the network's gone to sleep) in there, but it wouldn't apply to "shutdown"-reason "sync" pings because those are sent using pingsender.

Another thought: while we're entertaining possible fixes we could replace the profile-before-change notification observer with a direct call to send the sync ping during TelemetryController's shutdown (adding it to the likes of the "event" ping and "main" ping and so forth around here). If it's a failure due to the differences in shutdown signal, that'll sort it.

Thoughts on which course of action to pursue?

Flags: needinfo?(markh)

(In reply to Chris H-C :chutten from comment #11)

All of my suggestions and ideas and hypotheses are centred around gaining concrete evidence that Telemetry's doing the wrong thing somehow, and then using the data from clients who exhibit the wrongness to identify patterns and suggest fixes. It might be that the mismatch your analysis shows should be concrete enough for me and I'm just being difficult, in which case do tell me and I'll start working the problem from that angle.

No, I think your approach is correct and I think your judgement is better than mine in deciding exactly which approach to use here.

Another thought: while we're entertaining possible fixes we could replace the profile-before-change notification observer with a direct call to send the sync ping during TelemetryController's shutdown (adding it to the likes of the "event" ping and "main" ping and so forth around here). If it's a failure due to the differences in shutdown signal, that'll sort it.

Yeah, that's what I had in mind when I mentioned "tighter integration".

Thoughts on which course of action to pursue?

I think it makes sense to move the sending into TelemetryControllerParent.jsm and have that record in the main ping the number of times it tried to submit a sync ping. We could consider doing that for some other pings too for extra confidence, although that probably introduces a bigger cross-linking risk, which I understand you want to avoid. As above though, I'd happily defer to your better judgement on what's the best bang for our buck here.

Flags: needinfo?(markh)

(In reply to Mark Hammond [:markh] [:mhammond] from comment #12)

I think it makes sense to move the sending into TelemetryControllerParent.jsm and have that record in the main ping the number of times it tried to submit a sync ping. We could consider doing that for some other pings too for extra confidence, although that probably introduces a bigger cross-linking risk, which I understand you want to avoid. As above though, I'd happily defer to your better judgement on what's the best bang for our buck here.

*number of times it tried to submit a "shutdown"-reason "sync" ping

That sounds like an excellent course of action to me. It's not 100% scientific us changing what we're measuring as we start measuring it, but we're not interested in the change, we're interested in the result, so I'll just tell my inner Bill Nye to back off : )

Depends on: 1673795

The numbers seem to agree much more when counted using the scalar you added in bug 1673795: https://sql.telemetry.mozilla.org/queries/76156/source#189947 (( that first build day's disagreement I originally figured was due to multiple builds, only some of which had the probe, but I think I corrected for that in the SQL and they still disagree ))

We're at 98+% of agreement between the two mechanisms at present. Let's see if it holds.

(In reply to Chris H-C :chutten from comment #14)

We're at 98+% of agreement between the two mechanisms at present. Let's see if it holds.

It does seem to be holding. However, the graphs in comment 0 haven't changed - so we are still seeing the server report ~30% more tabs sent than desktop (it looks like < 10% of users disable telemetry) and also that we can see many tabs being received that were never recorded as being sent (and you'd expect users who disable telemetry would do so on all their devices).

So I guess the next thing might be to investigate if the scheduled sync pings are somehow going missing - ni? me to try and work out what that means.

Flags: needinfo?(markh)

:markh - as discussed on Slack, ping me when you get back to this and need anything.

Assignee: akomarzewski → markh

FTR, we are removing the telemetry.sync_shutdown_ping_sent probe in bug 1700834, and looking at the graph in comment 14 updated for all channels and more recent dates shows it is still recording 98+% agreement for the ping types.

I'm not going to close this yet because there's still strangeness somewhere, even if it's not in the client submission process.

Flags: needinfo?(markh)

I'm closing this since we haven't found any concrete issues. Let's revisit if something comes up again.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE
Whiteboard: [data-quality] → [dataquality]
You need to log in before you can comment on or make changes to this bug.