30% of clients send a metrics ping without a baseline ping in the last two days
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Assigned: mdroettboom)
Details
(Whiteboard: [telemetry:glean-rs:m11])
Attachments
(2 files, 1 obsolete file)
3.24 KB,
text/plain
|
chutten
:
data-review+
|
Details |
41 bytes,
text/x-github-pull-request
|
Details | Review |
Details of this bug are available in this document:
https://docs.google.com/document/d/1J-KM7wFuJkWMHwib9Is-Ymu58-pkQvPgHnOtwr1hcvU/edit
Analysis is ongoing here:
https://iodide.telemetry.mozilla.org/notebooks/350/
We are proposing adding two new metrics:
- to count the number of baseline pings during the lifetime of each metrics ping. This would be a counter in the metrics ping that is incremented every time a baseline ping is collected, and reset every time a metrics ping is sent (has ping lifetime).
- the inverse: to count the number of metrics pings during the lifetime of each baseline ping. This would be a counter in the baseline ping that is incremented every time a metrics ping is collected, and reset every time a baseline ping is sent (has ping lifetime).
This will be helpful in determining where in the code the problem may lie. Are we sending baseline pings that don't arrive at the endpoint, or is the metrics ping truly being sent without any intervening baseline pings?
Assignee | ||
Comment 1•6 years ago
|
||
Assignee | ||
Comment 2•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Comment 3•6 years ago
|
||
Comment 4•6 years ago
|
||
Updated•6 years ago
|
Comment 5•6 years ago
|
||
This seems to validate well: https://sql.telemetry.mozilla.org/queries/66561/source#168778
Validation is complicated somewhat by the fact that a client can never report the exact same number of both "metrics" and "baseline" pings sent as received, because the means by which we learn how many pings have been sent sends another ping. Thus we can expect counts to be either exact or to show that we've received more pings than we have sent.
And we find out that this is half true:
- We receive more "baseline" pings than we have been told have been sent
- This could be because "baseline" pings are sent with low delay, all the time. So if the last "metrics" ping we received was at 4AM, of course the "baseline" pings we've been sending all day haven't been reported as sent yet. Only 7 clients report differently, and that falls at least an order of magnitude below our threshold of professional interest (aka "give a damn threshold").
- However, we receive fewer "metrics" pings than we have been told have been sent
- I don't know how this could be. The code is quite clear (if anything it might _under_report the number of sent pings if
pingNames
contains more than one "baseline" or "metrics" ping (ni?mdroetboom to see if that should be a bug someplace))
- I don't know how this could be. The code is quite clear (if anything it might _under_report the number of sent pings if
Is there a code reason why this would be? Am I too close to the end of my week to see that this is normal?
Assignee | ||
Comment 6•6 years ago
|
||
:chutten : I definitely see your point about the potential to under count if pingNames
contains duplicate entries. In practice, I don't think this would ever happen, though. These are always called with "constant" lists (i.e. the code never aggregates a bunch of pings to send to generate a pingNames
at run time).
As for what might be happening that would result in fewer "metrics" pings than we've been told... There are a few things that happen between the recording of the metric and actually successfully getting the ping to the server, so maybe that narrows down where to look.
- Generating the metrics ping content
- Writing the ping to disk
- Queuing the upload worker
- The upload worker waking up
- Uploading -- a successful HTTP transaction, including any retries on 5XX errors etc.
(1) Feels unlikely. Perhaps the logic to determine whether there is new metric content is faulty and it thinks there's nothing to send. It should be impossible that a metrics ping would get scheduled without any content. But maybe we could make use of the new "send_if_empty" flag to send all pings for a while to see if that changes things.
(2) There's a number of I/O errors possible here -- we log when that happens but don't report it back. While I don't know what the failure might be, a failure at this step would explain the data.
(3/4) This involves WorkManager, which makes me worried. In the standard 4am case, there is a WorkManager task (a) triggered at 4am, and then that triggers another WorkManager task (b) to actually do the uploading. (b) uses the ExistingWorkPolicy.KEEP
, meaning if there's already a worker doing this task, it will not be replaced. Does that mean we could have two upload tasks working in parallel? That could be a problem since that task ultimately deletes the ping from disk. Still, if the problem is just that (b) never fires, the ping should remain on disk and be uploaded the next time the uploader fires, so unclear if this would result in a missing ping.
(5) I feel like looking at the HTTP server logs at the endpoint might be helpful in diagnosing issues at this step. I understand from :whd that this is non-trivial given the load balancer etc.
Description
•