Closed Bug 1597980 Opened 2 months ago Closed 2 months ago

30% of clients send a metrics ping without a baseline ping in the last two days

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: mdroettboom)

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(2 files, 1 obsolete file)

Details of this bug are available in this document:

https://docs.google.com/document/d/1J-KM7wFuJkWMHwib9Is-Ymu58-pkQvPgHnOtwr1hcvU/edit

Analysis is ongoing here:

https://iodide.telemetry.mozilla.org/notebooks/350/

We are proposing adding two new metrics:

  • to count the number of baseline pings during the lifetime of each metrics ping. This would be a counter in the metrics ping that is incremented every time a baseline ping is collected, and reset every time a metrics ping is sent (has ping lifetime).
  • the inverse: to count the number of metrics pings during the lifetime of each baseline ping. This would be a counter in the baseline ping that is incremented every time a metrics ping is collected, and reset every time a baseline ping is sent (has ping lifetime).

This will be helpful in determining where in the code the problem may lie. Are we sending baseline pings that don't arrive at the endpoint, or is the metrics ping truly being sent without any intervening baseline pings?

Attached file Data review request (obsolete) —
Flags: needinfo?(chutten)
Attached file Data review request
Attachment #9110248 - Flags: data-review?(chutten)
Attachment #9110242 - Attachment is obsolete: true
Comment on attachment 9110248 [details]
Data review request

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Glean so is documented in its [definitions file](https://github.com/mozilla/glean/blob/master/glean-core/metrics.yaml) and in [the Glean SDK's documentation](https://mozilla.github.io/glean/book/user/collected-metrics/metrics.html).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection can be controlled through the preferences of the application the Glean SDK is embedded within.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in builds after 2019-12-20.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

    Is the data collection request for default-on or default-off?

Default on for all channels.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :mdboom is responsible for renewing or removing the collection before it expires on 2019-12-20.

---
Result: datareview+
Flags: needinfo?(chutten)
Attachment #9110248 - Flags: data-review?(chutten) → data-review+
Attached file Glean PR
Assignee: nobody → mdroettboom
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m11]

This seems to validate well: https://sql.telemetry.mozilla.org/queries/66561/source#168778

Validation is complicated somewhat by the fact that a client can never report the exact same number of both "metrics" and "baseline" pings sent as received, because the means by which we learn how many pings have been sent sends another ping. Thus we can expect counts to be either exact or to show that we've received more pings than we have sent.

And we find out that this is half true:

  • We receive more "baseline" pings than we have been told have been sent
    • This could be because "baseline" pings are sent with low delay, all the time. So if the last "metrics" ping we received was at 4AM, of course the "baseline" pings we've been sending all day haven't been reported as sent yet. Only 7 clients report differently, and that falls at least an order of magnitude below our threshold of professional interest (aka "give a damn threshold").
  • However, we receive fewer "metrics" pings than we have been told have been sent
    • I don't know how this could be. The code is quite clear (if anything it might _under_report the number of sent pings if pingNames contains more than one "baseline" or "metrics" ping (ni?mdroetboom to see if that should be a bug someplace))

Is there a code reason why this would be? Am I too close to the end of my week to see that this is normal?

Flags: needinfo?(mdroettboom)

:chutten : I definitely see your point about the potential to under count if pingNames contains duplicate entries. In practice, I don't think this would ever happen, though. These are always called with "constant" lists (i.e. the code never aggregates a bunch of pings to send to generate a pingNames at run time).

As for what might be happening that would result in fewer "metrics" pings than we've been told... There are a few things that happen between the recording of the metric and actually successfully getting the ping to the server, so maybe that narrows down where to look.

  1. Generating the metrics ping content
  2. Writing the ping to disk
  3. Queuing the upload worker
  4. The upload worker waking up
  5. Uploading -- a successful HTTP transaction, including any retries on 5XX errors etc.

(1) Feels unlikely. Perhaps the logic to determine whether there is new metric content is faulty and it thinks there's nothing to send. It should be impossible that a metrics ping would get scheduled without any content. But maybe we could make use of the new "send_if_empty" flag to send all pings for a while to see if that changes things.

(2) There's a number of I/O errors possible here -- we log when that happens but don't report it back. While I don't know what the failure might be, a failure at this step would explain the data.

(3/4) This involves WorkManager, which makes me worried. In the standard 4am case, there is a WorkManager task (a) triggered at 4am, and then that triggers another WorkManager task (b) to actually do the uploading. (b) uses the ExistingWorkPolicy.KEEP, meaning if there's already a worker doing this task, it will not be replaced. Does that mean we could have two upload tasks working in parallel? That could be a problem since that task ultimately deletes the ping from disk. Still, if the problem is just that (b) never fires, the ping should remain on disk and be uploaded the next time the uploader fires, so unclear if this would result in a missing ping.

(5) I feel like looking at the HTTP server logs at the endpoint might be helpful in diagnosing issues at this step. I understand from :whd that this is non-trivial given the load balancer etc.

Flags: needinfo?(mdroettboom)
You need to log in before you can comment on or make changes to this bug.