Open Bug 1682085 Opened 5 years ago Updated 2 years ago

Metrics ping is frequently sent on days without a baseline ping

Categories

(Data Platform and Tools :: Glean: SDK, defect, P4)


Tracking

(Not tracked)

People

(Reporter: mdroettboom, Unassigned)

Details

From :felix (via e-mail)

Marissa says that if we try to measure DAU from the metrics ping, we get results around 20% higher than if we try to measure it from the baseline ping. A priori, we would expect baseline ping-derived DAU to be higher, because one hour of usage could be split across two calendar days of baseline pings (UTC midnight), but is less likely to be split across two metrics pings (local 4AM). Also if someone uninstalls the app or smashes their phone, then it's likely we'd receive a baseline ping without receiving a metrics ping, but not the other way around.

Motivated by this, I take a bunch of metrics pings, and try to join them to the relevant baseline pings (looking for baseline pings submitted on the same day or up to two days earlier), and take a look at the metrics pings that had no corresponding baseline ping.

Nearly half (44%) of metrics pings have no baseline pings submitted the same day or on the preceding days - ouch!

Of the metrics pings that don't have a baseline ping submitted the same day or the day before that, over 99% of them have NULL num_uris, which suggests that these metrics pings cover a period where the app had no activity - i.e. that no baseline pings were sent because the app was never opened that day, so we shouldn't have a metrics ping for that day.

Of the other metrics pings, only 48% have NULL num_uris, which is qualitatively different.

So: are we sure we're not sending metrics pings for periods that the app was never opened?

Details in this notebook: https://colab.research.google.com/drive/1bbH6jts2aLAXQ0IYRiJd3261cK-8x8pX
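For reference, here's a rough sketch of the kind of join described above (illustrative only, not the exact query from the notebook; it assumes the standard Glean tables for Fenix release and matches on UTC submission dates):

-- One row per client per metrics-ping submission date; count how many have no
-- baseline ping from the same client on that date or the two preceding dates.
WITH metrics_days AS (
  SELECT DISTINCT
    client_info.client_id AS client_id,
    DATE(submission_timestamp) AS metrics_date
  FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
  WHERE DATE(submission_timestamp) BETWEEN '2020-10-01' AND '2020-10-29'
),
baseline_days AS (
  SELECT DISTINCT
    client_info.client_id AS client_id,
    DATE(submission_timestamp) AS baseline_date
  FROM `moz-fx-data-shared-prod.org_mozilla_firefox.baseline`
  WHERE DATE(submission_timestamp) BETWEEN '2020-09-29' AND '2020-10-29'
)
SELECT
  COUNTIF(num_baseline_days = 0) / COUNT(*) AS fraction_without_baseline
FROM (
  SELECT
    m.client_id,
    m.metrics_date,
    COUNTIF(b.baseline_date IS NOT NULL) AS num_baseline_days
  FROM metrics_days m
  LEFT JOIN baseline_days b
    ON b.client_id = m.client_id
    AND b.baseline_date BETWEEN DATE_SUB(m.metrics_date, INTERVAL 2 DAY) AND m.metrics_date
  GROUP BY m.client_id, m.metrics_date
)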

A couple of off-the-cuff late-on-a-Friday thoughts in lieu of a proper look next week:

The next thing to try would be to use ping_info.end_time (local time) rather than submission time and see if things look the same. I tried to do that myself, but got stuck on BQ date parsing nonsense (ping_info.end_time is a string column in the main tables, unfortunately...). Android could significantly delay the submission of pings, so it may be that submission time falls down as a proxy for collection time...

Also -- timezones. Splitting by UTC days when metrics pings try to be "around once per local day" could be problematic.

Assignee: nobody → mdroettboom

Assigning to Mike to figure out next steps.

Priority: P3 → P1
Whiteboard: [telemetry:glean-rs:m?]

Doing the math in local timezones (in ping_info.end_time) with minute resolution makes things look much better -- no longer seeing any metrics pings without intervening baseline pings.

Perhaps the process from "collecting" the data on the client to actually uploading is much longer / less predictable than expected. We should run an analysis of the latency between end_time and submission_time to understand that better.

(Updated notebook, based on Felix's with the first SQL query changed to work on end_time: https://colab.research.google.com/drive/1y1QQPwYU52LEUfe0DiLPZmP9cSvZyRa3?usp=sharing)

Just measuring the latency of the metrics ping from "collection" time to "submission" time reveals the following:

Deciles:

[-542210152, 9, 20, 31, 43, 55, 144, 10096, 44372, 321636, 1603929077]

< 10% are submitted before they were collected. This is most likely due to incorrect local clocks.

At the 50th percentile, latency is around 2 minutes; the 60th shoots up to 2 hours, the 70th is 12 hours, and the 80th is around 2 days.

So for something like 25% of metrics pings, analyses that aggregate by submission date and then relate them to other pings submitted that day are going to be wildly off.

%%bigquery --project $project_id quantiles
-- Decile histogram of the latency between client-side collection time
-- (ping_info.end_time) and server-side receipt (submission_timestamp).
SELECT
  APPROX_QUANTILES(latency, 10) AS histogram
FROM
  (
    SELECT
      TIMESTAMP_DIFF(submission_timestamp, end_time, SECOND) AS latency
    FROM
      (
        SELECT
          -- end_time is a string in the main tables, so parse it to a TIMESTAMP
          PARSE_TIMESTAMP('%Y-%m-%dT%H:%M%Ez', m.ping_info.end_time) AS end_time,
          m.submission_timestamp AS submission_timestamp
        FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics` m
        WHERE m.submission_timestamp BETWEEN '2020-10-01' AND '2020-10-29'
          AND m.sample_id = 13
      )
  )
:klukas -- I understand you are working on aggregating by submission_timestamp vs. other things for Glean. This may be relevant to that.

Flags: needinfo?(jklukas)

At the 50th percentile, latency is around 2 minutes; the 60th shoots up to 2 hours, the 70th is 12 hours, and the 80th is around 2 days.

In my notebook, only 55.9% of metrics pings had at least one baseline ping submitted on the same date or the previous two dates. So this likely accounts for some of the issue, but maybe only half of it?

ping_info.end_time is a string column in the main tables, unfortunately

Note that the user-facing views actually add a parsed version of that field as ping_info.parsed_end_time, so you shouldn't need to do the parsing by hand.
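For example, the latency calculation above could skip the manual parsing entirely (a sketch, assuming parsed_end_time is exposed as a TIMESTAMP in the view):

SELECT
  TIMESTAMP_DIFF(submission_timestamp, ping_info.parsed_end_time, SECOND) AS latency
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
WHERE submission_timestamp BETWEEN '2020-10-01' AND '2020-10-29'
  AND sample_id = 13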

At the 50th percentile, latency is around 2 minutes; the 60th shoots up to 2 hours, the 70th is 12 hours, and the 80th is around 2 days.

I believe this analysis of quantiles is actually offset from the data in your query. APPROX_QUANTILES(latency, 10) returns 11 values, the first of which is the min observed and the last of which is the max.

See my take on this query in redash: https://sql.telemetry.mozilla.org/queries/76956/source

It shows 3% of pings have negative latency, the 50th percentile is 1 minute, the 80th percentile is 12 hours, and the 95th percentile is 11 days.
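For anyone reproducing that, here's a sketch of pulling specific, properly-labelled percentiles out of APPROX_QUANTILES (not necessarily the exact redash query):

-- APPROX_QUANTILES(x, 100) returns 101 values, so OFFSET(n) is the nth percentile.
SELECT
  quantiles[OFFSET(50)] AS p50_seconds,
  quantiles[OFFSET(80)] AS p80_seconds,
  quantiles[OFFSET(95)] AS p95_seconds
FROM (
  SELECT
    APPROX_QUANTILES(
      TIMESTAMP_DIFF(submission_timestamp, ping_info.parsed_end_time, SECOND),
      100) AS quantiles
  FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
  WHERE submission_timestamp BETWEEN '2020-10-01' AND '2020-10-29'
    AND sample_id = 13
)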

Flags: needinfo?(jklukas)

Played around a bit with Fenix in the Android emulator. (You sort of suggested this in your analysis, :felix, so I don't take credit; just trying to additionally support it with direct evidence.) We know that in many cases, the OS shuts down Fenix before it can send metrics pings or baseline pings. But perhaps on devices that aren't under memory or resource pressure, the OS keeps Fenix alive for a long time even after the user stops interacting? By fiddling with the emulator's clock, I am able to get Fenix to send a metrics ping at "4am" even when the app isn't foregrounded. So perhaps it is the case that a large number of clients run Firefox once and it just keeps running, happily sending metrics pings daily. Earlier, this wouldn't happen because the metrics would clear and there would be nothing further to report -- more recently a number of the metrics are set to "application" lifetime (for unrelated reasons), meaning they don't clear and the metrics ping will always be sent after visiting a single page.

I propose we:

(1) Add telemetry to record any of the most basic activity (probably setting a boolean flag on foreground)
(2) Use this to support or disprove this theory
(3) If so, use this signal of activity to determine whether to send a metrics ping, eliminating the ones that don't correspond to any activity

Unfortunately, if this is correct, the lower baseline DAU number is probably the more accurate of the two :(

:dexter ?ni for a sanity check

Flags: needinfo?(flawrence)
Flags: needinfo?(alessio.placitelli)

Scratch my previous comment (leaving for posterity). That theory would only account for "rescheduled + today + tomorrow" metrics pings, which are less than 10% of volume. So unlikely this is correct...

(In reply to Michael Droettboom [:mdroettboom] from comment #9)

Scratch my previous comment (leaving for posterity). That theory would only account for "rescheduled + today + tomorrow" metrics pings, which are less than 10% of volume. So unlikely this is correct...

Yes, +1 on this :) Let me think aloud for a moment.

We know that mobile apps can be killed at any time for various reasons (memory pressure, CPU, "App Optimizers", OS Settings and so on), and so we needed to build guarantees in our SDK to mitigate data loss.

If a user backgrounds a mobile Glean-powered product, it won't necessarily get killed. However, if the product process is not killed, we still schedule ping submission for the metrics ping, and I think this is OK. The product is really running, and neither the user nor the OS killed it.

We always evangelized the baseline ping as the signal to look at when it comes to activity/DAUs, and for a good reason: there are many edge cases beyond background/foreground that we can't possibly account for in pings with stronger mitigations against "data loss" (e.g. the metrics ping). The metrics ping should not be used for DAU/MAU or similar analysis. I think we should probably prioritize bug 1583835 and finally make it very explicit what our pings are for. We always evangelized and communicated them this way, but we should make it clearer in the docs.

What do you think, Mike, Jan-Erik?

Flags: needinfo?(alessio.placitelli) → needinfo?(jrediger)

Yes, I agree 100% with :dexter here. By design, DAU should be calculated on baseline pings, and there's nothing in this investigation that suggests that's the wrong choice.

However -- even if we don't use the metrics ping for DAU/WAU/MAU, understanding why it's off by such a large amount (2x) is important for trust in the data (and then we document it well so others won't independently discover the weirdness again). I think we have a plausible explanation for at least 10% of extra DAU, but not the remainder. None of that may lead to actual changes in client behavior...

I would say the goal here is not to make DAU match between baseline and metrics, but to understand why the gap is so large (and if revealed, fix bugs if any). But it's explicitly not "our work isn't done until they match".

Fully agree with Mike's assessment.

Flags: needinfo?(jrediger)

(In reply to Michael Droettboom [:mdroettboom] from comment #11)

I would say the goal here is not to make DAU match between baseline and metrics, but to understand why the gap is so large (and if revealed, fix bugs if any). But it's explicitly not "our work isn't done until they match".

Cool, I'm happy with that :) Should we consider asking for QA help?

Flags: needinfo?(mdroettboom)

Just poking at this a bit more, here's a particular client that sent one baseline ping on 2020-10-10, but then sent nearly empty metrics pings on every subsequent day without ever sending a baseline ping again: https://sql.telemetry.mozilla.org/queries/77014/source
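(For context, a sketch of the kind of per-client comparison that surfaces this pattern; the client id is a placeholder here, and the redash query may differ:)

-- Per-day counts of metrics vs. baseline pings for a single client.
SELECT
  'metrics' AS ping_type,
  DATE(submission_timestamp) AS submission_date,
  COUNT(*) AS pings
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
WHERE client_info.client_id = '<client id>'
  AND DATE(submission_timestamp) >= '2020-10-01'
GROUP BY submission_date
UNION ALL
SELECT
  'baseline' AS ping_type,
  DATE(submission_timestamp) AS submission_date,
  COUNT(*) AS pings
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.baseline`
WHERE client_info.client_id = '<client id>'
  AND DATE(submission_timestamp) >= '2020-10-01'
GROUP BY submission_date
ORDER BY submission_date, ping_type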

The metrics ping should not be used for DAU/MAU or similar analysis.

By design, DAU should be calculated on baseline pings, and there's nothing in this investigation that suggests that's the wrong choice.

Eek! I think we have a problem here then, beyond the scope of this bug. Here's the larger problem, for context...

I think we have four options going forward:

  1. Fix the metrics ping behaviour and use the metrics ping for most of our exploratory data analysis (EDA) and dashboard graphs - including DAU-related stuff.
  2. Add lots more metrics to the baseline ping and use it for most of our EDA and dashboard graphs.
  3. Find a reliable way to join baseline pings to their corresponding metrics pings that doesn't give us the worst of both worlds (i.e. we suffer from every problem (latency, quirks or unreliability) associated with both kinds of ping).
  4. Not be able to do large swaths of data work.

Until we act, we're following option (4): if we can't use the metrics ping for DAU-related things then we can't (say) see how many users browse >10 uris on a given day, or what fraction of daily users do X where X is something not recorded in the baseline ping. And if one metrics ping doesn't represent the state and activity from one user on one day that they used the browser, then what does it represent and how can we interpret a distribution of data across metrics pings?

I like the sound of option (1), and am willing to live with a bit of extra latency and up to one missing ping per churned user. It requires us to resolve the problem underlying this bug, and possibly more issues too given the opinions above! :mgorlick likes option (2). I could live with it too but I am skeptical we'll be able to collect enough metrics per baseline ping? I suspect option (3) is a fantasy. Option (4) is very limiting.

It sounds like :mdroettboom and I should chat out of band, so that I learn a bit more about why everyone is in agreement that option (1) is not appropriate. In the meantime, we'll want to address the issue underpinning this bug in any case. :klukas's lead, and his style of investigation (look at weird metrics pings) both seem promising...

Flags: needinfo?(flawrence)

(In reply to Michael Droettboom [:mdroettboom] from comment #11)

However -- even if we don't use the metrics ping for DAU/WAU/MAU, understanding why it's off by such a large amount (2x) is important for trust in the data (and then we document it well so others won't independently discover the weirdness again). I think we have a plausible explanation for at least 10% of extra DAU, but not the remainder. None of that may lead to actual changes in client behavior...

Fair enough!

I would say the goal here is not to make DAU match between baseline and metrics, but to understand why the gap is so large (and if revealed, fix bugs if any). But it's explicitly not "our work isn't done until they match".

Indeed, and sorry if comment 10 seemed to imply otherwise :-) I agree we need to understand and have a good explanation for this behaviour!

(In reply to Felix Lawrence from comment #15)

I think we have four options going forward:

  1. Fix the metrics ping behaviour and use the metrics ping for most of our exploratory data analysis (EDA) and dashboard graphs - including DAU-related stuff.

I think the disconnect here is that we're not sure what there is to fix right now, or whether this is even a "bug".

  2. Add lots more metrics to the baseline ping and use it for most of our EDA and dashboard graphs.

I think this would make the baseline ping pointless :) If that's a requirement, Fenix could roll out its own "mega-ping" and have it sent at a higher frequency.

  3. Find a reliable way to join baseline pings to their corresponding metrics pings that doesn't give us the worst of both worlds (i.e. we suffer from every problem (latency, quirks or unreliability) associated with both kinds of ping).

I think that this pretty much depends on the analyses you need to do!

  4. Not be able to do large swaths of data work.

I don't think this is even an option ;-)

Until we act, we're following option (4): if we can't use the metrics ping for DAU-related things then we can't (say) see how many users browse >10 uris on a given day, or what fraction of daily users do X where X is something not recorded in the baseline ping. And if one metrics ping doesn't represent the state and activity from one user on one day that they used the browser, then what does it represent and how can we interpret a distribution of data across metrics pings?

I'm not sure this fully captures the current status :) Right now, we don't really know if there's anything broken: it might be, but it also may be due to a surprising usage pattern.

Taking a closer look at this, something indeed seems odd: but maybe with the 'baseline' ping :-) Since version 24, we no longer use the WorkManager to schedule the metrics ping, which means the product process needs to be active to generate that ping (so Fenix is indeed running). Moreover, most of the metrics pings from the client Jeff took a look at have reason overdue. This means that the pings are generated at startup, if we're past the metrics ping due time (4AM local). We don't seem to generate multiple pings per day from the same client (which is a good thing). So these are legitimate pings. For each one of these pings we should have a corresponding baseline ping with reason foreground. If that checks out, then this is probably a weird but legitimate behaviour?
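A sketch of what that check could look like (assuming the reason field is exposed as ping_info.reason on both tables; matching on the same UTC submission date is a simplification):

-- For overdue metrics pings, check whether the same client also submitted a
-- baseline ping with reason 'foreground' on the same UTC date.
WITH overdue_metrics AS (
  SELECT
    client_info.client_id AS client_id,
    DATE(submission_timestamp) AS submission_date
  FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics`
  WHERE ping_info.reason = 'overdue'
    AND DATE(submission_timestamp) BETWEEN '2020-10-01' AND '2020-10-29'
),
foreground_baseline AS (
  SELECT DISTINCT
    client_info.client_id AS client_id,
    DATE(submission_timestamp) AS submission_date
  FROM `moz-fx-data-shared-prod.org_mozilla_firefox.baseline`
  WHERE ping_info.reason = 'foreground'
    AND DATE(submission_timestamp) BETWEEN '2020-10-01' AND '2020-10-29'
)
SELECT
  COUNTIF(b.client_id IS NOT NULL) / COUNT(*) AS fraction_with_foreground_baseline
FROM overdue_metrics m
LEFT JOIN foreground_baseline b
  ON b.client_id = m.client_id
  AND b.submission_date = m.submission_date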

  2. Add lots more metrics to the baseline ping and use it for most of our EDA and dashboard graphs.

This approach seems feasible to me (although technically it might need to be a new custom ping that follows the baseline schedule). It would allow us to use submission_date confidently. This would essentially be a feature-usage ping that would be focused on containing metrics related to measuring user engagement.

This would kind of invert the contract for how we define usage measures: pushing it down to the level of telemetry instrumentation, rather than looking over a menu of metrics that may or may not have been conceived with user engagement in mind and picking out which ones to feed into usage criteria defined after the fact.

That sounds like a dream to me in terms of being confident (and perhaps even entirely automated) in the ETL. But maybe this is entirely unacceptable due to the trade-off of losing the ability to retroactively define new usage criteria on existing probes, since we'd have to wait for new instrumentation to ride the trains in order to define new usage measures. Or maybe there's a middle ground where historical analysis on metrics pings can be used to prove the value of some new measurement, which then gets promoted to a dedicated metric on the feature-usage ping.

I think the next step is to proceed with adding the foreground counter to the metrics ping. This will give us a more accurate measure of activity than total_uri_count. If this accounts for the difference, then we can consider not sending metrics pings with a foreground count of 0, as a fix for the metrics ping, which would move in the direction of Felix's solution (1):

Fix the metrics ping behaviour and use the metrics ping for most of our exploratory data analysis (EDA) and dashboard graphs - including DAU-related stuff.

I created bug 1683707 for this next step.
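Once that counter ships, a first check could look something like this (sketch only; the column name is a guess at how the new metric might appear in BigQuery and should be treated as hypothetical until bug 1683707 lands, and the date range is illustrative):

-- Compare daily client counts from all metrics pings vs. only those that
-- report at least one foregrounding (metric/column name hypothetical).
SELECT
  DATE(m.submission_timestamp) AS submission_date,
  COUNT(DISTINCT m.client_info.client_id) AS metrics_clients,
  COUNT(DISTINCT IF(
    COALESCE(m.metrics.counter.glean_validation_foreground_count, 0) > 0,
    m.client_info.client_id,
    NULL)) AS metrics_clients_with_activity
FROM `moz-fx-data-shared-prod.org_mozilla_firefox.metrics` m
WHERE DATE(m.submission_timestamp) >= '2021-02-01'
GROUP BY submission_date
ORDER BY submission_date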

Flags: needinfo?(mdroettboom)
Assignee: mdroettboom → nobody
Priority: P1 → P3
Priority: P3 → P4