Closed Bug 1784297 Opened 2 years ago Closed 2 years ago

Investigate discrepancies between Glean and Legacy telemetry reports on Glam

Categories

(Data Platform and Tools :: Glean: SDK, defect, P2)


Tracking

(firefox108 fixed, firefox109 fixed)

RESOLVED FIXED

People

(Reporter: bas.schouten, Assigned: chutten)

References

Details

Attachments

(1 file)

Some metrics that are mirrored from Glean into legacy telemetry show quite large and noticeable differences in both client counts and distributions. Some of this could be explained by delayed reporting or accuracy/rounding differences, but some differences seem rather large for such effects. I'd like to be sure there are no biases or systemic issues causing these differences. An example is given here:

Legacy: https://glam.telemetry.mozilla.org/firefox/probe/wr_renderer_time/explore?currentPage=1&process=gpu&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Glean: https://glam.telemetry.mozilla.org/fog/probe/wr_renderer_time/explore?aggType=avg&currentPage=1&ping_type=*&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Both client counts and higher percentiles differ considerably.

Assignee: nobody → pmcmanis

Perry, feel free to change priority on this one.

Severity: -- → S2
Priority: -- → P2

Will be adding some commentary as I work through different analyses:

Firstly, I have checked client populations more directly. I used a natural time window: the first week of May 2022, which happens to run Sunday to Saturday and balances recency against having received enough "metrics" pings to converge.

What we find is that, with a completely unqualified population, "main" is the highest, followed by "baseline" about 2% lower, and then "metrics", which is 7.5% lower than "main".

However, this is not a fair comparison, as it includes versions in which FOG was not implemented. If we restrict to the versions of Firefox Desktop that have FOG enabled:

source                client count    diff to main
glean baseline ping   121,148,335     +2.65%
glean metrics ping    114,226,495     -3.21%
legacy main ping      118,017,454      0.00%

That difference between main/baseline and metrics, incidentally, will be made up primarily of single-session users and does not represent an error in collection.
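For reference, the counting was of roughly this shape (a sketch only: table names and the date filter are assumptions, and the version restriction mentioned above is omitted):

  -- Distinct clients per ping type, first week of May 2022.
  SELECT 'glean baseline ping' AS source, COUNT(DISTINCT client_info.client_id) AS clients
  FROM `mozdata.firefox_desktop.baseline`
  WHERE DATE(submission_timestamp) BETWEEN '2022-05-01' AND '2022-05-07'
  UNION ALL
  SELECT 'glean metrics ping', COUNT(DISTINCT client_info.client_id)
  FROM `mozdata.firefox_desktop.metrics`
  WHERE DATE(submission_timestamp) BETWEEN '2022-05-01' AND '2022-05-07'
  UNION ALL
  SELECT 'legacy main ping', COUNT(DISTINCT client_id)
  FROM `mozdata.telemetry.main`
  WHERE DATE(submission_timestamp) BETWEEN '2022-05-01' AND '2022-05-07';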

A follow-up analysis that I would like to do would require that the legacy client id be sent in the "baseline" ping. A subsequent one would require that the Glean client id be sent in the legacy "main" ping. Steps (a query sketch for step 3 follows the list):

  1. Pick a reasonable window of time (e.g. the first week of May 2022)
  2. Retrieve the list of client ids from legacy "main" and Glean "baseline", and the legacy client ids mirrored into "baseline"
  3. Check first the percentage of legacy client ids from "baseline" that are present in "main"
  4. Check then the ratio of legacy client ids to Glean "baseline" client ids
  5. If "main" carries the Glean client id, check the complement of step 3
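A sketch of step 3, assuming the mirrored legacy client id is exposed on the "baseline" table as metrics.uuid.legacy_telemetry_client_id (table and column names here are assumptions):

  -- What share of legacy client ids seen in Glean "baseline" also appear in legacy "main"?
  WITH baseline_legacy_ids AS (
    SELECT DISTINCT metrics.uuid.legacy_telemetry_client_id AS legacy_id
    FROM `mozdata.firefox_desktop.baseline`
    WHERE DATE(submission_timestamp) BETWEEN '2022-05-01' AND '2022-05-07'
      AND metrics.uuid.legacy_telemetry_client_id IS NOT NULL
      -- exclude the canary id (see the caveat below)
      AND metrics.uuid.legacy_telemetry_client_id != 'c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0'
  ),
  main_ids AS (
    SELECT DISTINCT client_id
    FROM `mozdata.telemetry.main`
    WHERE DATE(submission_timestamp) BETWEEN '2022-05-01' AND '2022-05-07'
  )
  SELECT COUNTIF(m.client_id IS NOT NULL) / COUNT(*) AS share_in_main
  FROM baseline_legacy_ids b
  LEFT JOIN main_ids m
    ON m.client_id = b.legacy_id;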

It is important to retain the context that the legacy client id mirrored into Glean pings:

Might not always have a value due to being too early for it to have loaded. The value may be the canary client id c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0 in pings near when the data upload pref is disabled (if Telemetry gets to go first), or between when a client_id has been removed and when it has been regenerated.

The next step for me is to check whether this measure varies for cases where, and only where, "metrics" has a legacy client id present (other than the canary value).

As an addendum to the above chart: based on the GLAM ETL job, version coverage does not explain the difference in client counts seen in GLAM, because GLAM already restricts the version number to Current - 3:

https://github.com/mozilla/bigquery-etl/blob/main/sql/glam-fenix-dev/glam_etl/firefox_desktop_glam_nightly__clients_histogram_aggregates_v1/query.sql#L93

Assignee: pmcmanis → chutten

Smallish update:

I have made a start on checking only those histograms where we can tie a Glean "metrics" ping to a legacy "main" ping: that is, there is a legacy client_id in "metrics" and the "metrics" submission day is one day after "main"'s (though this relationship is stronger for "baseline").

                  metrics_clients   legacy_clients
non-gpu legacy    471,280           143,753
gpu legacy        471,280           439,051

This is actually better alignment than I had expected. However, there's an issue here: "metrics" covers a different time window than "main", and I just don't see how to get past this. That said, it shouldn't impact the follow-on analysis I would like to do (sketched after the list):

  1. Update this so that the join condition is legacy client id AND end_date; the day+1 approach is not acceptable for this analysis because we are not joining "baseline" to "metrics", but "main" to "metrics"
  2. Expand the histograms out and actually calculate a couple of relevant values:
    i. calculate the 50th, 90th and 99.5th percentiles for a given window (first week of July? One week ought to be adequate; one day may actually be fine too)
  3. Determine whether these differences are small enough not to represent a material difference
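The join would look something like this (a sketch only: table and column paths are assumptions, submission date stands in for end_date, and the legacy histogram is assumed to be stored as a JSON string; pulling percentiles out of the raw histograms needs more machinery, see bug 1785247):

  -- Join Glean "metrics" to legacy "main" on legacy client id AND date, then compare per-client sums.
  WITH glean AS (
    SELECT
      m.metrics.uuid.legacy_telemetry_client_id AS legacy_id,
      DATE(m.submission_timestamp) AS d,
      SUM(m.metrics.timing_distribution.wr_renderer_time.sum) / 1e6 AS glean_sum_ms  -- nanos -> millis
    FROM `mozdata.firefox_desktop.metrics` AS m
    WHERE DATE(m.submission_timestamp) BETWEEN '2022-07-01' AND '2022-07-07'
      AND m.metrics.uuid.legacy_telemetry_client_id IS NOT NULL
      AND m.metrics.uuid.legacy_telemetry_client_id != 'c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0'
    GROUP BY 1, 2
  ),
  legacy AS (
    SELECT
      t.client_id,
      DATE(t.submission_timestamp) AS d,
      SUM(CAST(JSON_EXTRACT_SCALAR(t.payload.processes.gpu.histograms.wr_renderer_time, '$.sum') AS INT64)) AS legacy_sum_ms
    FROM `mozdata.telemetry.main` AS t
    WHERE DATE(t.submission_timestamp) BETWEEN '2022-07-01' AND '2022-07-07'
    GROUP BY 1, 2
  )
  SELECT
    COUNT(*) AS joined_client_days,
    AVG(glean.glean_sum_ms - legacy.legacy_sum_ms) AS mean_sum_diff_ms
  FROM glean
  JOIN legacy
    ON legacy.client_id = glean.legacy_id
    AND legacy.d = glean.d;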

Something I think is interesting here: I detected more legacy client ids in Glean "metrics" than we did in legacy "main"; this is not what we would have expected given the client counts reported in GLAM.

I will check legacy ids in baseline vs main when I return, as that should have given adequate time to collect some data.

Another item I would like to cross off the list is to write a test for very large values to see if Glean and legacy handle them differently. We already know that Glean has significantly superior handling of very small values, so that does not need to be tested in addition to very large values.

Things this isn't:

  1. Bucket widths or limits. The Telemetry histogram is exponential with a high of 1000 and 50 buckets. Around/above 330ms that means bucket widths of 40-45ms and plenty of room before hitting the highest bucket. The Glean timing_distribution has unit microsecond. Around/above 330ms we're looking at widths of 30-35ms and an overflow bucket near 7 days. (A bucket-width sketch follows this list.)
  2. Overflows. Though I originally thought overflowed samples were the culprit, that turned out not to be the case. There's no evidence of invalid_overflow errors in Nightly over the relevant interval. Any duration that would overflow Telemetry (i.e., over 2^32 ms) would cause an overflow error in Glean once it got above 7 days (for microsecond timing_distributions), so unless these overflow values are only happening in clients that we are only getting "main" pings from and not "metrics", this seems unlikely.
    • The "parent" process of WR_RENDERER_TIME is often in the overflow bucket. If these values were meaningful to this analysis, and not just coming from a bare few hundred clients, they would have caused the Glean-reported 99.9 %ile to be higher than Telemetry's, which is the reverse of what we see. So they can be ignored.
  3. Processes. WR_RENDERER_TIME is collected on gpu and parent processes separately. wr.renderer_time is collected on all processes altogether. Unless there is a non-parent, non-gpu process recording low-value samples for wr.renderer_time (thus bringing down its 99.9 %ile), it seems unlikely to be the culprit. The sum of sums of Telemetry across both process types and the sum of sums on Glean over identical submission ranges are within 23% of each other.
    • It could still be processes if there's something non-random about the kinds of samples that are missed by the different approaches to IPC in Telemetry and Glean. Telemetry operates on a 2s timer. Glean tries to operate on idle and on clean shutdowns. There are known and documented issues with both.
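To sanity-check the Glean half of point 1: Glean timing_distributions use functional bucketing, assumed here to be 8 log2 buckets per power of two over nanosecond samples (the SDK's bucketing code is the source of truth). Locating the bucket that holds a 330ms sample:

  -- Which functional bucket holds a 330ms (3.3e8 ns) sample, and how wide is it?
  WITH sample AS (SELECT 330 * 1e6 AS ns)
  SELECT
    FLOOR(8 * LOG(ns, 2)) AS bucket_index,
    POW(2, FLOOR(8 * LOG(ns, 2)) / 8) / 1e6 AS bucket_lower_ms,
    POW(2, (FLOOR(8 * LOG(ns, 2)) + 1) / 8) / 1e6 AS bucket_upper_ms
  FROM sample;

That works out to a bucket of roughly 319ms to 348ms, about 29ms wide, in the same ballpark as the 30-35ms figure above.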

Things we expect to be different:

  • The number of clients and pings should be greater when measured by Telemetry "main" ping than by Glean "metrics" ping.
  • The values within timing_distributions like wr.renderer_time in Glean "metrics" pings should be "fuller" (more samples, bigger sums) than the Histograms in Telemetry "main" pings.
    • These two points are related and are in fact two sides of the same characteristic: "metrics" pings are sent less frequently than "main" pings.
  • The values self-reported via existing installs should not line up unless it's a fresh install. I.e., if you go to about:glean and run
(Services.telemetry.getSnapshotForHistograms().parent["WR_RENDERER_TIME"]?.sum ?? 0) + (Services.telemetry.getSnapshotForHistograms().gpu["WR_RENDERER_TIME"]?.sum ?? 0) 

and compare it to

Glean.wr.rendererTime.testGetValue().sum / 1000000  // Glean is in nanos, Telemetry in millis 

you would expect the latter to be larger because it can contain values from previous subsessions, whereas the values snapshotted from Telemetry are only since the last time a "main" ping was sent or the most recent start of Firefox (whichever's more recent).


I confess to being a bit mystified at this point. I tried to take heart in the fact that the values, though shifted, move at the same times and in the same ways, but the Telemetry histogram is so coarse that it's difficult to see whether the improvement shown in Glean in the 99.9 %ile starting Aug 12 would even show up in Telemetry. I feel as though, coarse as it is, WR_RENDERER_TIME should show it... but it's not there.

But lo, see that the client volume figure per build_id does track. The dip starting July 21 21:00, the spike for Aug 8 03:40... the features are clear, so we can be confident that both systems are sampling populations that behave similarly in aggregate.


I'm afraid this doesn't necessarily help much as we still don't have clear evidence that this shift in the data is something to expect due to the design differences between Glean and Telemetry. I haven't seen anything here that shakes my certainty that a given duration is being reported identically to both systems (in their respective units) via GIFFT, but I have yet to see anything that could convince someone who might be doubting it.

The way I'd tackle this next to ensure that it is an expected difference would be to write a nightly-only custom ping with wr.renderer_time in it (and the legacy client_id) that's sent every time a "main" ping is sent. If the differences we're seeing are due to scheduling, we should see this new custom-ping-reported value come in with many fewer differences in client count and values (it still won't be identical). If the differences are due to GIFFT-introduced artefacts, we should still see them regardless of the schedule.

This isn't the Data Science way, but I'm not a Data Scientist. So maybe I should play to my strengths?

(In reply to Chris H-C :chutten from comment #5)

The way I'd tackle this next to ensure that it is an expected difference would be to write a nightly-only custom ping with wr.renderer_time in it (and the legacy client_id) that's sent every time a "main" ping is sent. If the differences we're seeing are due to scheduling, we should see this new custom-ping-reported value come in with many fewer differences in client count and values (it still won't be identical). If the differences are due to GIFFT-introduced artefacts, we should still see them regardless of the schedule.

This isn't the Data Science way, but I'm not a Data Scientist. So maybe I should play to my strengths?

Should we consider doing a Nightly only experiment that changes the schedule of the metrics ping in FOG to match the main ping? This would help with getting some preliminary data for bug 1784301 in addition to supporting your investigation.

(In reply to Alessio Placitelli [:Dexter] from comment #6)

(In reply to Chris H-C :chutten from comment #5)

The way I'd tackle this next to ensure that it is an expected difference would be to write a nightly-only custom ping with wr.renderer_time in it (and the legacy client_id) that's sent every time a "main" ping is sent. If the differences we're seeing are due to scheduling, we should see this new custom-ping-reported value come in with many fewer differences in client count and values (it still won't be identical). If the differences are due to GIFFT-introduced artefacts, we should still see them regardless of the schedule.

This isn't the Data Science way, but I'm not a Data Scientist. So maybe I should play to my strengths?

Should we consider doing a Nightly only experiment that changes the schedule of the metrics ping in FOG to match the main ping? This would help with getting some preliminary data for bug 1784301 in addition to supporting your investigation.

That would be far more expensive. The "metrics" ping is scheduled within the SDK. Spiking it to be triggered instead of scheduled, vendoring that release, then hooking it up to Desktop gated on a Nimbus Feature would be... not insignificant amounts of work. And that for a bug that still needs a proposal to look at its ramifications.

I'd prefer to keep this scoped down to a temporary and disposable custom ping.

Depends on: 1785251

While I'm waiting for bug 1785251 to land and send data and stuff, I poked around a bit more at the idea that the populations we're receiving data from are different. Not that the populations we're receiving "metrics" pings from are different from the populations we receive "main" pings from (that is known and acknowledged), but that wr.renderer_time is being received from different populations than WR_RENDERER_TIME, even taking the "metrics" vs "main" populations into account.

I'm not a good enough statistician to say whether any of it is significant. But the geographies of "metrics" vs "main" ping client counts are visually dissimilar when you compare all "metrics" pings vs all "main" pings vs only the "metrics" and "main" pings containing WR_RENDERER_TIME/wr.renderer_time. And neither of them looks like the geo distribution of clients reporting WR_RENDERER_TIME/wr.renderer_time in either Legacy Telemetry "main" pings or Glean "metrics" pings.
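The geo comparison was of this general shape (a sketch; the table name, the metadata.geo.country column, the date range, and the presence check are all assumptions):

  -- Client counts by country: all "metrics" pings vs only those containing wr.renderer_time.
  SELECT
    m.metadata.geo.country AS country,
    COUNT(DISTINCT m.client_info.client_id) AS all_clients,
    COUNT(DISTINCT IF(m.metrics.timing_distribution.wr_renderer_time.sum IS NOT NULL,
                      m.client_info.client_id, NULL)) AS wr_clients
  FROM `mozdata.firefox_desktop.metrics` AS m
  WHERE DATE(m.submission_timestamp) BETWEEN '2022-08-01' AND '2022-08-07'
  GROUP BY country
  ORDER BY all_clients DESC;

The same query shape against the legacy "main" table gives the Telemetry side of the comparison.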

Could we be seeing a geographic effect of the measurement and scheduling differences? Weirder things have shown up in the data.

I think the data from bug 1785251 is gonna be very interesting.

I've been digging more into the meat of WR_RENDERER_TIME because we want to understand why the upper percentiles look so poor. This has led to some interesting results that may explain some of the discrepancies between the GIFFT-mirrored legacy data and Glean, because of how we choose to aggregate the data.

Specifically, what we know is that large outliers in WR_RENDERER_TIME are the result of shader compilation occurring. This almost always happens at the start of a session with a new build. What we found out is that there are a lot of really short sessions, and particularly on Nightly those are likely to be new builds.

When we do not normalize per client/build ID, the 99.9th percentile drops from ~400 to ~40, meaning aggregation and normalization have a huge impact here. If we are aggregating differently in Glean, or, perhaps more impactfully, are not recording the short sessions that contain all these immense outliers, this may help explain some of the difference?
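To make the normalization effect concrete, here is the rough shape of aggregating the same bucketed samples two ways, per-sample and per-client (a sketch against the Glean table only; the GLAM ETL's actual implementation lives in bigquery-etl and differs in detail):

  -- Raw sample weights vs GLAM-style per-client normalization.
  WITH per_client AS (
    SELECT
      m.client_info.client_id AS client_id,
      CAST(v.key AS INT64) AS bucket,
      SUM(v.value) AS n
    FROM `mozdata.firefox_desktop.metrics` AS m,
         UNNEST(m.metrics.timing_distribution.wr_renderer_time.values) AS v
    WHERE DATE(m.submission_timestamp) = '2022-08-01'
    GROUP BY client_id, bucket
  ),
  weighted AS (
    SELECT
      bucket,
      n,
      n / SUM(n) OVER (PARTITION BY client_id) AS client_share
    FROM per_client
  )
  SELECT
    bucket,
    SUM(n) AS raw_weight,             -- every sample counts equally: short sessions barely register
    SUM(client_share) AS norm_weight  -- every client counts equally: a short session's outliers carry a big share
  FROM weighted
  GROUP BY bucket
  ORDER BY bucket;

Percentiles computed over raw_weight vs norm_weight can differ wildly in the tail, which matches the ~400 to ~40 swing above.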

"main" pings are submitted at (amongst other times) app shutdown. "metrics" pings are submitted at (amongst one or two other minor cases) or after local 4am (whenever Firefox Desktop is awake to send it).

If these short sessions are from clients we don't hear from again (not single-session clients; from those we wouldn't get any "main" pings at all, see "first-shutdown" pings), then Glean will never be triggered to send the "metrics" ping. The "metrics" ping record will always (modulo sessions active at 4am) be at least one "main" ping 'behind' in the ping record, and so if these final pings are likely to have particularly large values, then the "metrics" ping record would be biased to not include these large values.

Also, these short-session clients will have fewer samples than long-session clients. This means that they will contribute a heavier weight to these higher buckets. For example, if two sessions start identically but one ends after 20 samples in wr.renderer_time and the other after 200 samples, the first would attribute a weight of 0.05 to its awful first sample taken during shader compilation, whereas the second client would attribute only 0.005. The combination of Glean's "metrics" pings not reporting those short sessions (because the clients are gone) plus GLAM's client-focused weighting would positively reinforce each other in a way that could be noticed in extreme values like a 99.9th %ile.

This could be it!

(( If this is the case, the "pseudo-main" ping in bug 1785251 should show a diminished but not absent effect. Submitting "pseudo-main" at the same time as "main" means we'll only be at most one "main" ping behind in the ongoing ping record. ))

Also if this is the case I would expect the metric added in bug 1784286 (wr.renderer_time_no_sc/WR_RENDERER_TIME_NO_SC_MS) to not show the same effect. It landed in 20220817091029, so I would expect to start seeing data for it real soon.

As hypothesized, the 99.9th %iles of the no_sc variants are very close to each other (Legacy: 29-33ms, Glean: 31-37ms). The remaining difference is partially due to bucketing coarseness in Legacy, with much of the rest likely due to the 1700 or so clients reporting these values from the parent process (99.9th %ile of 53-95ms), which are separated out from the gpu process in Legacy but not in Glean.

Additional effects may also be due to the client volume per build_id in Glean, which remains an easy 30% smaller. The values captured in the "pseudo-main" ping should come from a broader client volume.

Preliminarily speaking, in client counts and ping counts, "pseudo-main" is performing exactly as expected: close to but not exactly like "main": https://sql.telemetry.mozilla.org/queries/87423/source#216618

Looking at sums of wr.renderer_time and WR_RENDERER_TIME (because sums are easier to deal with than %iles in SQL (bug 1785247)), though, I'm not seeing the same accordance. It seems as though "pseudo-main" hews more closely to "metrics" than to "main": https://sql.telemetry.mozilla.org/queries/87481/source#216756 (both Glean pings' sums of sums being higher or lower than "main"'s (grouped by build)). Some of this might be due to millisecond truncation/rounding to fit the otherwise-nanosecond samples into Telemetry's idiom... perhaps. I don't have any strong evidence of this.

Anyhoo, that's only from a few builds, and we know there are vast differences in how quickly we get "all" of a build's pings depending on schedule, so maybe I'm subdividing them in an unsound manner. More to think on and look into next time.

A reminder for myself to look into this again and to remove the "pseudo-main" ping if we're done with it. It might be causing us some grief on the pipeline side over in DSRE-1104.

Flags: needinfo?(chutten)

I don't see us needing the "pseudo-main" ping any longer, so I'll finish this bug off by removing it.

Flags: needinfo?(chutten)

Reverts "Bug 1785251 - Submit a Glean 'pseudo-main' ping when Telemetry submits a 'main' ping r=TravisLong"

Are you planning on looking into the data again or the insights you and Perry have on this are enough? If so, would you kindly do a tl;dr in response to comment 0?

Flags: needinfo?(chutten)
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8cc7d2e7bedc
Remove "pseudo-main" ping r=TravisLong

(In reply to Alessio Placitelli [:Dexter] from comment #16)

Are you planning on looking into the data again or the insights you and Perry have on this are enough? If so, would you kindly do a tl;dr in response to comment 0?

Right, sorry about that.

As of time of writing, the number of clients and the aggregate statistics we're getting out of them are within what we expect. For example, the Nightly builds of Nov 11 in Telemetry and Glean both have population sizes and aggregate statistics within range of the acknowledged population size differences between "main"-sending and "metrics"-sending clients for the no_sc variants.

Unfortunately I can't easily show on GLAM how this is due to ping scheduling, as it combines metrics sent in multiple pings in intriguing ways not conducive to this sort of analysis. But I can show that at least the client volumes line up on the same Nov 11 build between Glean's "pseudo-main" ping and Legacy Telemetry's "main" ping. The (still quite large, but proportional) differences in population seen in those queries I attribute to the absence of pingsender for "pseudo-main". We already know that there are conditions that lead to Legacy Telemetry "main"-sending clients with a few large samples (thus the advent of the no_sc variant) that Glean misses... "pseudo-main" can only correct for that by so much without re-architecting.

I am satisfied that the differences now remaining are explained by known factors which include:

  • How the original non-no_sc wr_renderer_time has that interesting character with shader compilation, resulting in multi-modal distributions which, coupled with both the "metrics" ping's schedule at the beginning of the next session and GLAM's per-client aggregation, resulted in the inflation of the 99.9 %ile
  • "metrics" ping scheduling reaching a narrower population than the "main" ping
  • "metrics" ping scheduling reaching a different population (fewer multi-session but single-day sorts of folks) than the "main" ping

Follow-up activities include:

  • Improving the ability to get and compare aggregate statistics (like 99.9%ile) from Glean timing_distribution metrics: bug 1785247
  • Ascertaining whether GLAM's odd handling for Glean metrics in multiple pings is correct: https://mozilla.slack.com/archives/CB1EQ437S/p1668459565426479
  • Continuing to noodle how much our choice of ping scheduling affects our view of our client population
Flags: needinfo?(chutten)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Comment on attachment 9302952 [details]
Bug 1784297 Remove "pseudo-main" ping r?TravisLong!

Beta/Release Uplift Approval Request

  • User impact if declined: Continuing to send an unnecessary ping
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Not risky as it's removal of temporary data collection
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9302952 - Flags: approval-mozilla-beta?

Comment on attachment 9302952 [details]
Bug 1784297 Remove "pseudo-main" ping r?TravisLong!

Approved for 108.0b4

Attachment #9302952 - Flags: approval-mozilla-beta? → approval-mozilla-beta+