Closed Bug 1949494 Opened 6 months ago Closed 3 months ago

Can we remove legacy telemetry GC_* histograms that already exist in Glean?

Categories

(Core :: XPConnect, task)

task

Tracking

()

RESOLVED FIXED
141 Branch
Tracking Status
firefox141 --- fixed

People

(Reporter: florian, Assigned: florian)

References

Details

Attachments

(2 files, 1 obsolete file)

GC_* legacy telemetry histograms were duplicated in Glean in bug 1877843 and bug 1932686.

Are these legacy telemetry histograms still needed? If not I'm happy to produce a patch to remove them. If we still need them, I might need to look a them one by one to check if the Glean metrics are compatible with the existing legacy histograms in a way that would let them use GIFFT (telemetry_mirror) to have Glean mirror the data to legacy telemetry automatically, so that I can remove the direct call to Telemetry::Accumulate.

Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(dpalmeiro)
Summary: Can we remove legacy telemetry GC_* histograms that alraedy exist in Glean? → Can we remove legacy telemetry GC_* histograms that already exist in Glean?

(In reply to Florian Quèze [:florian] from comment #0)
The plan was to wait a cycle or so to check the results we're getting from telemetry match up with the legacy histograms.

I think Denis said he would remove them eventually.

Flags: needinfo?(jcoppeard)
Assignee: nobody → florian
Status: NEW → ASSIGNED

Yes, I'd like to compare some of the legacy and glean data with some queries before we remove them. I also need to reland the GC_GLEAN_SLOW_PHASE and GC_GLEAN_SLOW_TASK probes since I had to back those out due to a crash.

Flags: needinfo?(dpalmeiro)
Blocks: 1950710
No longer blocks: 1944631
Blocks: 1956726
No longer blocks: 1950710
Flags: needinfo?(sphink)

I've compared the results for the new probes and summarised the results here: https://docs.google.com/document/d/1Ym6HStvd8IeCj3iPoYBVVMTSR3w8x7gZZjg203zolXI/edit?usp=sharing

Things seem to mostly line up but there are some questionable differenences. I'd like to understand what's going on here if anyone has any theories.

:jonco and I have looked into the data and found that Glean-sent data continues to be of equal or higher quality than Legacy Telemetry-sent data (e.g. GC_PARALLEL_MARK vs javascript.gc.parallel_mark_used). However, we've found some GLAM issues (https://mozilla-hub.atlassian.net/browse/DENG-8641, https://mozilla-hub.atlassian.net/browse/DENG-8644) that need to be resolved before we can bolster sufficient confidence for the patch on this bug to be accepted.

There's also the as-yet-unexplored question of variability in the summary statistics of timing metrics (ie, the sparklines on GC_MS jump less high and fall less far than the sparklines on javascript.gc.total_time). Near as we can tell, GLAM's calculations here are not incorrect (though have a y-axis issue). I suspect that bucketing and the parent/content split in Legacy data are contributing here, but dealing with distribution data in SQL is annoyingly complicated, so I haven't yet figured out a way to look deeper.

For anyone following along at home, recall that we have documentation about how to compare legacy and Glean in GLAM. We found each and every kind of discrepancy listed in those docs in our investigation.

(In reply to Chris H-C :chutten from comment #5)
Thanks for going through all this with me this afternoon.

For me the biggest issue is the extra variability. I'd like to know where that's coming from.

It seemed worth the effort, so I decided to write the necessary SQL to grab summary statistics for javascript.gc.total_time and its Legacy counterpart GC_MS so we could compare them more deeply.

And wouldn't you know it, they agree pretty much completely? I mean, javascript.gc.total_time has higher-resolution bucketing meaning that it might pick up smaller fluctuations or soften larger ones compared to GC_MS (and I kept hitting a 100MB row limit when trying to calculate the client_count for GC_MS so it's just a flat 0 on the plot). but I had to put labels and titles on the plots because otherwise I couldn't tell them apart. Their data both appear to be equally variable.

So if all we need is confidence in the data, I think maybe we have enough of that now?

Weirdly, the variability is somewhat lower in those plots than in GLAM. I pushed deeper.

Ah, so, for Firefox Desktop, GLAM aggregates by build_id (minimum client_count 375) for Legacy, and by "build hour" (minimum client_count 100) for Glean. This means that Legacy aggregates are only computed for specific build_ids with some uptake in Legacy, and for groups of build_ids that happened on the same hour with 3x lower uptake in Glean. And my plots above were consolidating all builds from the same day ("build day") which handled the same problem (low-uptake builds likely to have unrepresentative data) in a different way (combine low-uptake builds with high-uptake builds).

And wouldn't you know it, if you look at GC_MS by build_id min 375 clients and compare it to javascript.gc.total_time by "build hour" min 100 clients there's that higher variability in javascript.gc.total_time. And if you look at javascript.gc.total_time by build_id min 375 clients it once again agrees very closely with GC_MS.

We've filed DENG-8646 to make the grouping and filtering consistent between the two systems.

It appears as though, once again, this was an artifact of GLAM's aggregation and display.

Attachment #9467642 - Attachment is obsolete: true
Attachment #9467642 - Attachment is obsolete: false
Attachment #9490953 - Attachment is obsolete: true

(This has also been merged)

GLAM's now fixed, but without a costly backfill the comparable filtering will only take effect going forward. :jonco, do you think we can move ahead here anyway?

Flags: needinfo?(jcoppeard)

Many thanks to chutten and team for investigating and fixing all the issues that came up during this investigation. I'm now happy that the Glean output is substantially the same as for the legacy probes so we can remove them.

Flags: needinfo?(jcoppeard)
No longer blocks: 1956726
Depends on: 1956726
Pushed by fqueze@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/a7403c9afcee https://hg.mozilla.org/integration/autoland/rev/03a596089d21 remove the GC_* legacy telemetry histograms that already exist in Glean, r=denispal,jonco.
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 141 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: