Can we remove legacy telemetry GC_* histograms that already exist in Glean?
Categories
(Core :: XPConnect, task)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox141 | --- | fixed |
People
(Reporter: florian, Assigned: florian)
Attachments
(2 files, 1 obsolete file)
GC_* legacy telemetry histograms were duplicated in Glean in bug 1877843 and bug 1932686.
Are these legacy telemetry histograms still needed? If not, I'm happy to produce a patch to remove them. If we still need them, I might need to look at them one by one to check whether the Glean metrics are compatible with the existing legacy histograms in a way that would let them use GIFFT (telemetry_mirror) to have Glean mirror the data to legacy telemetry automatically, so that I can remove the direct calls to Telemetry::Accumulate.
Comment 1•6 months ago
(In reply to Florian Quèze [:florian] from comment #0)
The plan was to wait a cycle or so to check that the results we're getting from telemetry match up with the legacy histograms.
I think Denis said he would remove them eventually.
Comment 2•6 months ago
Comment 3•6 months ago
Yes, I'd like to compare some of the legacy and Glean data with some queries before we remove them. I also need to reland the GC_GLEAN_SLOW_PHASE and GC_GLEAN_SLOW_TASK probes, since I had to back those out due to a crash.
Comment 4•3 months ago
I've compared the results for the new probes and summarised the results here: https://docs.google.com/document/d/1Ym6HStvd8IeCj3iPoYBVVMTSR3w8x7gZZjg203zolXI/edit?usp=sharing
Things seem to mostly line up, but there are some questionable differences. I'd like to understand what's going on here if anyone has any theories.
Comment 5•3 months ago
:jonco and I have looked into the data and found that Glean-sent data continues to be of equal or higher quality than Legacy Telemetry-sent data (e.g. GC_PARALLEL_MARK vs javascript.gc.parallel_mark_used). However, we've found some GLAM issues (https://mozilla-hub.atlassian.net/browse/DENG-8641, https://mozilla-hub.atlassian.net/browse/DENG-8644) that need to be resolved before we can build sufficient confidence for the patch on this bug to be accepted.
There's also the as-yet-unexplored question of variability in the summary statistics of timing metrics (i.e., the sparklines on GC_MS jump less high and fall less far than the sparklines on javascript.gc.total_time). As near as we can tell, GLAM's calculations here are not incorrect (though they have a y-axis issue). I suspect that bucketing and the parent/content split in Legacy data are contributing here, but dealing with distribution data in SQL is annoyingly complicated, so I haven't yet figured out a way to look deeper.
For anyone following along at home, recall that we have documentation about how to compare legacy and Glean in GLAM. We found each and every kind of discrepancy listed in those docs in our investigation.
Comment 6•3 months ago
(In reply to Chris H-C :chutten from comment #5)
Thanks for going through all this with me this afternoon.
For me the biggest issue is the extra variability. I'd like to know where that's coming from.
Comment 7•3 months ago
It seemed worth the effort, so I decided to write the necessary SQL to grab summary statistics for javascript.gc.total_time and its Legacy counterpart GC_MS so we could compare them more deeply.
And wouldn't you know it, they agree pretty much completely? I mean, javascript.gc.total_time has higher-resolution bucketing, meaning that it might pick up smaller fluctuations or soften larger ones compared to GC_MS (and I kept hitting a 100MB row limit when trying to calculate the client_count for GC_MS, so it's just a flat 0 on the plot), but I had to put labels and titles on the plots because otherwise I couldn't tell them apart. Their data both appear to be equally variable.
So if all we need is confidence in the data, I think maybe we have enough of that now?
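For anyone wanting to poke at this themselves, here is a sketch of the kind of query involved. It's an illustrative reconstruction, not the exact SQL used here: the `mozdata.firefox_desktop.metrics` table and the `metrics.timing_distribution.javascript_gc_total_time` column path follow the standard Glean-in-BigQuery layout, and the date filter is a placeholder. It computes an approximate per-"build day" median by expanding the distribution buckets:

```sql
-- Illustrative sketch (not the exact query used): approximate per-"build day"
-- median of javascript.gc.total_time, computed from the distribution buckets.
WITH per_bucket AS (
  SELECT
    SUBSTR(client_info.app_build, 1, 8) AS build_day,
    CAST(v.key AS INT64) AS bucket,  -- bucket lower bounds, stored as strings
    SUM(v.value) AS samples
  FROM `mozdata.firefox_desktop.metrics`,
    UNNEST(metrics.timing_distribution.javascript_gc_total_time.values) AS v
  WHERE DATE(submission_timestamp) >= DATE '2025-01-01'  -- placeholder range
  GROUP BY build_day, bucket
),
cumulative AS (
  SELECT
    build_day,
    bucket,
    SUM(samples) OVER (PARTITION BY build_day ORDER BY bucket) AS running,
    SUM(samples) OVER (PARTITION BY build_day) AS total
  FROM per_bucket
)
-- The approximate median is the smallest bucket whose running total reaches
-- half of all samples recorded for that build day.
SELECT
  build_day,
  MIN(bucket) AS approx_median_bucket
FROM cumulative
WHERE running >= total / 2
GROUP BY build_day
ORDER BY build_day
```

The Legacy side is the annoying part: in the main ping tables the GC_MS histogram arrives as a JSON string that has to be parsed apart before you can do roughly the same arithmetic, which is part of why distribution data in SQL is annoyingly complicated.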
Weirdly, the variability is somewhat lower in those plots than in GLAM. I pushed deeper.
Ah, so, for Firefox Desktop, GLAM aggregates by build_id (minimum client_count 375) for Legacy, and by "build hour" (minimum client_count 100) for Glean. This means that Legacy aggregates are only computed for specific build_ids with some uptake, while Glean aggregates are computed for groups of build_ids from the same hour, with a 3x lower uptake threshold. And my plots above consolidated all builds from the same day ("build day"), which handled the same problem (low-uptake builds are likely to have unrepresentative data) in a different way (combining low-uptake builds with high-uptake builds).
And wouldn't you know it, if you look at GC_MS by build_id (min 375 clients) and compare it to javascript.gc.total_time by "build hour" (min 100 clients), there's that higher variability in javascript.gc.total_time. And if you look at javascript.gc.total_time by build_id (min 375 clients), it once again agrees very closely with GC_MS.
We've filed DENG-8646 to make the grouping and filtering consistent between the two systems.
It appears as though, once again, this was an artifact of GLAM's aggregation and display.
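To make the regrouping concrete, here is another illustrative sketch (assuming the same table and column paths as the sketch above, not the exact query used): aggregating the Glean data by full build_id and applying GLAM's Legacy-style 375-client floor.

```sql
-- Illustrative sketch: aggregate javascript.gc.total_time by full build_id
-- and apply a 375-client minimum, mirroring GLAM's Legacy grouping so the
-- comparison with GC_MS is apples-to-apples.
SELECT
  client_info.app_build AS build_id,
  COUNT(DISTINCT client_info.client_id) AS client_count,
  -- Mean of all recorded samples: total of the per-ping distribution sums
  -- divided by the total sample count across all buckets.
  SAFE_DIVIDE(
    SUM(metrics.timing_distribution.javascript_gc_total_time.sum),
    SUM((SELECT SUM(v.value)
         FROM UNNEST(metrics.timing_distribution.javascript_gc_total_time.values) AS v))
  ) AS mean_time
FROM `mozdata.firefox_desktop.metrics`
WHERE DATE(submission_timestamp) >= DATE '2025-01-01'  -- placeholder range
GROUP BY build_id
HAVING client_count >= 375  -- GLAM's Legacy-style low-uptake filter
ORDER BY build_id
```

Since Firefox build_ids are YYYYMMDDhhmmss timestamps, dropping the HAVING floor to 100 and grouping by SUBSTR(client_info.app_build, 1, 10) instead approximates GLAM's Glean-side "build hour" grouping, i.e. the apples-to-oranges comparison described above.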
Comment 8•3 months ago
Comment 9•3 months ago
(This has also been merged)
Comment 10•3 months ago
GLAM's now fixed, but without a costly backfill the comparable filtering will only take effect going forward. :jonco, do you think we can move ahead here anyway?
Comment 11•3 months ago
Many thanks to chutten and team for investigating and fixing all the issues that came up during this investigation. I'm now happy that the Glean output is substantially the same as for the legacy probes so we can remove them.
Comment 12•3 months ago
Comment 13•3 months ago
bugherder