Can we remove legacy telemetry GC_* histograms that already exist in Glean?
Categories
(Core :: XPConnect, task)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox141 | --- | fixed |
People
(Reporter: florian, Assigned: florian)
Attachments
(2 files, 1 obsolete file)
GC_* legacy telemetry histograms were duplicated in Glean in bug 1877843 and bug 1932686.
Are these legacy telemetry histograms still needed? If not, I'm happy to produce a patch to remove them. If we still need them, I might need to look at them one by one to check whether the Glean metrics are compatible with the existing legacy histograms in a way that would let them use GIFFT (telemetry_mirror) to have Glean mirror the data to legacy telemetry automatically, so that I can remove the direct calls to Telemetry::Accumulate.
Comment 1•6 months ago
(In reply to Florian Quèze [:florian] from comment #0)
The plan was to wait a cycle or so to check that the results we're getting from telemetry match up with the legacy histograms.
I think Denis said he would remove them eventually.
Comment 2•6 months ago
Comment 3•6 months ago
Yes, I'd like to compare some of the legacy and Glean data with some queries before we remove them. I also need to reland the GC_GLEAN_SLOW_PHASE and GC_GLEAN_SLOW_TASK probes, since I had to back those out due to a crash.
Comment 4•3 months ago
I've compared the results for the new probes and summarised the results here: https://docs.google.com/document/d/1Ym6HStvd8IeCj3iPoYBVVMTSR3w8x7gZZjg203zolXI/edit?usp=sharing
Things seem to mostly line up, but there are some questionable differences. I'd like to understand what's going on here if anyone has any theories.
Comment 5•3 months ago
:jonco and I have looked into the data and found that Glean-sent data continues to be of equal or higher quality than Legacy Telemetry-sent data (e.g. GC_PARALLEL_MARK vs javascript.gc.parallel_mark_used). However, we've found some GLAM issues (https://mozilla-hub.atlassian.net/browse/DENG-8641, https://mozilla-hub.atlassian.net/browse/DENG-8644) that need to be resolved before we can build sufficient confidence for the patch on this bug to be accepted.
There's also the as-yet-unexplored question of variability in the summary statistics of timing metrics (i.e., the sparklines on GC_MS jump less high and fall less far than the sparklines on javascript.gc.total_time). As near as we can tell, GLAM's calculations here are not incorrect (though they have a y-axis issue). I suspect that bucketing and the parent/content split in Legacy data are contributing here, but dealing with distribution data in SQL is annoyingly complicated, so I haven't yet figured out a way to look deeper.
For anyone following along at home, recall that we have documentation about how to compare legacy and Glean in GLAM. We found each and every kind of discrepancy listed in those docs in our investigation.
Comment 6•3 months ago
(In reply to Chris H-C :chutten from comment #5)
Thanks for going through all this with me this afternoon.
For me the biggest issue is the extra variability. I'd like to know where that's coming from.
Comment 7•3 months ago
It seemed worth the effort, so I decided to write the necessary SQL to grab summary statistics for javascript.gc.total_time and its Legacy counterpart GC_MS so we could compare them more deeply.
And wouldn't you know it, they agree pretty much completely? I mean, javascript.gc.total_time has higher-resolution bucketing, meaning that it might pick up smaller fluctuations or soften larger ones compared to GC_MS (and I kept hitting a 100MB row limit when trying to calculate the client_count for GC_MS, so it's just a flat 0 on the plot), but I had to put labels and titles on the plots because otherwise I couldn't tell them apart. Their data both appear to be equally variable.
So if all we need is confidence in the data, I think maybe we have enough of that now?
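For anyone wanting to poke at this themselves, here is a sketch of the kind of query involved. It's an illustrative reconstruction, not the exact SQL used here: the `mozdata.firefox_desktop.metrics` table and the `metrics.timing_distribution.javascript_gc_total_time` column path follow the standard Glean-in-BigQuery layout, and the date filter is a placeholder. It computes an approximate per-"build day" median by expanding the distribution buckets:

```sql
-- Illustrative sketch (not the exact query used): approximate per-"build day"
-- median of javascript.gc.total_time, computed from the distribution buckets.
WITH per_bucket AS (
  SELECT
    SUBSTR(client_info.app_build, 1, 8) AS build_day,
    CAST(v.key AS INT64) AS bucket,  -- bucket lower bounds, stored as strings
    SUM(v.value) AS samples
  FROM `mozdata.firefox_desktop.metrics`,
    UNNEST(metrics.timing_distribution.javascript_gc_total_time.values) AS v
  WHERE DATE(submission_timestamp) >= DATE '2025-01-01'  -- placeholder range
  GROUP BY build_day, bucket
),
cumulative AS (
  SELECT
    build_day,
    bucket,
    SUM(samples) OVER (PARTITION BY build_day ORDER BY bucket) AS running,
    SUM(samples) OVER (PARTITION BY build_day) AS total
  FROM per_bucket
)
-- The approximate median is the smallest bucket whose running total reaches
-- half of all samples recorded for that build day.
SELECT
  build_day,
  MIN(bucket) AS approx_median_bucket
FROM cumulative
WHERE running >= total / 2
GROUP BY build_day
ORDER BY build_day
```

The Legacy side is the annoying part: in the main ping tables the GC_MS histogram arrives as a JSON string that has to be parsed apart before you can do roughly the same arithmetic, which is part of why distribution data in SQL is annoyingly complicated.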
Weirdly, the variability is somewhat lower in those plots than in GLAM. I pushed deeper.
Ah, so, for Firefox Desktop, GLAM aggregates by build_id (minimum client_count 375) for Legacy, and by "build hour" (minimum client_count 100) for Glean. This means that Legacy aggregates are only computed for specific build_ids with some uptake, while Glean aggregates are computed for groups of build_ids from the same hour, with a 3x lower uptake threshold. And my plots above consolidated all builds from the same day ("build day"), which handled the same problem (low-uptake builds are likely to have unrepresentative data) in a different way (combining low-uptake builds with high-uptake builds).
And wouldn't you know it, if you look at GC_MS by build_id (min 375 clients) and compare it to javascript.gc.total_time by "build hour" (min 100 clients), there's that higher variability in javascript.gc.total_time. And if you look at javascript.gc.total_time by build_id (min 375 clients), it once again agrees very closely with GC_MS.
We've filed DENG-8646 to make the grouping and filtering consistent between the two systems.
It appears as though, once again, this was an artifact of GLAM's aggregation and display.
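To make the regrouping concrete, here is another illustrative sketch (assuming the same table and column paths as the sketch above, not the exact query used): aggregating the Glean data by full build_id and applying GLAM's Legacy-style 375-client floor.

```sql
-- Illustrative sketch: aggregate javascript.gc.total_time by full build_id
-- and apply a 375-client minimum, mirroring GLAM's Legacy grouping so the
-- comparison with GC_MS is apples-to-apples.
SELECT
  client_info.app_build AS build_id,
  COUNT(DISTINCT client_info.client_id) AS client_count,
  -- Mean of all recorded samples: total of the per-ping distribution sums
  -- divided by the total sample count across all buckets.
  SAFE_DIVIDE(
    SUM(metrics.timing_distribution.javascript_gc_total_time.sum),
    SUM((SELECT SUM(v.value)
         FROM UNNEST(metrics.timing_distribution.javascript_gc_total_time.values) AS v))
  ) AS mean_time
FROM `mozdata.firefox_desktop.metrics`
WHERE DATE(submission_timestamp) >= DATE '2025-01-01'  -- placeholder range
GROUP BY build_id
HAVING client_count >= 375  -- GLAM's Legacy-style low-uptake filter
ORDER BY build_id
```

Since Firefox build_ids are YYYYMMDDhhmmss timestamps, dropping the HAVING floor to 100 and grouping by SUBSTR(client_info.app_build, 1, 10) instead approximates GLAM's Glean-side "build hour" grouping, i.e. the apples-to-oranges comparison described above.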
Comment 8•3 months ago
Comment 9•3 months ago
(This has also been merged)
Comment 10•3 months ago
GLAM's now fixed, but without a costly backfill the comparable filtering will only take effect going forward. :jonco, do you think we can move ahead here anyway?
Comment 11•3 months ago
Many thanks to chutten and team for investigating and fixing all the issues that came up during this investigation. I'm now happy that the Glean output is substantially the same as for the legacy probes so we can remove them.
Comment 12•3 months ago
Comment 13•3 months ago
bugherder