Validate Glean-sent Use Counter metrics' data
Categories: Toolkit :: Telemetry, task, P1
People: Reporter: chutten, Assigned: chutten
Attachments: 1 file (text/x-google-doc, 107 bytes)
Use Counters gained Glean metrics in bug 1852098 which landed in Firefox 121 beta 4 (build 20231127091758) and 122 nightly (build 20231122095244). We'd like to remove the Telemetry probes since they're a less efficient means of instrumenting the same code. Before we do that, we need some certainty that the new, slim, easier-to-use Glean metrics are valid.
This bug is about:
- Identifying the subset of Data Org's Data Validation Proposal that is relevant to such a technical reinstrumentation task
- Conducting system validation (checking that ping volumes and contents are as expected)
- Conducting that subset of the data validation
- Achieving acceptance of the validation, clearing the way to remove the Telemetry probes.
So let's get to it.
Comment 1 • 2 years ago (Assignee)
The overall process for data validation is:
- Identify the “new” dataset
  - `firefox_desktop.use_counters`
- Identify the “old” or “comparison” dataset
  - A `JOIN USING(document_id)` of `telemetry.main_use_counter` and `telemetry.main_remainder`
  - (It is tempting to compare to Chrome's reporting of CSS use counters, but as we haven't previously compared our Use Counter instrumentation against Chrome's, this is probably apples v. oranges (e.g. do we both count pages the same way? Docs?))
- Identify who should give input on the validation process before it begins & get their input
  - This must be Emilio, DOM Peer and Use Counter Understander
- Identify who should be made aware of the results of the process
  - **Question for Emilio:** Is there anyone other than you who you'd like to include?
- Identify who needs to be convinced of the final recommendation(s) in the results presentation and approve them
  - **Question for Emilio:** Is there anyone we should convince? Or are we who are involved sufficient to judge and approve?
- Choose Comparison Metrics
  - Rates of the following use counters divided by their respective denominators:
    - a common CSS property use counter in doc and page (e.g. CSS `width`)
    - an uncommon CSS property use counter in doc and page (e.g. CSS `math-style`)
    - a common worker JS API in dedicated, shared, and service (e.g. `console.log`)
    - an uncommon worker JS API in dedicated, shared, and service (e.g. `console.countReset`)
    - a deprecated operation in doc and page (e.g. `MutationEvent`)
  - **Question for Emilio:** Any other comparison metrics (numbers to compare) you'd like me to run? Are the examples I propose good to go ahead with?
- Decide on Types of Comparisons to Perform
  - Daily Analysis per the process (( time-series plot by submission date; min, max, mean, median; absolute and relative differences; see the sketch at the end of this comment ))
  - Cohort Analysis: Everybody, only-Windows, only-Mac, only-US, not-US, only-release, only-beta, only-nightly
  - **Question for Emilio:** Anything else you think I should run?
- Define Acceptance Criteria ("How Similar is Good Enough")
  - Qualitative: Can we use the new data to make the same decisions (deprecation, removal, prioritization) we use the existing data for?
  - Quantitative: We expect more than a 1% difference given the (deliberate) scheduling differences between Legacy Telemetry "main" pings and Glean "use-counters" pings (e.g. there may be no "main" ping from a user's first session, there are far more "main" pings overall given the many reasons they're sent, "main" pings may arrive with lower latency via `pingsender`, ...). I'm not sure what differences we're going to find or what we should tolerate. We definitely shouldn't permit order-of-magnitude differences (the new being half or double the old). We definitely shouldn't accept any rate changing its category ("everywhere", "common", "uncommon", "unused"). I propose we reify the categories to rates ("everywhere" > 95%, "common" > 50%, "uncommon" > 1%, "unused" < 1%), acknowledge we'll need an epsilon (so something going from 50.1% to 48.7% isn't considered a problematic category change), and go from there.
  - **Question for Emilio:** Are the proposed quantitative criteria (no category changes (with epsilon)) good enough? Are there other criteria we should adhere to?
- Produce a Presentation of Results
  - I'll start on this shortly. Likely as a gdoc with items called out here in the bug.
- Take Post-Validation Actions
  - TBD
Any problems with the process as proposed? Is it too much, too little? Comments? Concerns? Current Events?
(I'll be jumping in with System Validation right away, moving on to Data Validation tasks thereafter.)
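(( For concreteness, the Daily Analysis boils down to something like the sketch below. The table names are the ones above; the column names are illustrative stand-ins for the real schemas, and the legacy query would also need to sum counters across processes. ))

```sql
-- Sketch only: assumed column names, 28-day sample window chosen for illustration.
WITH glean AS (
  -- "New" dataset: Glean "use-counters" pings.
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COALESCE(metrics.counter.use_counter_css_doc_css_width, 0) AS width_doc,                -- assumed name
    COALESCE(metrics.counter.use_counter_content_documents_destroyed, 0) AS docs_destroyed  -- assumed name
  FROM `mozdata.firefox_desktop.use_counters`
  WHERE DATE(submission_timestamp) BETWEEN '2024-01-01' AND '2024-01-28'
),
legacy AS (
  -- "Old" dataset: Legacy Telemetry, joined on document_id.
  SELECT
    DATE(uc.submission_timestamp) AS submission_date,
    COALESCE(uc.use_counter2_css_property_width_doc, 0) AS width_doc,  -- assumed name
    COALESCE(r.content_documents_destroyed, 0) AS docs_destroyed       -- assumed name
  FROM `mozdata.telemetry.main_use_counter` AS uc
  JOIN `mozdata.telemetry.main_remainder` AS r
  USING (document_id)
  WHERE DATE(uc.submission_timestamp) BETWEEN '2024-01-01' AND '2024-01-28'
),
both_systems AS (
  SELECT 'glean' AS system, * FROM glean
  UNION ALL
  SELECT 'legacy' AS system, * FROM legacy
)
-- Daily "Total" rate (sum(counter) / sum(denominator)) per system, plus the reified category
-- so category changes (modulo epsilon) are easy to spot.
SELECT
  submission_date,
  system,
  SAFE_DIVIDE(SUM(width_doc), SUM(docs_destroyed)) AS width_doc_rate,
  CASE
    WHEN SAFE_DIVIDE(SUM(width_doc), SUM(docs_destroyed)) > 0.95 THEN 'everywhere'
    WHEN SAFE_DIVIDE(SUM(width_doc), SUM(docs_destroyed)) > 0.50 THEN 'common'
    WHEN SAFE_DIVIDE(SUM(width_doc), SUM(docs_destroyed)) > 0.01 THEN 'uncommon'
    ELSE 'unused'
  END AS category
FROM both_systems
GROUP BY submission_date, system
ORDER BY submission_date, system
```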
Comment 2 • 2 years ago
(In reply to Chris H-C :chutten from comment #1)
> **Question for Emilio:** Is there anyone other than you who you'd like to include?

I think sending a PSA to dev-platform / the DOM Matrix channel with the results once we're done might be good enough, wdyt?

> - Identify who needs to be convinced of the final recommendation(s) in the results presentation and approve them
>   - **Question for Emilio:** Is there anyone we should convince? Or are we who are involved sufficient to judge and approve?

I'd like to get the sanity-check of someone like maybe zcorpan, olli, nika, or peterv, just to have someone less involved with use counters stamp it.

> - Choose Comparison Metrics
>   - **Question for Emilio:** Any other comparison metrics (numbers to compare) you'd like me to run? Are the examples I propose good to go ahead with?

Those sound pretty good. Extra points for some of the counted "unknown" properties (I suspect `-webkit-tap-highlight-color` might be common, and `speak` might be uncommon).

> - Decide on Types of Comparisons to Perform
>   - **Question for Emilio:** Anything else you think I should run?

Those sound good.

> - Define Acceptance Criteria ("How Similar is Good Enough")
>   - **Question for Emilio:** Are the proposed quantitative criteria (no category changes (with epsilon)) good enough? Are there other criteria we should adhere to?

Yeah, sounds good. Ideally the epsilon is small enough, especially in the "unused" / "uncommon" categories; 0.5% vs 2%, for example, seems like a difference we should understand.

> Any problems with the process as proposed? Is it too much, too little? Comments? Concerns? Current Events?

That sounds great to me, thanks for this!
Comment 3 • 2 years ago (Assignee)
(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)
> (In reply to Chris H-C :chutten from comment #1)
> > **Question for Emilio:** Is there anyone other than you who you'd like to include?
>
> I think sending a PSA to dev-platform / the DOM Matrix channel with the results once we're done might be good enough, wdyt?
That and shouting from the mountaintops. It's fun to celebrate the removal of things we no longer need.
Comment 4 • 2 years ago (Assignee)
Some follow-up questions for you, Emilio, when you get the chance:
- Uh... it looks as though, even though they're dutifully collected and submitted, there are no columns for worker use counter data in `telemetry.main_use_counter` or `telemetry.main`. Do you know whether anyone looked at that data? And if so, how? Did they go digging in `additional_properties`, or did this previously Just Work(tm) and has since gone awry?
- I notice that when I open a fresh Firefox and then close it, six top-level content documents are destroyed. Which is odd to me since it only ever loads a single `about:newtab`. Is this expected and intended? What determines what pages (and documents) are counted for use counters? (Knowing this will help me use local testing in system validation.)
Comment 5 • 2 years ago
(In reply to Chris H-C :chutten from comment #4)
> - Uh... it looks as though, even though they're dutifully collected and submitted, there are no columns for worker use counter data in `telemetry.main_use_counter` or `telemetry.main`. Do you know whether anyone looked at that data? And if so, how? Did they go digging in `additional_properties`, or did this previously Just Work(tm) and has since gone awry?

I haven't looked at worker use counters, so I don't know. Maybe Andrew does?

> - I notice that when I open a fresh Firefox and then close it, six top-level content documents are destroyed. Which is odd to me since it only ever loads a single `about:newtab`. Is this expected and intended? What determines what pages (and documents) are counted for use counters? (Knowing this will help me use local testing in system validation.)

Depends. How are you counting them? You're saying we submit telemetry for six pages? We ignore some, like `about:` pages, `chrome:` pages, and such; see `Document::ShouldIncludeInTelemetry`.
Comment 6 • 2 years ago
> Choose Comparison Metrics
Would it work to include everything and sort by greatest delta?
Comment 7 • 2 years ago (Assignee)
(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)
> (In reply to Chris H-C :chutten from comment #4)
> > - I notice that when I open a fresh Firefox and then close it, six top-level content documents are destroyed. Which is odd to me since it only ever loads a single `about:newtab`. Is this expected and intended? What determines what pages (and documents) are counted for use counters? (Knowing this will help me use local testing in system validation.)
>
> Depends. How are you counting them? You're saying we submit telemetry for six pages? We ignore some, like `about:` pages, `chrome:` pages, and such; see `Document::ShouldIncludeInTelemetry`.
I'm counting them by checking the values that make it into the pings ("main" for Legacy, "use-counters" for Glean). You can check it too by running with something like `GLEAN_DEBUG_VIEW_TAG=emilio-dev ./mach run` and shutting down without loading a page. Do that a couple of times to generate more pings. Then you can see the Legacy "main" pings in `about:telemetry` (set the view to archived pings and filter to just "main" pings, then search for `content_document` and cycle through the pings) and the Glean pings at https://debug-ping-preview.firebaseapp.com/pings/emilio-dev.

For me, I've now found it's not six pages (I might have been misinterpreting some earlier testing of mine), but it is always at least 2, with a variable number of (not-top-level) content documents in the 3-6 range. (Maybe the viewed `about:newtab` and its cached extra instance?) You can see my Glean pings here.
(In reply to Simon Pieters [:zcorpan] from comment #6)
> > Choose Comparison Metrics
>
> Would it work to include everything and sort by greatest delta?

It'd cost in time and processing budget (we couldn't be clever like we are now and only look at the handful we're interested in), and I'd have to figure out a way to do it (presently I'm sticking to SQL, and to my knowledge there's no way to parameterize over column names; I'd probably have to use Colab), but it would work. We'd have to come up with a definition of "greatest delta" (absolute difference or relative? Does it matter if it's a rare or common property? Is it on the sum or the min? max? mean? median? etc.) and then test it across the lot.

I don't think it'd be worth it, especially since we're already expecting some differences due to ping scheduling changes. The important thing is to get a sense of the character of the differences and ensure we didn't miss anything in the implementation. The implementation is (after you ignore the codegen) very straightforward in both Legacy and Glean, so there's little to go awry on the client itself, after all.
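(( For the record, if we ever did want "everything, sorted by greatest delta", BigQuery scripting could probably generate the per-column query for us rather than going to Colab. A rough, hypothetical sketch; the column-path naming here is an assumption, not the real schema: ))

```sql
-- Hypothetical sketch: enumerate use counter columns via INFORMATION_SCHEMA and build the query dynamically.
DECLARE counter_paths ARRAY<STRING>;
DECLARE query STRING;

SET counter_paths = (
  SELECT ARRAY_AGG(field_path)
  FROM `mozdata.firefox_desktop.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
  WHERE table_name = 'use_counters'
    AND field_path LIKE 'metrics.counter.use_counter_%'  -- assumed naming
);

SET query = (
  SELECT STRING_AGG(
    FORMAT("SELECT '%s' AS counter, SUM(%s) AS total FROM `mozdata.firefox_desktop.use_counters`", p, p),
    ' UNION ALL '
  )
  FROM UNNEST(counter_paths) AS p
);

EXECUTE IMMEDIATE query;  -- the same idea would need repeating for the legacy tables to compute deltas
```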
Comment 8 • 2 years ago
Okay, so I get, with some logging, on a clean profile:
moz-extension://e05ac0da-4b3f-454e-b42c-da21e4d4d5ed/_generated_background_page.html
moz-extension://48940584-81a3-482c-b28a-b2fd68a508ce/_generated_background_page.html
moz-extension://6816b494-87ca-416f-a1bb-979cf0032d78/_generated_background_page.html
moz-extension://9ad7fc62-b15f-46d2-9ebf-bd8e9171f3cf/_generated_background_page.html
moz-extension://9ec6fc95-76af-4dea-8d47-31101807fa8d/_generated_background_page.html
https://www.mozilla.org/en-US/privacy/firefox/
We might want to ignore those generated pages from telemetry...
Comment 9 • 2 years ago (Assignee)
First progress report:
- I've written temporary tables for the sample of data we're looking at: Telemetry `mozdata.tmp.chutten_telemetry_use_counters`, Glean `mozdata.tmp.chutten_glean_use_counters`
  - To try and keep weirdness down, I only permit data from builds that Mozilla itself built (per `buildhub2`) into these samples.
- I've conducted the Daily Analyses of the Comparison Metrics against the samples
  - I learned that, because I `COALESCE`-ed `NULL`s to `0` in the Telemetry tmp table (because we have to sum across processes) but didn't in Glean, things got awkward when looking at aggregates (`AVG(0, 9, NULL, NULL)` is not the same as `AVG(0, 9, 0, 0)`). Now there's a lot more `COALESCE`-ing going on, and everything's fine. (See the sketch at the end of this comment.)
  - I added `Total` (sum(use counter) / sum(denominator)) as a kind of statistic to look at, since I anticipate this will be the most common summary statistic that'll be used.
  - Both Glean and Telemetry are dealing with outliers in the data, despite excluding non-mozilla-built builds. This mostly manifests as one or the other (or both) having max > 100% or min < 0%. It's only for a handful or two of submission dates across all the counters, but it's enough of them to make me irritable.
    - I'm not sure what's up here. Could be that these are weirdos sending us bogus data. Could be an error in the instrumentation code (unlikely). Could be an error in the derivation of the sample tables or in the analysis SQL. Could be weird integer overflow/underflow stuff. Might be worth looking into.
  - Outside of outliers, the agreement between the two systems despite their ping scheduling differences is great to see.
Next up:
I hope to get the cohort analyses written and see if there's any evidence of where these outliers are coming from. And then I need to formalize my analyses and findings and put all this in a doc.
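(( The `COALESCE` sketch mentioned above: roughly how the Telemetry tmp table sums a counter across processes. The `payload.processes.*` paths are illustrative assumptions, not the real column names. ))

```sql
-- Without COALESCE, a counter missing from one process makes the whole sum NULL,
-- and those NULL rows then silently drop out of AVG/MEDIAN downstream.
SELECT
  document_id,
  COALESCE(payload.processes.parent.use_counter2_css_property_width_doc, 0)
    + COALESCE(payload.processes.content.use_counter2_css_property_width_doc, 0) AS width_doc  -- assumed paths
FROM `mozdata.telemetry.main_use_counter`
```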
Comment 10 • 2 years ago (Assignee)
I EDA'd down into the outliers and found that they have no particular "tell"s that would suggest they're not real code sending real data. I also found they're coming from about 0.0000014% of pings in the sample, so I'm just going to filter them from the sample like so:
- Reject any ping with use counters or denominators < 0
- Reject any ping where the value of any use counter exceeds the value of its denominator (see the sketch at the end of this comment)
A mechanism that rare could be cosmic rays for all we know (actually, some of the weirdness doesn't look like single bit flips, so it's more likely API stuffing or bad hardware), and so time spent investigating them is probably better spent doing just about anything else. Like writing cohort analyses.
(( The sample filtering will happen later when I update the sample and regenerate the tmp tables. For now I'll just mentally ignore stuff caused by the weird data. ))
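(( The filter itself is just a WHERE clause over the sample tables, something like the sketch below; the column names are illustrative, and each counter/denominator pair in the sample gets the same treatment. ))

```sql
-- Drop pings with impossible values: negative counts, or a counter exceeding its denominator.
SELECT *
FROM `mozdata.tmp.chutten_glean_use_counters`
WHERE width_doc >= 0
  AND docs_destroyed >= 0
  AND width_doc <= docs_destroyed  -- repeated for each counter/denominator pair
```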
Comment 11 • 2 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)
> I haven't looked at worker use counters, so I don't know. Maybe Andrew does?
It doesn't seem like anyone on my team has really looked at / used these. My personal use cases have generally involved looking at Chrome use counters to better understand adoption in the wild of (new) APIs that we don't (yet) implement (and for which we haven't added any use counters).
That said, it does seem important to have the number of workers / their types as a denominator for those use counters.
Comment 12 • 2 years ago (Assignee)
Cohort analyses in, and a casual look makes these appear remarkably stable across cohorts. Which is good!
Mac is the one obvious weird one, but I expected this because "main" pings are weird on Mac. You can have 0 windows open and Firefox will still send "main" pings, which contributes a large number of low-value (0/NULL) counters and denominators. When I redo the sample I'll filter them out like so:
- Reject any ping with 0 or `NULL` in all denominators and counters. (See the sketch at the end of this comment.)

That'll help with both the Mac cohort difference and should make `MEDIAN` and `AVG` less far apart between ping scheduling philosophies. Not that they've been particularly far apart between the two systems in aggregate (Windows domination is real), but I'll take it.
Remaining tasks: improving the sample per the aforementioned filters (and I'll add the fourth week to make it the 28 days the process asks for), rerunning the analyses, then some actual formal reporting. I'm gonna spoil things here by saying that I haven't seen anything weird or even especially far apart except the outliers: I think this is going to be a boring report. Which is what we want!
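(( A sketch of that filter, with illustrative column names; the real sample tables have many more counter columns to fold in. ))

```sql
-- Drop pings that carry no information: every counter and denominator is 0 or NULL.
-- (Assumes the earlier filter already rejected negative values.)
SELECT *
FROM `mozdata.tmp.chutten_glean_use_counters`
WHERE COALESCE(docs_destroyed, 0)
    + COALESCE(top_level_docs_destroyed, 0)
    + COALESCE(width_doc, 0)
    + COALESCE(width_page, 0)  -- ...and so on for the rest of the sampled columns
    > 0
```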
Comment 13 • 2 years ago (Assignee)
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
> (In reply to Emilio Cobos Álvarez (:emilio) from comment #5)
> > I haven't looked at worker use counters, so I don't know. Maybe Andrew does?
>
> It doesn't seem like anyone on my team has really looked at / used these. My personal use cases have generally involved looking at Chrome use counters to better understand adoption in the wild of (new) APIs that we don't (yet) implement (and for which we haven't added any use counters).
>
> That said, it does seem important to have the number of workers / their types as a denominator for those use counters.
What this means for the analysis is that we won't have an "old" dataset to compare to. The process says that, in this sort of situation, what we'll be validating against is our expectations plus any other external data we can find that we trust. ChromeStatus doesn't seem to have worker stuff, so I guess we'll just have to do our best. To make our lives easier, I'll avoid cohorting them and just look at them overall.
Comment 14 • 2 years ago (Assignee)
The Use Counters Data Validation Report is now ready for your review. It is 16 pages of formatting, whitespace, data visualizations, and some words. Please read it, supply commentary inline or here in the bug, and signify here in the bug your acceptance or rejection of its findings (that both systems are equivalently useful) and its recommendation (that we remove Telemetry, the more expensive one).
If it helps, other data validations I've been adjacent to have had results like "Yeah, seems right: let's go ahead" and "We should look into X at some point, but that shouldn't block the move." and "We need to have X explained before we feel safe making the change".
I'm available for questions here and in all the usual places. I appreciate your attention.
Comment 15 • 2 years ago
I'm curious about the doc use counter differences shown, e.g., for the `width` property. In particular, do they only apply to `_doc` counters? `_page` counters seem a lot more similar. If so, why?

I'm not super concerned if they are doc-specific, since doc counters are a bit skewed downwards due to `about:blank` on iframes, and thus we should ~always use page use counters to determine our decisions.
I think with that understood (or a theory for why that happens), the rest looks pretty good.
Comment 16 • 2 years ago (Assignee)
To make sure we're talking about the same things, I'm taking this to mean how the relative differences of `width` statistics for documents are larger than those for pages. Specifically, though the page mean is within 2% and the median within 5%, the document mean differs by 17-20% and the median by 14-18%, with Glean's lower than Telemetry's.

This means that Glean sees lower per-ping proportions of docs using `width`. This pattern of Glean's mean and median being lower than Telemetry's, especially in doc, holds across all "normal" (page/doc-type) use counters: common or rare, CSS or Deprecated API. You are right to call this out, and I really ought to have gone into this in the report (I'll be adding a section on it after we're through here).
From the numbers alone we can conclude Glean is either reporting fewer documents with "width", or more documents without it. When we look into it we find that Glean is noticeably less likely to send a ping with a number of documents destroyed between 0-4. But it's not missing these, it's just sending them later, as we can see from Glean being noticeably more likely to send pings with higher numbers of documents destroyed.
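(( For the curious, the check behind that observation is roughly a query like the one below, comparing the share of pings by number of documents destroyed in each sample; the column names are illustrative stand-ins. ))

```sql
-- Share of pings by number of content documents destroyed, per system.
SELECT 'glean' AS system, docs_destroyed,
  COUNT(*) / SUM(COUNT(*)) OVER () AS share_of_pings
FROM `mozdata.tmp.chutten_glean_use_counters`
GROUP BY docs_destroyed
UNION ALL
SELECT 'legacy' AS system, docs_destroyed,
  COUNT(*) / SUM(COUNT(*)) OVER () AS share_of_pings
FROM `mozdata.tmp.chutten_telemetry_use_counters`
GROUP BY docs_destroyed
ORDER BY docs_destroyed, system
```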
I have a theory that this is due to documents being destroyed between when the "use-counters" ping is assembled (during `AppShutdown::AppShutdownConfirmed`, which is shutdown phase 1) and when the "main" ping is assembled (during `AppShutdown::AppShutdownTelemetry`, which is shutdown phase 6). Since the Glean SDK persists data between runs, those destroyed documents aren't lost, so we see them coming in on the next "use-counters" ping.
This is only a theory because I can't seem to reproduce this on my self-built Firefox on Linux. Maybe I'm not building it with the appropriate configuration, maybe my build machine is too fast... Emilio, do you know whether this is a case that's known to happen?
(( Alternatively, these low-count `docs_destroyed` "main" pings might be Telemetry sending early "main" pings in sessions where Glean isn't given a chance to send its "use-counters" ping. But since we're getting the right overall totals (even on a per-day basis) and pages aren't nearly as affected, I find this to be less likely. ))
I don't believe this affects our ability to accept or reject the data validation, though, since we have clear signal that the populations are correctly able to send correct totals and page data (the more important data) is similar even down to the distribution of samples across pings.
Comment 17 • 2 years ago
(In reply to Chris H-C :chutten from comment #16)
> To make sure we're talking about the same things, I'm taking this to mean how the relative differences of `width` statistics for documents are larger than those for pages. Specifically, though the page mean is within 2% and the median within 5%, the document mean differs by 17-20% and the median by 14-18%, with Glean's lower than Telemetry's.
Correct.
> I have a theory that this is due to documents being destroyed between when the "use-counters" ping is assembled (during `AppShutdown::AppShutdownConfirmed`, which is shutdown phase 1) and when the "main" ping is assembled (during `AppShutdown::AppShutdownTelemetry`, which is shutdown phase 6). Since the Glean SDK persists data between runs, those destroyed documents aren't lost, so we see them coming in on the next "use-counters" ping.

I see, so if this is true, the Glean data would be a bit more precise, right?

> This is only a theory because I can't seem to reproduce this on my self-built Firefox on Linux. Maybe I'm not building it with the appropriate configuration, maybe my build machine is too fast... Emilio, do you know whether this is a case that's known to happen?

It's not unexpected that a document can be destroyed between those shutdown phases... especially long-lived documents like extension background pages. Maybe you can reproduce if you back out bug 1874776?

> I don't believe this affects our ability to accept or reject the data validation, though, since we have clear signal that the populations are correctly able to send correct totals and page data (the more important data) is similar even down to the distribution of samples across pings.

Yeah, that's fair. Given that, I'm happy to sign off on removing the legacy telemetry. Especially true if your theory is confirmed via backing out bug 1874776 or so :)
Thanks again for doing all this work, and for bearing with me :)
Comment 18 • 2 years ago (Assignee)
No worries. Questions like these ensure we all understand what's going on, and ensure we're being thorough.
> I see, so if this is true, the Glean data would be a bit more precise, right?
They're actually both as precise as each other on the scale of a day, as the "total" stat shows. If my hypothesis is correct, Glean-sent data in our datasets for docs destroyed might instead be a little slower (having to wait until the next ping to be sent).
> Especially true if your theory is confirmed via backing out bug 1874776 or so :)
Alas.
Even with reverting bug 1874776 I was unable to get any content document to be destroyed (and log its destruction via `printf_stderr`) after phase 0. I just can't get it to happen no matter what. I tried shutting down quickly, letting it sit for a while first, loading some light pages (mozilla.org, example.com), loading some heavy pages (cnn.com, youtube.com)... nothing made it happen any later. I feel like I'm missing something.

So I tried a different tactic: looking at the data. Since bug 1874776 has been around for a week, I took a look at data from Nightly to see if its change adjusted the proportion of documents destroyed per "use-counters" ping and... it didn't, or at least not in a way that affected the 0-4 proportions. The biggest change I see is a decrease in high numbers of docs destroyed, but as those are most likely from extremely long sessions, there may just not have been enough time since build `20240118095536` for sessions to have grown that long.
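(( Roughly the shape of that check, for the record: split Nightly "use-counters" pings at the first build containing the change and compare the distributions. The field paths and metric name here are from memory and may not be exact. ))

```sql
-- Distribution of docs destroyed per "use-counters" ping, before vs. after the fix reached Nightly.
WITH per_ping AS (
  SELECT
    IF(client_info.app_build >= '20240118095536', 'after', 'before') AS cohort,  -- assumed field path
    metrics.counter.use_counter_content_documents_destroyed AS docs_destroyed    -- assumed metric name
  FROM `mozdata.firefox_desktop.use_counters`
  WHERE normalized_channel = 'nightly'
)
SELECT
  cohort,
  docs_destroyed,
  COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY cohort) AS share_of_pings
FROM per_ping
GROUP BY cohort, docs_destroyed
ORDER BY cohort, docs_destroyed
```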
What this means is that bug 1874776 had no significant effect on the values reported per-"use-counters" ping of docs destroyed.
It was possible that bug 1874776 made Telemetry-sent docs instead align with Glean's. Unlikely, and a pain to do the join, but I couldn't just let this go so I ran it again and, no, it didn't make Telemetry's report look like Glean's. What it did do was flatten out the proportions of pings received with 4-6 docs destroyed. But this is a small sample I'm operating on, so that might be noise (like the peaks at 12, 13, 16). Anyway.
What this means is, at least, there's a reason I didn't see this behaviour change between having and reverting the change: the change didn't have that effect in the first place! So that's something new we learned.
Anyhoo, if you're happy and I'm happy, I guess all we need to ask is whether :zcorpan is happy : D (and, if so, we can then get to the satisfying business of tearing Telemetry use counters out of Firefox Desktop and saving us a lot of GCP money).
Comment 20 • 2 years ago (Assignee)
Excellent! Then by the process we now call this validation "Accepted" and can move on to post-validation work: removing Telemetry's implementation of use counter probes.
And I happen to have an old patch that might rebase cleanly if I'm lucky. I'll file a separate bug to track.