When using GeckoView, legacy telemetry probes do not seem to be reported for non-parent processes
Categories
(Toolkit :: Telemetry, defect, P4)
People
(Reporter: bas.schouten, Unassigned)
I came to this conclusion after observing a number of the recorded metrics (and weirdly low client counts in GLAM). I developed a theory that we only ever set the delegate that forwards our telemetry in the parent process.
I asked csadilek to sanity check my theory and he came to the same conclusion. This would explain a number of abnormalities we are dealing with for our Android telemetry data.
Comment 1•1 year ago
For additional context, this is the related slack thread. Here are some relevant pieces:
I'd say https://searchfox.org/mozilla-central/source/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/RuntimeTelemetry.java#118 only gets called for the main Gecko process, and not for any of the child processes. If that is correct, the child processes would hit https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/geckoview/streaming/GeckoViewStreamingTelemetry.cpp#177 and never submit their telemetry.
I believe that codepath is only called from here: https://searchfox.org/mozilla-central/source/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/GeckoRuntime.java#468 which, afaict (though I may be very wrong), means it's only done for the parent process.
According to the thread, performance_time_response_start is an affected metric:
is a measurement we take every time a page loads. [data] suggests there was only 1 page loaded on Fenix release in the last 30 days
From what I can tell, this is using the deprecated GeckoView Streaming mechanism and not Glean via FOG.
@Chris, @Jan-Erik, can either of you take a look?
Comment 2•1 year ago
So I did some testing and this is what I found. I took the following probe to test: TIME_TO_RESPONSE_START.
1. I added a quick log here in PerformanceTiming.cpp
2. I don't see any events in my debug Glean viewer
FYI I used this command:
adb shell am start -n org.mozilla.fenix/mozilla.telemetry.glean.debug.GleanDebugActivity --ez logPings true \
--es sendPing metrics \
--es debugViewTag test-metrics-ping
Let me know if I should add logs elsewhere / or check something else to verify.
Comment 3•1 year ago
Goodness, this was from four years ago. Let me see what my brain can dredge up...
GeckoView Streaming Telemetry only operates in the parent process because that's where Telemetry stores the data. Just before the histogram gets the sample, we check if we're in geckoview streaming mode and, if so, pass the histogram name and the sample to GVST. This is why we only see a delegate being registered on the parent process: GV doesn't give a whit about the process model underneath. The data needs to use Telemetry's IPC to find its way to the correct process.
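To make that routing concrete, here is a minimal, self-contained sketch of the pattern described above. Every name in it is a hypothetical stand-in, not the real Gecko code (the real paths live under toolkit/components/telemetry); it only illustrates the "check streaming mode just before storing the sample, and hand it to GVST instead" step.

// Hypothetical sketch: how a histogram sample gets diverted to GVST in the
// parent process when GeckoView streaming mode is enabled.
#include <cstdint>
#include <iostream>
#include <string>

static bool gGeckoViewStreamingMode = true;  // stand-in for the real pref check

// Stand-in for GeckoViewStreamingTelemetry: batches samples for the Java delegate.
static void StreamHistogramAccumulation(const std::string& aName, uint32_t aSample) {
  std::cout << "GVST <- " << aName << " = " << aSample << "\n";
}

// Stand-in for the parent-process accumulation path.
static void HistogramAdd(const std::string& aName, uint32_t aSample) {
  if (gGeckoViewStreamingMode) {
    // In streaming mode the histogram name and sample are handed to GVST,
    // whose delegate is only ever registered in the parent process.
    StreamHistogramAccumulation(aName, aSample);
    return;
  }
  // ...otherwise the sample would be stored in the local histogram as usual.
}

int main() {
  HistogramAdd("TIME_TO_RESPONSE_START_MS", 250);
  return 0;
}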
Which -- finding its way across Telemetry IPC -- it ought to be able to do. Using the original question from the Slack thread as a concrete example: performance.time.response_start (aka TIME_TO_RESPONSE_START) is accumulated to on nearly every pageload as part of PerformanceTiming, and so might reasonably be expected to be in a lot of pings. Not all of them -- "metrics" pings are sent at a variety of times for a variety of reasons, not all of which might have had a qualifying pageload in them -- but certainly more than 1.4% of them on a given day. And the histogram definition allows it to be recorded on all channels on parent and content processes while geckoview_streaming is doing its thing, so there aren't any of the normal stumbling blocks in the way.
One thing that's heartening is that we have evidence that GVST is working. It is faithfully reporting that over 75% of clients reporting on a given day have launched a content process (which seems reasonable given the variety of reasons we might hear from a user on a given day without them launching a content process: they were using one from a previous day, they were only launching the browser to use as a password manager, there was a notification, etc etc). And, for the metric we have a direct comparison for (GVST's wr.scenebuild_time, which is a direct copy of Firefox Desktop Glean's gfx.webrender.scenebuild_time (bug 1781109)), the agreement is total: both show over half of the client population building a scene in webrender.
So there might be something weird going on with Telemetry IPC, but then there might not, since wr.scenebuild_time is supplied by WR_SCENEBUILD_TIME, which appears to be collected on the gpu process. (Though I don't know if Fenix has a gpu process. But if it didn't, I'd expect the gpu tasks to be done on the content process, which that histogram is not allowed to record on... but I digress).
Anyway. This all twigged a memory: the Glean Team sitting in a conference room in Berlin working out exactly how all this stuff was put together and worked... didn't we check this? Didn't this used to work? It did! performance.time.response_start used to be reported by like half of the Fenix population... until June of 2021.
Or, more precisely, until Firefox 89, where the rate dropped to < 1%. Actually, that per-version plot is really neat. It shows the proportion of reporting clients to have been circling 20-30% up to 87, then in 88 we were hearing from like half. Then 89+ we're down to 1% maybe 2% tops on a good day.
Something landed in Fx88 to make this spike, then again in Fx89 to make it fall.
No changes landed to Telemetry IPC or GVST in Fx88. The only change to Telemetry IPC code in Fx89 was this one from bug 1685406 which only makes a struct a little easier to reach.
Now, PerformanceTiming did appear to have quite a lot of fun around those versions (see bug 1701029 and friends). So maybe there's something a little odd going on there? Bas, do you know the story of that code?
Conclusion (so far): Evidence is weakening that this is a Telemetry fault, either in GVST or in IPC or otherwise. Other metrics are reporting as expected through all those systems. It suggests maybe we found an instrumentation fault in TIME_TO_RESPONSE_START_MS that manifests with particular frequency in GVST contexts.
Reporter
Comment 4•1 year ago
(In reply to Chris H-C :chutten from comment #3)
> Goodness, this was from four years ago. Let me see what my brain can dredge up...
> GeckoView Streaming Telemetry only operates in the parent process because that's where Telemetry stores the data. Just before the histogram gets the sample, we check if we're in geckoview streaming mode and, if so, pass the histogram name and the sample to GVST. This is why we only see a delegate being registered on the parent process: GV doesn't give a whit about the process model underneath. The data needs to use Telemetry's IPC to find its way to the correct process.
So then the problem is likely in the data getting to the parent.
> One thing that's heartening is that we have evidence that GVST is working. It is faithfully reporting that over 75% of clients reporting on a given day have launched a content process (which seems reasonable given the variety of reasons we might hear from a user on a given day without them launching a content process: they were using one from a previous day, they were only launching the browser to use as a password manager, there was a notification, etc etc). And, for the metric we have a direct comparison for (GVST's wr.scenebuild_time, which is a direct copy of Firefox Desktop Glean's gfx.webrender.scenebuild_time (bug 1781109)), the agreement is total: both show over half of the client population building a scene in webrender.
Yes, GVST is working fine for the parent, which is where content processes are launched from :-).
> So there might be something weird going on with Telemetry IPC, but then there might not, since wr.scenebuild_time is supplied by WR_SCENEBUILD_TIME, which appears to be collected on the gpu process. (Though I don't know if Fenix has a gpu process. But if it didn't, I'd expect the gpu tasks to be done on the content process, which that histogram is not allowed to record on... but I digress).
Glean telemetry uses completely different IPC, so things like wr.scenebuild_time should work fine. Fwiw, we can look at a probe like KEYPRESS_PRESENT_LATENCY to see something interesting: a certain percentage of our clients don't have a GPU process (for various reasons), and we see a certain percentage of our clients actually reporting that probe (presumably when it's collected in the parent process).
> Anyway. This all twigged a memory: the Glean Team sitting in a conference room in Berlin working out exactly how all this stuff was put together and worked... didn't we check this? Didn't this used to work? It did! performance.time.response_start used to be reported by like half of the Fenix population... until June of 2021.
Isn't this just when we moved more clients to use content processes?
> Now, PerformanceTiming did appear to have quite a lot of fun around those versions (see bug 1701029 and friends). So maybe there's something a little odd going on there? Bas, do you know the story of that code?
I can point to a bunch of other content-process and gpu-process probes that have similar issues, so I doubt looking into PerformanceTiming specifically is very useful. Note that Marc also verified empirically that things are being reported to legacy telemetry but not to Glean in https://bugzilla.mozilla.org/show_bug.cgi?id=1850847#c2. If you believe he did something wrong there, let him know and we can re-test.
Reporter
Comment 5•1 year ago
Note a probe that has similarly weirdly low client numbers: performance_responsiveness_req_anim_frame_callback (or the pageload variant), which is also collected in the content process and isn't anywhere near the PerformanceTiming API: https://searchfox.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp#2422
Perhaps also relevant: these all appear in roughly the same number of pings. Not sure why it would work for such a small number of folks.
https://glam.telemetry.mozilla.org/firefox/probe/perf_request_animation_callback_non_pageload_ms/explore?channel=release&currentPage=1&timeHorizon=ALL&visiblePercentiles=%5B95%2C50%2C5%5D shows us the vast majority of desktop clients report this probe.
Comment 6•1 year ago
I wasn't sure if Fenix used the gpu process for WR (because, if so, the example of the two scenebuild metrics suggests that Telemetry IPC is working just fine, because otherwise the GVST-fueled wr.scenebuild_time would suffer the same problem that performance.time.response_start does). The reason I looked at WR scenebuild is because it uses FOG APIs in its home process (presumed to be gpu), which then forks to both home-process Telemetry APIs (via GIFFT, to populate WR_SCENEBUILD_TIME) and home-process Glean SDK APIs (by a straight line, to populate gfx.webrender.scenebuild_time). The home-process Telemetry API then, if the home process isn't the parent, uses Telemetry IPC to get the data to the parent process, where it reaches parent-process Telemetry APIs, which use GVST to get it to GV and Fenix, which use their Glean APIs to get it into wr.scenebuild_time.
Or, graphically, something like this. Despite the rather more straightforward path for time to response start, both use Telemetry IPC identically.
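The attached diagram isn't reproduced in this text, but based on the path described above, the two routes look roughly like this (the gpu placement is the presumption stated above, not a confirmed fact):

gfx.webrender.scenebuild_time:
  [gpu?] FOG API -> [gpu?] Glean SDK -> gfx.webrender.scenebuild_time (direct)
  [gpu?] FOG API -> (GIFFT) [gpu?] Telemetry: WR_SCENEBUILD_TIME -> (Telemetry IPC) -> [parent] Telemetry -> GVST -> GV -> Fenix Glean: wr.scenebuild_time

performance.time.response_start:
  [content or parent] Telemetry: TIME_TO_RESPONSE_START_MS -> (Telemetry IPC) -> [parent] Telemetry -> GVST -> GV -> Fenix Glean: performance.time.response_start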
But if it's all in the Parent Process, then I guess it's moot.
(In reply to Bas Schouten (:bas.schouten) from comment #4)
> I can point to a bunch of other content-process and gpu-process probes that have similar issues, so I doubt looking into PerformanceTiming specifically is very useful. Note that Marc also verified empirically that things are being reported to legacy telemetry but not to Glean in https://bugzilla.mozilla.org/show_bug.cgi?id=1850847#c2. If you believe he did something wrong there, let him know and we can re-test.
Starting the browser and sending the "metrics" ping will send whatever was in the "metrics" ping's storage at startup: which might not contain anything of interest if there's not been any page loads since the previous "metrics" ping. To narrow things down, I'd recommend the following STR: start Fenix and send a "metrics" ping, perform a page load, start Fenix and send a "metrics" ping again. Even if there's no performance.time.response_start in it, there should still be a "metrics" ping with other stuff in it showing up in the Debug Ping Viewer.
Alrighty, let's look at PERF_REQUEST_ANIMATION_CALLBACK_NON_PAGELOAD_MS, which is GVST'd in Fenix to performance.responsiveness.req_anim_frame_callback. Without a direct Glean analogue I can't positively confirm that Fenix runs this code particularly frequently (in the way I can with gfx.webrender.scenebuild_time), but I can confirm that it shows the exact same curve as time to response start.
Which does point to a systemic fault.
(( Again, if scenebuild is actually being recorded on gpu, then I'm very confused... because scenebuild_time is doing great. ...maybe it's a content-specific thing? Maybe parent and gpu are fine, but content for whatever reason is having a hard time? ))
Telemetry IPC is very boring. Every 2s it buffer-swaps any child process data onto IPC to be received by the parent process. For content processes it's using PContent in as obvious a way as we could make it (and the other P* are carbon copies). The only check before throwing it on IPC is whether you're allowed to record on this process (which time to response start and animation callback non pageload are both permitted to do).
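For illustration, here is a minimal, self-contained sketch of that buffer-swap-and-send pattern. All names are hypothetical stand-ins, not the real PContent/Telemetry IPC code; it only shows the "accumulate into a child-side buffer, swap it out every ~2s, and ship the batch to the parent" shape described above.

// Hypothetical sketch of child-to-parent Telemetry batching.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Accumulation {
  std::string histogramName;
  uint32_t sample;
};

static std::vector<Accumulation> gPendingAccumulations;  // child-side buffer

// Stand-in for recording a histogram sample in a child process.
static void AccumulateChild(const std::string& aName, uint32_t aSample) {
  // The only gate is whether this histogram may record in this process.
  gPendingAccumulations.push_back({aName, aSample});
}

// Stand-in for the ~2s timer callback that flushes the batch to the parent.
static void SendBatchToParent() {
  std::vector<Accumulation> batch;
  batch.swap(gPendingAccumulations);  // buffer swap: the child keeps accumulating
  // In Gecko this would be an IPC send on PContent (or its equivalent for
  // other process types); here we just print what would be sent.
  std::cout << "sending " << batch.size() << " accumulations to the parent\n";
}

int main() {
  AccumulateChild("TIME_TO_RESPONSE_START_MS", 250);
  AccumulateChild("PERF_REQUEST_ANIMATION_CALLBACK_NON_PAGELOAD_MS", 4);
  SendBatchToParent();
  return 0;
}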
Is there anything unique about content processes on Fenix?
Comment 7•1 year ago
> Though I don't know if Fenix has a gpu process.
It does, yes
Reporter
Comment 8•1 year ago
(In reply to [:owlish] 🦉 PST from comment #7)
>> Though I don't know if Fenix has a gpu process.
> It does, yes
It's a little more complicated than that, as in some cases it does, and in others it doesn't. So probes which may be reported in the GPU process but may also be reported in the main process are a little bit more difficult to assess.
Reporter
Comment 9•1 year ago
It should be noted that Jamie Nicol did some investigation here and found GeckoView Example -seemingly- correctly moving legacy probes to the parent process. Of course, GeckoView Example doesn't actually report telemetry, so we weren't able to evaluate the full pipeline. This requires further investigation on the Fenix side.
Comment 10•1 year ago
I wonder if S4/P4 is high enough severity/priority given this blocks OKR2.6 and in general leaves us without usable pageload performance telemetry data for Android?
Comment 11•1 year ago
Part of our prioritization here is a matter of capacity - it will require significant (cross-functional) effort to fully debug the behaviour of the current instrumentation using this particular code path.
One much lower-effort option would be to add new performance instrumentation for KR 2.6 using Glean - we're confident that the data collection will work as expected and be available in Glam for visualizing performance. Would this be an option?
Comment 12•9 months ago
After a lot of digging, we found the problem! (Hopefully I explain it correctly):
We never start with telemetry set to true on the content process. IS_UNIFIED_TELEMETRY is always false on Android, so this if statement is always skipped. Then, we check isTelemetryEnabled and that also returns false. That is set through SetupTelemetry, found here, which gets its value from TelemetryPrefValue(), found here. However, MOZ_TELEMETRY_ON_BY_DEFAULT is false in release or beta, so we go through the else block, which checks if the product contains the beta string. So, we don't see anything happening on release because of that.
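To summarize that chain, here is a rough, self-contained paraphrase in C++. The names and structure are simplified stand-ins, not the actual code (the real logic is spread across the Telemetry and Preferences code linked above); it only mirrors the decision path described in this comment.

// Hypothetical paraphrase of why content-process telemetry ends up disabled
// on Android release builds.
#include <iostream>
#include <string>

static const bool IS_UNIFIED_TELEMETRY = false;        // always false on Android
static const bool MOZ_TELEMETRY_ON_BY_DEFAULT = false; // false on release/beta builds

// Stand-in for TelemetryPrefValue(): decides the default of the telemetry pref.
static bool TelemetryPrefDefault(const std::string& aProduct) {
  if (MOZ_TELEMETRY_ON_BY_DEFAULT) {
    return true;
  }
  // Otherwise, only products whose name contains "beta" get telemetry on.
  return aProduct.find("beta") != std::string::npos;
}

static bool ShouldEnableChildTelemetry(const std::string& aProduct) {
  if (IS_UNIFIED_TELEMETRY) {
    // Unified telemetry would enable recording here, but this branch is
    // always skipped on Android.
    return true;
  }
  // Non-unified fallback: follow the pref, whose default comes from above.
  return TelemetryPrefDefault(aProduct);
}

int main() {
  // On Fenix release the product name doesn't contain "beta", so this prints false.
  std::cout << std::boolalpha << ShouldEnableChildTelemetry("fenix-release") << "\n";
  return 0;
}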
We were able to reproduce this by cloning the release branch, building geckoview_example, and changing the if(IS_UNIFIED_TELEMETRY) to if(true), which "forced" the telemetry pref to be set to true.
Comment 13•9 months ago
Thank you for your and Bas' exploration!
Somehow, Fenix inherited the long-unsupported non-unified Legacy Telemetry path by pref, and when we built and tested GVST we didn't notice. (My theory as to why is because GVST was implemented before we had a release-channel population in Fenix, so there was no way to notice.) This operational mode of Telemetry listens to the toolkit.telemetry.enabled pref (which doesn't do what it says, except in non-unified mode), which is set by core Preferences module code exactly as you describe. Docs on this can be found here.
What do we do about this? First and foremost I'll say that we need to get these GVST metrics onto direct Glean as soon as we can: non-unified Telemetry is an unsupported operation mode and I want this to be the final thing we have to learn about it. Perry's begun the work in bug 1875170, so this is how we intend to complete the fix.
The question before us is if this will be fast enough for our uses. I'm inclined to be risk-averse when it comes to changing Legacy Telemetry (especially in Fenix, especially in non-unified mode) and thus recommend "move GVST metrics to direct Glean" as the solution. It's been tried and tested, and we're happy to lend a hand.
Marc, will that serve? Are there particular metrics we should prioritize? There are only about 100 of them, so we ought to be able to get to them all this cycle if reviews are prompt, but if there's a preferred order, we might as well take that into account.
Reporter
Comment 14•9 months ago
(In reply to Chris H-C :chutten from comment #13)
> Thank you for your and Bas' exploration!
> Somehow, Fenix inherited the long-unsupported non-unified Legacy Telemetry path by pref, and when we built and tested GVST we didn't notice. (My theory as to why is because GVST was implemented before we had a release-channel population in Fenix, so there was no way to notice.) This operational mode of Telemetry listens to the toolkit.telemetry.enabled pref (which doesn't do what it says, except in non-unified mode), which is set by core Preferences module code exactly as you describe. Docs on this can be found here.
> What do we do about this? First and foremost I'll say that we need to get these GVST metrics onto direct Glean as soon as we can: non-unified Telemetry is an unsupported operation mode and I want this to be the final thing we have to learn about it. Perry's begun the work in bug 1875170, so this is how we intend to complete the fix.
> The question before us is if this will be fast enough for our uses. I'm inclined to be risk-averse when it comes to changing Legacy Telemetry (especially in Fenix, especially in non-unified mode) and thus recommend "move GVST metrics to direct Glean" as the solution. It's been tried and tested, and we're happy to lend a hand.
> Marc, will that serve? Are there particular metrics we should prioritize? There are only about 100 of them, so we ought to be able to get to them all this cycle if reviews are prompt, but if there's a preferred order, we might as well take that into account.
If your team has time to do this in the 124 timeframe, I think that's fine. If not, I think we should land a fix ASAP, preferably uplifted to 123 if possible. We have huge gaps right now in our knowledge of user behavior on Fenix, and considering the importance placed on the browser fundamentals, I think this is hugely problematic.
As far as preferences go: I'm not sure. I could give my preferences, but they'd probably be pretty biased. The folks working on GC or Media probably care about very different probes than I do.
Comment 15•9 months ago
Hup, I made bug 1877836 but I should probably make it clear here too that, yes, we do intend on fixing this all in 124 by getting these metrics off of GVST. Might even uplift the early ones as we go.
Comment 16•9 months ago
We've got all the metrics off of GVST, so this is now done.