Leaks when switching SERPTelemetry.sys.mjs uninit calls to use category manager mechanism when running browser_search_glean_serp_event_telemetry_categorization_enabled_by_pref.js
Categories
(Firefox :: Search, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox138 | --- | fixed |
People
(Reporter: Gijs, Assigned: jteow)
References
Details
(Whiteboard: [sng])
Attachments
(3 files)
This is pretty bemusing.
The patches in bug 1916424 move a bunch of init/uninit calls to happen via a generic mechanism using the category manager.
Before the patch, a loop like this:
for (let task of tasks) {
try {
task();
} catch (ex) {
console.error(`Error during quit-application-granted: ${ex}`);
if (Cu.isInAutomation) {
// This usually happens after the test harness is done collecting
// test errors, thus we can't easily add a failure to it. The only
// noticeable solution we have is crashing.
Cc["@mozilla.org/xpcom/debug;1"]
.getService(Ci.nsIDebug2)
.abort(ex.filename, ex.lineNumber);
}
}
}
runs tasks, and the relevant tasks look like this:
() => lazy.SearchSERPTelemetry.uninit(),
() => lazy.SearchSERPCategorization.uninit(),
Post-patch, at the same time where we would normally run the loop over these tasks, we first call a bunch of consumers via the category manager. The loop for that looks like this:
callModulesFromCategory(
{
categoryName,
profilerMarker = "",
idleDispatch = false,
failureHandler = null,
},
...args
) {
// Use an async function for profiler markers and error handling.
// Note that we deliberately don't await at the top level, so we
// can guarantee all consumers get run/queued.
let callSingleListener = async fn => {
let startTime = profilerMarker ? Cu.now() : 0;
try {
await fn(...args);
} catch (ex) {
console.error(
`Error in processing ${categoryName} for ${fn._descriptiveName}`
);
console.error(ex);
try {
await failureHandler?.(ex);
} catch (nestedEx) {
console.error(`Error in handling failure: ${nestedEx}`);
// Crash in automation:
if (BrowserUtils._inAutomation) {
Cc["@mozilla.org/xpcom/debug;1"]
.getService(Ci.nsIDebug2)
.abort(nestedEx.filename, nestedEx.lineNumber);
}
}
}
if (profilerMarker) {
ChromeUtils.addProfilerMarker(
profilerMarker,
startTime,
fn._descriptiveName
);
}
};
for (let listener of lazy.CatManListenerManager.getListeners(
categoryName
)) {
if (idleDispatch) {
ChromeUtils.idleDispatch(() => callSingleListener(listener));
} else {
callSingleListener(listener);
}
}
This is a bit more convoluted because it's more generic code that gets used in other places. But fundamentally it still ends up in an outer-non-async function that calls callSingleListener which eventually invokes and returns the result of calling the uninit method on SearchSERPTelemetry resp. SearchSERPCategorization. There is an await in there purely to ensure we log errors from async rejections/failures (but no errors appear in failing logs, AFAICT, despite other logging appearing).
If I'm reading my own code right (lol) then I don't think this ultimately awaits on the inner task code (and neither did the original code), so there shouldn't be any timing differences.
So I'm pretty lost as to why this is causing leaks.
Two things I did notice:
- we do not appear to have a shutdown blocker for the async and sqlite related code in this module. That seems potentially problematic, and possibly fodder for a separate bug.
- the test toggles the enabled state of this code, which runs a pref observer in the module, which will call init/uninit - but there is nothing outside the test guaranteeing the timing of those (async) calls. But then, the test does wait for some observer topics so I wouldn't have expected it to be an issue in this test, just something that is potentially fragile...
The one other random hypothesis I have is that somehow doing:
if (!Object.hasOwn(this.cachedModules, module)) {
this.cachedModules[module] = ChromeUtils.importESModule(module);
}
(inside lazy.CatManListenerManager.getListeners) is not causing us to avoid importing the module twice (as both objects are from the same module) and that is somehow messing things up in a way that other modules' loads haven't. But I... can't see it right now? Plus importESModule is supposed to do the Right Thing anyway.
Both before and after the change, the uninit code logs in these test runs, so I don't think there's any meaningful difference in if/when this runs?
I'm going to try to land the patches in bug 1916424 without these bits so I can unblock some other work, and then it'll also be easier to repro/check what is going on here, I guess. Andrew, sorry to bother you again with this, but I don't suppose with this narrowing down there's anything obvious that sticks out to you, or that you have a suggestion on where to start digging into this more? Your description of "You can also use CC/GC logs to brute force investigate this" sounded scary. 😅
| Reporter | ||
Comment 1•8 months ago
|
||
The ones from the Normandy test suite in bug 1859188 aren't quite perma but are frequent and also look "interesting" in how they're associated with a specific round of tests.
Comment 2•8 months ago
|
||
One thing I noticed when taking a scan over your patches is that there is a behaviour change from the old code as to when the importESModule() call is being made for the lazy tasks from browser-idle-startup. Previously, these modules were declared as lazy module getters on the lazy object, and accessed from within the lazy startup tasks, meaning that they wouldn't be loaded until idle.
Under the new code path, the modules are now synchronously loaded during CatManListenerManager.getListeners() (https://searchfox.org/mozilla-central/rev/ae4d3bc2af37eef804113b621455883a92a29e9c/toolkit/modules/BrowserUtils.sys.mjs#38-40), which happens outside of the ChromeUtils.idleDispatch(...) call (https://searchfox.org/mozilla-central/rev/ae4d3bc2af37eef804113b621455883a92a29e9c/toolkit/modules/BrowserUtils.sys.mjs#519-527). Perhaps some of these modules did not expect to be loaded until idle, and so are acting in a strange way leading to leaks when being loaded early.
It should be relatively straightforward to test this theory out, the easiest way is probably to make the loading of the module lazy again by moving the import to only happen when the fn is called here: https://searchfox.org/mozilla-central/rev/ae4d3bc2af37eef804113b621455883a92a29e9c/toolkit/modules/BrowserUtils.sys.mjs#38-42. Could be as simple as this, though you could also get fancier with caching and stuff (doesn't look ottomh like these are called in tight loops right now though?):
let [objName, method] = value.split(".");
let fn = (...args) => ChromeUtils.importESModule(module)[objName][method](...args);
fn._descriptiveName = value;
| Reporter | ||
Comment 3•8 months ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #2)
One thing I noticed when taking a scan over your patches is that there is a behaviour change from the old code as to when the
importESModule()call is being made for the lazy tasks frombrowser-idle-startup. Previously, these modules were declared as lazy module getters on thelazyobject, and accessed from within the lazy startup tasks, meaning that they wouldn't be loaded until idle.
I thought I fixed this in one of those patches. See in particular https://phabricator.services.mozilla.com/D220897#8238755 .
https://hg.mozilla.org/mozilla-central/file/75a21e462ddba79b675e2811bd85b4ba9d885d46/toolkit/modules/BrowserUtils.sys.mjs#l38 (searchfox hasn't updated at time of writing) shows that getListeners returns an array of functions. Those functions are the tasks that get run (in some cases, on idle), and although a little more convoluted than your example, they do basically boil down to:
let [objName, method] = value.split("."); let fn = (...args) => ChromeUtils.importESModule(module)[objName][method](...args); fn._descriptiveName = value;
right?
Comment 4•8 months ago
|
||
Could you please attach a patch to this bug (on top of bug 1916424) that will reproduce the issue? I want to be able to reproduce the problem locally to investigate it.
Also, the summary talks about LSan, but leaks also show up in the XPCOM leak checker.
| Reporter | ||
Comment 5•8 months ago
•
|
||
(In reply to Andrew McCreight [:mccr8] from comment #4)
Could you please attach a patch to this bug (on top of bug 1916424) that will reproduce the issue? I want to be able to reproduce the problem locally to investigate it.
I've pushed to try https://treeherder.mozilla.org/jobs?repo=try&revision=0089a421bc39366564bf9df91137a2006f3e9d4e because I'm not next to the relevant VM to verify that that patch does reproduce the issue - so feel free to wait until the jobs there reproduce the problem - but if memory serves that commit is what I did yesterday... It matches the interdiff at https://phabricator.services.mozilla.com/D220898?vs=986514&id=987504#toc .
Also, the summary talks about LSan, but leaks also show up in the XPCOM leak checker.
I'm sure I'm missing or misunderstanding something, but where does that show up in https://treeherder.mozilla.org/logviewer?job_id=495482744&repo=autoland&lineNumber=7760 ?
Comment 6•8 months ago
|
||
(In reply to :Gijs (he/him) from comment #5)
I'm sure I'm missing or misunderstanding something, but where does that show up in https://treeherder.mozilla.org/logviewer?job_id=495482744&repo=autoland&lineNumber=7760 ?
It is the Linux debug bc15 failure from that same push. ASan builds run the LSan leak checker, while debug builds run the XPCOM leak checker.
Comment 7•8 months ago
|
||
I added some logging to both uninit calls and ran browser_search_glean_serp_event_telemetry_categorization_enabled_by_pref.js.
Without the patch from your try run (non-leaking) we have this:
- SERPCategorizer uninit() runs right after the test_enable_experiment_when_pref_is_not_enabled section starts.
- It runs again right after "Turn pref off" happens in that same section.
- After the test is done, in the "TEST-START | Shutdown" section, we run the TelemetryHandler uninit, followed by the SERPCategorizer uninit.
With the patch from your try run (leaking), it looks the same EXCEPT that the TelemetryHandler uninit doesn't run at shutdown.
Comment 8•8 months ago
|
||
I also tried converting only SearchSERPTelemetry.uninit or only SearchSERPCategorization.uninit. Neither of those leaks.
| Reporter | ||
Comment 9•8 months ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #8)
I also tried converting only SearchSERPTelemetry.uninit or only SearchSERPCategorization.uninit. Neither of those leaks.
I feel dumb now. So maybe the simple issue is, the category manager for any given category treats each initial identifier as unique and because these two things are both in the same module, only one of the uninit calls registers...
| Reporter | ||
Comment 10•8 months ago
•
|
||
OK, so my thinking that either the catman didn't care about double-keying or that (notification, module uri) would be a unique enough key to not cause problems, is clearly not working out.
I don't really remember if I had originally (summer last year) realized the issue that we're running into here, but the initial design used the exported object as the key, which would have avoided this. It was changed in review to the (much more ergonomic / readable!) syntax that is there now.
I note that I haven't yet bothered to try to convert onWindowsRestored. The module initialization is split between idle and that, so the init there (https://searchfox.org/mozilla-central/rev/83e29f5ee2e301ac7224e2927bddda16634b1897/browser/components/BrowserGlue.sys.mjs#2398) doesn't conflict with the other class which is now having init called from idle via the catman path already, as the "topic" / key would be different between startup idle and onwindowsrestored. No, I don't off-hand know why they're initialized at different times, and uninitialized at the same time...
I see a few somewhat obvious paths forward:
- revisit the ordering/keying decision (there's not that many consumers yet and they'd be easy to mechanically update). We could make the
object.methodbit the key and the module the "value" part, which arguably isn't really any less readable? I'd have to remember to update theall_files_referencedtest to cater for it as well, but otherwise it'd be a pretty mechanical change. - if we think SERP is a bit of a special snowflake and the only case where this would happen, we could redesign the search SERP module so it has 1 single uninit/"quit-granted" handling method that just individually calls the two tasks. It doesn't seem particularly valuable to have them dealt with separately (though I suppose in principle the current situation is slightly more resilient to errors in one of the two modules not causing us to neglect to uninit the other one, which would need a bit of try...catch-ing inside the module to deal with).
Although I don't love another big change to how this works, the downside of the second option is that this feels like a footgun that we should address. We don't even have easy linting options for the component manifest files. So I'm leaning towards the first option. Mossop, does that seem OK or am I missing a third, better option? :-)
Comment 11•8 months ago
|
||
(In reply to :Gijs (he/him) from comment #10)
OK, so my thinking that either the catman didn't care about double-keying or that (notification, module uri) would be a unique enough key to not cause problems, is clearly not working out.
I don't really remember if I had originally (summer last year) realized the issue that we're running into here, but the initial design used the exported object as the key, which would have avoided this. It was changed in review to the (much more ergonomic / readable!) syntax that is there now.
I note that I haven't yet bothered to try to convert
onWindowsRestored. The module initialization is split between idle and that, so theinitthere (https://searchfox.org/mozilla-central/rev/83e29f5ee2e301ac7224e2927bddda16634b1897/browser/components/BrowserGlue.sys.mjs#2398) doesn't conflict with the other class which is now havinginitcalled from idle via the catman path already, as the "topic" / key would be different between startup idle and onwindowsrestored. No, I don't off-hand know why they're initialized at different times, and uninitialized at the same time...I see a few somewhat obvious paths forward:
- revisit the ordering/keying decision (there's not that many consumers yet and they'd be easy to mechanically update). We could make the
object.methodbit the key and the module the "value" part, which arguably isn't really any less readable? I'd have to remember to update theall_files_referencedtest to cater for it as well, but otherwise it'd be a pretty mechanical change.- if we think SERP is a bit of a special snowflake and the only case where this would happen, we could redesign the search SERP module so it has 1 single uninit/"quit-granted" handling method that just individually calls the two tasks. It doesn't seem particularly valuable to have them dealt with separately (though I suppose in principle the current situation is slightly more resilient to errors in one of the two modules not causing us to neglect to uninit the other one, which would need a bit of try...catch-ing inside the module to deal with).
Although I don't love another big change to how this works, the downside of the second option is that this feels like a footgun that we should address. We don't even have easy linting options for the component manifest files. So I'm leaning towards the first option. Mossop, does that seem OK or am I missing a third, better option? :-)
A vaguely recall that we did discuss this possibility and considered it to be a rare enough case that you would just do option 2 if you needed to do two things in the same module. It should be straightforward to detect this problem and warn about it, maybe even crash in debug builds in the manifest parser code.
| Assignee | ||
Comment 12•8 months ago
|
||
:Gijs is another possibility to move one of the instances that needs to un-inited (namely SERP Categorization) out from SearchSERPTelemetry and into a separate file? In some ways, it makes sense to have SERP Telemetry and SERP Categorization separate since they currently have different populations, report data to two different sources (the latter uses OHTTP and doesn't include client data), and it remains to be seen whether the latter will have long term support.
| Reporter | ||
Comment 13•8 months ago
|
||
(In reply to James Teow [:jteow] from comment #12)
:Gijs is another possibility to move one of the instances that needs to un-inited (namely SERP Categorization) out from SearchSERPTelemetry and into a separate file? In some ways, it makes sense to have SERP Telemetry and SERP Categorization separate since they currently have different populations, report data to two different sources (the latter uses OHTTP and doesn't include client data), and it remains to be seen whether the latter will have long term support.
Yes, that would also work! If this is work that (better) aligns with what you want to do anyway, that would make a lot of sense to do.
Could use this bug or a separate one - I would like to do what Mossop suggested in comment #11 though and also crash in debug builds if people register the same entry for the same category twice (as that smells like it'd be a bug either way).
| Assignee | ||
Comment 14•8 months ago
|
||
Understood, thank you. I'll use this bug as making usage of the category manager is a strong reason for doing the code refactor. I've given this a P2 as I can work on it next sprint (and thus, the next release cycle) but LMK if it needs uplifting.
Updated•8 months ago
|
| Assignee | ||
Comment 15•8 months ago
|
||
This includes some renaming for consistency and reduce redundancy.
| Assignee | ||
Comment 16•8 months ago
|
||
Depends on D240039
| Assignee | ||
Comment 17•8 months ago
|
||
Depends on D240040
Comment 18•8 months ago
|
||
Comment 19•8 months ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/0429cbfe8cb7
https://hg.mozilla.org/mozilla-central/rev/b133ea2b4ac2
https://hg.mozilla.org/mozilla-central/rev/0c27dcdb4d3e
Description
•