Intermittent leakcheck large nsGlobalWindowInner | /_mozilla/dom
Categories
(Toolkit :: Printing, defect, P5)
Tracking
()
People
(Reporter: intermittent-bug-filer, Assigned: dholbert)
References
(Blocks 2 open bugs)
Details
(Keywords: intermittent-failure)
Attachments
(2 files, 1 obsolete file)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-beta+
|
Details | Review |
Filed by: rvandermeulen [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=565314502&repo=mozilla-central&task=UUnRS3XJTKaX7ZxkPT9hcg.0
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UUnRS3XJTKaX7ZxkPT9hcg/runs/0/artifacts/public/logs/live_backing.log
Comment 1•25 days ago
|
||
The WPT test /_mozilla/dom/window-print-mozPrintCallback-async-001.html (added in Bug 1980356) triggers a leak of nsGlobalWindowInner, nsPrintJob, and related objects when it happens to be the last test in a batch. The test calls window.print() with a mozPrintCallback, and because the Bug 1980356 fix makes window.print() non-blocking in that configuration, the print job runs asynchronously after the test completes. When the browser then exits, the print job is still in-flight and a reference cycle between nsDocumentViewer, nsPrintJob, and nsPagePrintTimer prevents everything from being cleaned up.
The fix has three parts, all in the printing code:
nsPrintJob::Destroy()now explicitly stops and disconnects the page print timer before cleaning up, so any call to Destroy() breaks the reference cycle regardless of how it is reached.nsDocumentViewer::Destroy()now callsmPrintJob->Destroy()when the destroy is blocked by an in-flight print (mDestroyBlockedCount != 0). This is safe because in thewindow.print()flow, the print job lives on the static clone's document viewer, andDestroy()only reaches that viewer with a non-zero blocked count when the process is being torn down — not in normal user-facing scenarios like pressing Back.nsPrintJob::SetIsPrinting(false)now also clears mPreparingForPrint, fixing a second code path where an early IPC failure (before print layout begins) could leave mPreparingForPrint stuck at true, causingDestroy()to bail out prematurely viaCheckBeforeDestroy().
Updated•25 days ago
|
Comment 2•25 days ago
|
||
window.print() with mozPrintCallback + print.prefer_system_dialog returns
non-blocking (Bug 1980356), so the print job runs asynchronously. The
nsPagePrintTimer holds nsDocumentViewer alive via mDocViewerPrint (nsCOMPtr),
while nsDocumentViewer holds nsPrintJob (RefPtr) which holds the timer (RefPtr).
This cycle is only broken by DisconnectPagePrintTimer(). If the browser tears
down before the print completes, nsDocumentViewer::Destroy() returns early
without breaking the cycle, so everything leaks.
Three changes:
nsPrintJob::Destroy() now calls Stop() + DisconnectPagePrintTimer() before
cleaning up, so any path that destroys the print job breaks the cycle. The
redundant DisconnectPagePrintTimer() call in ~nsPrintJob() is removed since
Destroy() handles it.
nsDocumentViewer::Destroy() now calls mPrintJob->Destroy() when blocked
(mDestroyBlockedCount != 0), breaking the cycle unconditionally. This is
safe: in the window.print() static-clone flow the print job lives on the
static clone's viewer, and Destroy() is only called on that viewer when
either (a) the print completed normally, in which case DonePrintingSheets()
already decremented mDestroyBlockedCount to 0 before Destroy() runs, or
(b) the process is being torn down, in which case aborting is correct. The
"don't unload while printing" comment's Back-button concern doesn't apply
here because pressing Back on the original page calls Destroy() on the
original viewer, which has no print job -- not on the static clone's
viewer where the job lives.
nsPrintJob::SetIsPrinting(false) now also clears mPreparingForPrint. When
the IPC channel for the remote print job fails before InitPrintDocConstruct()
runs (e.g. the PDF file descriptor is invalid on macOS CI), CleanupOnFailure()
calls SetIsPrinting(false) but mPreparingForPrint was never cleared by
InitPrintDocConstruct(). Destroy() would then return early at the
CheckBeforeDestroy() gate (mPreparingForPrint == true) even though the print
was already aborted, leaving the orphaned nsPrintJob/nsPrintData/nsPrintObject
alive at process exit.
The AppShutdown::IsInOrBeyond(AppShutdownConfirmed) gate tried earlier
didn't work: the content process receives the browser-element teardown IPC
before it receives RecvShutdown (which is what sets AppShutdownConfirmed),
so the gate was never true when Destroy() was actually called.
| Comment hidden (Intermittent Failures Robot) |
Updated•19 days ago
|
Comment 4•19 days ago
|
||
Hi Frank, this is currently permafailing on Tier-1 macOS 15 jobs in CI. I tried to find a path forward here but it got increasingly hairy and I'm at a point where I think someone from the Layout team should take it over. Can you please help find that person? Thanks!
| Assignee | ||
Comment 5•19 days ago
|
||
Probably this makes sense for me to poke at, since I've been working with RyanVM on this.
A few other ideas for perhaps-more-targeted approaches than the fix approach that RyanVM+claude had going here:
(1) just break the cycle at shutdown time:
- Make
nsPagePrintTimeran xpcom-shutdown observer. - At xpcom-shutdown time, if we have an active
nsPagePrintTimer, it'll now get anObservecallback, during which it should call its ownDisconnect()method. This clears its strongmDocViewerPrintreference and decrements themDestroyBlockedCount, which will break the cycle and will allow the nsDocumentViewer and nsPrintJob to be torn down.
(2) "wait it out" - IIUC the issue here is that we're shutting down while the test has some async printing code going on in the background. Maybe we should have the test wait until that async print operation finishes. (RyanVM tried something like this with claude, but it didn't pan out because the signal never arrived to the test to declare itself done, I think. Not sure if that was just a bug in the patch though.)
(3) Annotate the leak as allowed for now. I don't think we need to worry about it and the cost/benefit of impacting user-facing behavior is iffy, iven that I suspect this is a largely test-only issue -- in particular I think the leak only happens if both of the prefs print.prefer_system_dialog and print.always_print_silent are toggled. That's unlikely as a user configuration (basically "show me the system dialog" and "don't show me any dialog at all", neither of which are user-facing configuration options).
I'm tempted to start with (3) here and file a low-priority bug on the known leak potential.
| Assignee | ||
Comment 6•19 days ago
|
||
Stepping away at the moment to attend an event, but I'll aim to do (3) later on tonight or over the weekend.
| Assignee | ||
Comment 7•18 days ago
|
||
FWIW it looks like this test is currently disabled on Windows x86 debug as well, for the same leak, as tracked in bug 2034619.
(disabling annotation added in https://hg.mozilla.org/integration/autoland/rev/b5890889f4ac / bug 2033390 comment 12)
| Assignee | ||
Comment 8•18 days ago
•
|
||
Try run for annotating an allowed leak here:
aarch64: https://treeherder.mozilla.org/jobs?repo=try&revision=acf9ab0358decd531edd97190266ecc8214ec643
Other platforms: https://treeherder.mozilla.org/jobs?repo=try&revision=6b9c32684ebf244bf2d896b5f55553c0c9cc415c
| Assignee | ||
Comment 9•18 days ago
•
|
||
As noted in bug 2034619 - for Windows x86 debug, this appears to have started on this large WPT-sync push near the end of April:
https://hg-edge.mozilla.org/integration/autoland/pushloghtml?changeset=c27a951b8452b5a86378ea76d5aa4eaab9079ec7
Here's the first-bad log (from that push):
https://treeherder.mozilla.org/logviewer?job_id=562049682&repo=autoland&task=dLBREYYxT42-SFzdsPkXaA.0&lineNumber=29414
And the last-good log (from the previous push):
https://treeherder.mozilla.org/logviewer?job_id=562049669&repo=autoland&task=XdyuMYZqSxWaVTzWPlQm8Q.0
Looking for differences between those logs, one important-seeming one is that the last-good log had 34 instances of this line:
Firefox didn't exit cleanly, not processing leak logs
...whereas the first bad log only has 1 instance of that line. So this might really have been a leak that we've been hitting for quite a while (maybe even on all platforms) but for some reason we're flagging the Firefox exit as not-being-clean and hence we skip leak-checking. Not sure why that WPT-sync push in particular made us start exiting cleanly more often (I guess), but that's probably a good thing in any case.
| Assignee | ||
Comment 10•17 days ago
•
|
||
We have possibly a version of the same leak in /css/printing here, though over there it was auto-annotated by the wpt sync bot:
https://searchfox.org/firefox-main/rev/ca47ec6bb1f3e46a384cd1ebb11c5996cceef90f/testing/web-platform/meta/css/printing/__dir__.ini#1
I think that was annotated based on Try runs like this one, from bug 2033390 comment 1:
https://treeherder.mozilla.org/logviewer?job_id=561198104&repo=try&task=UQ_GmZTXTCuOs_dVN5aAjA.0&lineNumber=12367
| Assignee | ||
Comment 11•17 days ago
|
||
(The patch I tested in comment 8 didn't work, I think because I was using leak-threshold: [default: ...] but in fact I think I need leak-threshold: [tab: ...] since the logs show "LEAK AND BLOAT STATISTICS, tab process", and the /css/printing annotation that I used in comment 10 has that syntax as well.)
| Assignee | ||
Comment 12•16 days ago
|
||
This WPT is observed to leak over 370,000 bytes in a perma-failing way
on Windows and macOS debug builds. I'm rounding up to 400000 in the
annotation, to account for that amount with some room for variability.
As discussed on Bug 2038448, we suspect this is due to a cycle of
print-related objects that can persist when using the system print dialog
and also using silent-printing (as this test does, via pref flips in its .ini file).
Ideally we should figure out a way to break this cycle during
shutdown, or allow the test to wait for printing to complete and the
cycle to be broken. But in the meantime it doesn't seem like this
leak impacts real users, so let's annotate the observed leak to stop
the test-failures.
To avoid losing strictness for other tests, I'm moving this test to
its own subdirectory so that new leak-tolerance is specific to this test.
| Assignee | ||
Comment 14•16 days ago
|
||
Patch posted to annotate the leak.
I filed 2040275 to investigate and address the leak, when cycles are available to do so in a way that feels low-risk enough (we don't want to inadvertently break real printing workflows, as some sorts of aggressive-cycle-breaking approaches would likely do)
Updated•16 days ago
|
| Comment hidden (Intermittent Failures Robot) |
Comment 16•16 days ago
|
||
Comment 17•16 days ago
|
||
| bugherder | ||
Comment 18•15 days ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined/Reason for urgency: Perma-orange test failure
- Code covered by automated testing?: yes
- Fix verified in Nightly?: yes
- Needs manual QE testing?: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: This is just moving a WPT and adjusting its annotation - no user-facing changes.
- String changes made/needed?: None
- Is Android affected?: no
| Assignee | ||
Comment 19•15 days ago
|
||
This WPT is observed to leak over 370,000 bytes in a perma-failing way
on Windows and macOS debug builds. I'm rounding up to 400000 in the
annotation, to account for that amount with some room for variability.
As discussed on Bug 2038448, we suspect this is due to a cycle of
print-related objects that can persist when using the system print dialog
and also using silent-printing (as this test does, via pref flips in its .ini file).
Ideally we should figure out a way to break this cycle during
shutdown, or allow the test to wait for printing to complete and the
cycle to be broken. But in the meantime it doesn't seem like this
leak impacts real users, so let's annotate the observed leak to stop
the test-failures.
To avoid losing strictness for other tests, I'm moving this test to
its own subdirectory so that new leak-tolerance is specific to this test.
Original Revision: https://phabricator.services.mozilla.com/D301007
Updated•15 days ago
|
Updated•15 days ago
|
Comment 20•15 days ago
|
||
| uplift | ||
Description
•