Closed Bug 2038448 Opened 25 days ago Closed 16 days ago

Intermittent leakcheck large nsGlobalWindowInner | /_mozilla/dom

Categories

(Toolkit :: Printing, defect, P5)

defect

Tracking

()

RESOLVED FIXED
153 Branch
Tracking Status
firefox152 --- fixed
firefox153 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dholbert)

References

(Blocks 2 open bugs)

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 1 obsolete file)

The WPT test /_mozilla/dom/window-print-mozPrintCallback-async-001.html (added in Bug 1980356) triggers a leak of nsGlobalWindowInner, nsPrintJob, and related objects when it happens to be the last test in a batch. The test calls window.print() with a mozPrintCallback, and because the Bug 1980356 fix makes window.print() non-blocking in that configuration, the print job runs asynchronously after the test completes. When the browser then exits, the print job is still in-flight and a reference cycle between nsDocumentViewer, nsPrintJob, and nsPagePrintTimer prevents everything from being cleaned up.

The fix has three parts, all in the printing code:

  1. nsPrintJob::Destroy() now explicitly stops and disconnects the page print timer before cleaning up, so any call to Destroy() breaks the reference cycle regardless of how it is reached.
  2. nsDocumentViewer::Destroy() now calls mPrintJob->Destroy() when the destroy is blocked by an in-flight print (mDestroyBlockedCount != 0). This is safe because in the window.print() flow, the print job lives on the static clone's document viewer, and Destroy() only reaches that viewer with a non-zero blocked count when the process is being torn down — not in normal user-facing scenarios like pressing Back.
  3. nsPrintJob::SetIsPrinting(false) now also clears mPreparingForPrint, fixing a second code path where an early IPC failure (before print layout begins) could leave mPreparingForPrint stuck at true, causing Destroy() to bail out prematurely via CheckBeforeDestroy().
Assignee: nobody → ryanvm
Blocks: 2034815
Summary: Intermittent leakcheck large nsGlobalWindowInner | mozilla/tests/dom → Intermittent leakcheck large nsGlobalWindowInner | /_mozilla/dom

window.print() with mozPrintCallback + print.prefer_system_dialog returns
non-blocking (Bug 1980356), so the print job runs asynchronously. The
nsPagePrintTimer holds nsDocumentViewer alive via mDocViewerPrint (nsCOMPtr),
while nsDocumentViewer holds nsPrintJob (RefPtr) which holds the timer (RefPtr).
This cycle is only broken by DisconnectPagePrintTimer(). If the browser tears
down before the print completes, nsDocumentViewer::Destroy() returns early
without breaking the cycle, so everything leaks.

Three changes:

nsPrintJob::Destroy() now calls Stop() + DisconnectPagePrintTimer() before
cleaning up, so any path that destroys the print job breaks the cycle. The
redundant DisconnectPagePrintTimer() call in ~nsPrintJob() is removed since
Destroy() handles it.

nsDocumentViewer::Destroy() now calls mPrintJob->Destroy() when blocked
(mDestroyBlockedCount != 0), breaking the cycle unconditionally. This is
safe: in the window.print() static-clone flow the print job lives on the
static clone's viewer, and Destroy() is only called on that viewer when
either (a) the print completed normally, in which case DonePrintingSheets()
already decremented mDestroyBlockedCount to 0 before Destroy() runs, or
(b) the process is being torn down, in which case aborting is correct. The
"don't unload while printing" comment's Back-button concern doesn't apply
here because pressing Back on the original page calls Destroy() on the
original viewer, which has no print job -- not on the static clone's
viewer where the job lives.

nsPrintJob::SetIsPrinting(false) now also clears mPreparingForPrint. When
the IPC channel for the remote print job fails before InitPrintDocConstruct()
runs (e.g. the PDF file descriptor is invalid on macOS CI), CleanupOnFailure()
calls SetIsPrinting(false) but mPreparingForPrint was never cleared by
InitPrintDocConstruct(). Destroy() would then return early at the
CheckBeforeDestroy() gate (mPreparingForPrint == true) even though the print
was already aborted, leaving the orphaned nsPrintJob/nsPrintData/nsPrintObject
alive at process exit.

The AppShutdown::IsInOrBeyond(AppShutdownConfirmed) gate tried earlier
didn't work: the content process receives the browser-element teardown IPC
before it receives RecvShutdown (which is what sets AppShutdownConfirmed),
so the gate was never true when Destroy() was actually called.

Attachment #9584825 - Attachment is obsolete: true

Hi Frank, this is currently permafailing on Tier-1 macOS 15 jobs in CI. I tried to find a path forward here but it got increasingly hairy and I'm at a point where I think someone from the Layout team should take it over. Can you please help find that person? Thanks!

Assignee: ryanvm → nobody
Flags: needinfo?(fgriffith)

Probably this makes sense for me to poke at, since I've been working with RyanVM on this.

A few other ideas for perhaps-more-targeted approaches than the fix approach that RyanVM+claude had going here:

(1) just break the cycle at shutdown time:

  • Make nsPagePrintTimer an xpcom-shutdown observer.
  • At xpcom-shutdown time, if we have an active nsPagePrintTimer, it'll now get an Observe callback, during which it should call its own Disconnect() method. This clears its strong mDocViewerPrint reference and decrements the mDestroyBlockedCount, which will break the cycle and will allow the nsDocumentViewer and nsPrintJob to be torn down.

(2) "wait it out" - IIUC the issue here is that we're shutting down while the test has some async printing code going on in the background. Maybe we should have the test wait until that async print operation finishes. (RyanVM tried something like this with claude, but it didn't pan out because the signal never arrived to the test to declare itself done, I think. Not sure if that was just a bug in the patch though.)

(3) Annotate the leak as allowed for now. I don't think we need to worry about it and the cost/benefit of impacting user-facing behavior is iffy, iven that I suspect this is a largely test-only issue -- in particular I think the leak only happens if both of the prefs print.prefer_system_dialog and print.always_print_silent are toggled. That's unlikely as a user configuration (basically "show me the system dialog" and "don't show me any dialog at all", neither of which are user-facing configuration options).

I'm tempted to start with (3) here and file a low-priority bug on the known leak potential.

Stepping away at the moment to attend an event, but I'll aim to do (3) later on tonight or over the weekend.

Assignee: nobody → dholbert
Flags: needinfo?(fgriffith) → needinfo?(dholbert)

FWIW it looks like this test is currently disabled on Windows x86 debug as well, for the same leak, as tracked in bug 2034619.

(disabling annotation added in https://hg.mozilla.org/integration/autoland/rev/b5890889f4ac / bug 2033390 comment 12)

Flags: needinfo?(dholbert)
See Also: → 2034619

As noted in bug 2034619 - for Windows x86 debug, this appears to have started on this large WPT-sync push near the end of April:
https://hg-edge.mozilla.org/integration/autoland/pushloghtml?changeset=c27a951b8452b5a86378ea76d5aa4eaab9079ec7

Here's the first-bad log (from that push):
https://treeherder.mozilla.org/logviewer?job_id=562049682&repo=autoland&task=dLBREYYxT42-SFzdsPkXaA.0&lineNumber=29414
And the last-good log (from the previous push):
https://treeherder.mozilla.org/logviewer?job_id=562049669&repo=autoland&task=XdyuMYZqSxWaVTzWPlQm8Q.0

Looking for differences between those logs, one important-seeming one is that the last-good log had 34 instances of this line:

Firefox didn't exit cleanly, not processing leak logs

...whereas the first bad log only has 1 instance of that line. So this might really have been a leak that we've been hitting for quite a while (maybe even on all platforms) but for some reason we're flagging the Firefox exit as not-being-clean and hence we skip leak-checking. Not sure why that WPT-sync push in particular made us start exiting cleanly more often (I guess), but that's probably a good thing in any case.

We have possibly a version of the same leak in /css/printing here, though over there it was auto-annotated by the wpt sync bot:
https://searchfox.org/firefox-main/rev/ca47ec6bb1f3e46a384cd1ebb11c5996cceef90f/testing/web-platform/meta/css/printing/__dir__.ini#1

I think that was annotated based on Try runs like this one, from bug 2033390 comment 1:
https://treeherder.mozilla.org/logviewer?job_id=561198104&repo=try&task=UQ_GmZTXTCuOs_dVN5aAjA.0&lineNumber=12367

(The patch I tested in comment 8 didn't work, I think because I was using leak-threshold: [default: ...] but in fact I think I need leak-threshold: [tab: ...] since the logs show "LEAK AND BLOAT STATISTICS, tab process", and the /css/printing annotation that I used in comment 10 has that syntax as well.)

This WPT is observed to leak over 370,000 bytes in a perma-failing way
on Windows and macOS debug builds. I'm rounding up to 400000 in the
annotation, to account for that amount with some room for variability.

As discussed on Bug 2038448, we suspect this is due to a cycle of
print-related objects that can persist when using the system print dialog
and also using silent-printing (as this test does, via pref flips in its .ini file).

Ideally we should figure out a way to break this cycle during
shutdown, or allow the test to wait for printing to complete and the
cycle to be broken. But in the meantime it doesn't seem like this
leak impacts real users, so let's annotate the observed leak to stop
the test-failures.

To avoid losing strictness for other tests, I'm moving this test to
its own subdirectory so that new leak-tolerance is specific to this test.

Duplicate of this bug: 2034619
Blocks: 2040275

Patch posted to annotate the leak.

I filed 2040275 to investigate and address the leak, when cycles are available to do so in a way that feels low-risk enough (we don't want to inadvertently break real printing workflows, as some sorts of aggressive-cycle-breaking approaches would likely do)

Attachment #9587166 - Attachment description: Bug 2038448: Move mozilla-specific mozPrintCallback WPT to its own directory and annotate an observed. r?emilio,#layout → Bug 2038448: Move mozilla-specific mozPrintCallback WPT to its own directory and annotate an observed leak. r?emilio,#layout
Pushed by dholbert@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/9a9fd9fdd2be https://hg.mozilla.org/integration/autoland/rev/4a915abe9c3b Move mozilla-specific mozPrintCallback WPT to its own directory and annotate an observed leak. r=emilio,layout-reviewers
Status: NEW → RESOLVED
Closed: 16 days ago
Resolution: --- → FIXED
Target Milestone: --- → 153 Branch

firefox-beta Uplift Approval Request

  • User impact if declined/Reason for urgency: Perma-orange test failure
  • Code covered by automated testing?: yes
  • Fix verified in Nightly?: yes
  • Needs manual QE testing?: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: This is just moving a WPT and adjusting its annotation - no user-facing changes.
  • String changes made/needed?: None
  • Is Android affected?: no
Attachment #9587522 - Flags: approval-mozilla-beta?

This WPT is observed to leak over 370,000 bytes in a perma-failing way
on Windows and macOS debug builds. I'm rounding up to 400000 in the
annotation, to account for that amount with some room for variability.

As discussed on Bug 2038448, we suspect this is due to a cycle of
print-related objects that can persist when using the system print dialog
and also using silent-printing (as this test does, via pref flips in its .ini file).

Ideally we should figure out a way to break this cycle during
shutdown, or allow the test to wait for printing to complete and the
cycle to be broken. But in the meantime it doesn't seem like this
leak impacts real users, so let's annotate the observed leak to stop
the test-failures.

To avoid losing strictness for other tests, I'm moving this test to
its own subdirectory so that new leak-tolerance is specific to this test.

Original Revision: https://phabricator.services.mozilla.com/D301007

Attachment #9587522 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: 2034619
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: