Closed Bug 1873068 Opened 1 year ago Closed 11 months ago

Parent process perma-hang, when an about:blank popup tries to alert()

Categories

(Core :: DOM: Navigation, defect)

defect

Tracking

()

RESOLVED FIXED
124 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix
firefox124 --- fixed

People

(Reporter: dholbert, Assigned: Gijs)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: hang, regression)

Attachments

(3 files)

STR:

  1. Load attached testcase.
  2. Click the button.

ACTUAL RESULTS:
Firefox parent-process hangs with 100% CPU. If I kill with kill -11 and inspect the crash report, we seem to be hanging in pagehide event-handlers.

EXPECTED RESULTS:
No hang. The popup should probably show the alert() that the testcase tries to spawn.

Regression range:
INFO: Last good revision: d0676cb0864b870062fed21bc900d6fbb3cf5670 (2022-02-16)
INFO: First bad revision: 2b187d932ca125670a883eb9973f69f75ecd1b69 (2022-02-17)
INFO: Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d0676cb0864b870062fed21bc900d6fbb3cf5670&tochange=2b187d932ca125670a883eb9973f69f75ecd1b69

In that range, I suspect this would've been from bug 1736570 since that was about the creation of about:blank documents (which is what we've just created in the popup, at the moment that we hang here).

hsivonen, maybe you could take a look here when you're back from PTO?

For reference, here are two crash reports that I generated while hanging (with kill -11, to see what the backtrace looks like):
bp-f65daf33-0ef1-44ff-a2c8-6b4500240104 (Nightly 99 from the first-bad-build, just after the regression)
bp-d5355841-70e9-49f7-9d6c-900ae0240104 (Current Nightly 123)

In both crashes, we seem to be spinning our wheels while running script in response to a page-hide notification.

Flags: needinfo?(hsivonen)

Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)

Keywords: hang
Attachment #9371135 - Attachment description: testcase 1 (warning, hangs Firefox) → testcase 1 (warning, hangs Firefox when you click the button)

If I put the alert behind a sufficiently-long setTimeout delay (5s in this example), there's no hang and the alert does show up as-expected.

Set release status flags based on info from the regressing bug 1736570

Pernosco trace in a debug build from today:
https://pernos.co/debug/9TMrUAjtpLL8X68WE2QWZQ/index.html#f{m[Addj,AA_,t[Jw,Ag7v_,f{e[AddZ,B43T4w_,s{af6vtYAAA,bCv0,uHXcd9A,oHXtYOw___/

That link is seeked to the moment where I did kill -11 to kill the hanging-with-100%-CPU parent process. The backtrace shows we're executing JS inside of nsDocumentViewer::PageHide, like in the crash reports that I generated above. If it's useful, call DumpJSStack() in a GDB pane (at this moment in the trace) shows this as the JS code:

0 open(aURL = ""chrome://global/content/commonDialog.xhtml"", (destructured parameter) = [unavailable], aParams = [unavailable]) ["resource://gre/modules/SubDialog.sys.mjs":145:1]
1 open/<([unavailable]) ["resource://gre/modules/SubDialog.sys.mjs":192:19]
2 close/<() ["resource://gre/modules/SubDialog.sys.mjs":322:21]

Thoughts so far:

What the regressor was meant to do was to avoid a DocumentChannel bounce for the parser-created about:blank that in Gecko (pending bug 543435) is loaded "immediately" on top of the initial about:blank. That is, the point was the make the second about:blank more immediate in terms of the number of event loop spins than what was happening with the DocumentChannel-caused event loop spins.

Therefore, in the content process, we should expect alert to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than with DocumentChannel (but in alignment with pre-DocumentChannel behavior).

The nsDocumentViewer::PageHide from which the problem occurs seems to be the PageHide for the second about:blank in a parent-process docshell that's being torn down in response to close() in SubDialog.sys.mjs. SubDialog.sys.mjs tries to defer this operation via an event loop spin, but there's also a microtask checkpoint involved. I don't see a nested event loop on the stack, but I don't know how our current alert() is supposed to work these days exactly relative to it having blocking semantics in content JS.

The answer here probably is to fortify the deferred method call mechanism for overlay removal to better make sure that the runnable runs cleanly from the top event loop of the parent and doesn't end up in some scriptrunner queue.

I don't have a concrete idea of what exactly to do. Perhaps Gijs has?

Flags: needinfo?(hsivonen) → needinfo?(gijskruitbosch+bugs)

(In reply to Henri Sivonen (:hsivonen) from comment #5)

Thoughts so far:

What the regressor was meant to do was to avoid a DocumentChannel bounce for the parser-created about:blank that in Gecko (pending bug 543435) is loaded "immediately" on top of the initial about:blank. That is, the point was the make the second about:blank more immediate in terms of the number of event loop spins than what was happening with the DocumentChannel-caused event loop spins.

Therefore, in the content process, we should expect alert to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than with DocumentChannel (but in alignment with pre-DocumentChannel behavior).

Sorry, I don't really understand this sentence.

The nsDocumentViewer::PageHide from which the problem occurs seems to be the PageHide for the second about:blank in a parent-process docshell that's being torn down in response to close() in SubDialog.sys.mjs. SubDialog.sys.mjs tries to defer this operation via an event loop spin, but there's also a microtask checkpoint involved. I don't see a nested event loop on the stack, but I don't know how our current alert() is supposed to work these days exactly relative to it having blocking semantics in content JS.

I believe there will be a nested event loop in the child, but there shouldn't be a nested event loop in the parent, I think.

A nested loop in JS code should have shown up in the stack in comment #4, I think (but doesn't, so I assume there isn't one).

The answer here probably is to fortify the deferred method call mechanism for overlay removal to better make sure that the runnable runs cleanly from the top event loop of the parent and doesn't end up in some scriptrunner queue.

Chrome JS code has no insight into whether or not there is a "scriptrunner queue" or a nested event loop. So I don't really understand what you envisage here?

I'm also not sure what "the runnable" is referring to here.

I don't have a concrete idea of what exactly to do. Perhaps Gijs has?

I don't, but also I'm missing what the "right" thing is that we're aiming for. Obviously the hang is not right, but the solution we want probably depends on what we're aiming to achieve.

AIUI, right now, the following happens:

  1. we load the initial about:blank in the new browsing context
  2. we open an alert with alert() coming from web content
  3. we navigate the new BC to a second about:blank which attempts to close the alert()
  4. we busy-hang (doing what, exactly?)

If someone has more details on (4) that would also be helpful. (I'm traveling today so my ability to repro and debug is limited.)

What do the specs / other browsers say/do? Is the alert meant to show in the first about:blank and block the subsequent load? Or in the second one? Or opened and then closed? I could see arguments for all of these...

Is it possible this is the same root issue as bug 1833875?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(hsivonen)

Set release status flags based on info from the regressing bug 1736570

(In reply to :Gijs (he/him) from comment #6)

Therefore, in the content process, we should expect alert to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than with DocumentChannel (but in alignment with pre-DocumentChannel behavior).

Sorry, I don't really understand this sentence.

The regressor changed the number of event loop spins in the child between window.open returning and the second about:blank materializing (to fewer event loop spins, like it was before DocumentChannel was introduced).

Chrome JS code has no insight into whether or not there is a "scriptrunner queue" or a nested event loop. So I don't really understand what you envisage here?

I'm hoping Services.tm.dispatchToMainThread would somehow ensure that what's dispatched gets run from the event loop directly and not from some weird other place as seen in the trace.

I'm also not sure what "the runnable" is referring to here.

I mean the thing that Services.tm.dispatchToMainThread dispatches to run the JS passed to Services.tm.dispatchToMainThread.

What do the specs / other browsers say/do?

We shouldn't have the second about:blank at all, but that's bug 543435.

Is it possible this is the same root issue as bug 1833875?

Quite possible.

Flags: needinfo?(hsivonen) → needinfo?(gijskruitbosch+bugs)

I'm still pretty lost here. Is the "second about:blank" load you're talking about happening in the content window (ie the window.open from the testcase), or in the parent process frame that we use for the innards of alert()?

The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).

(edit: and I'm assuming that your references to Services.tm.dispatchToMainThread are still referring to that same SubDialog code that already uses that method... but even that I'm not quite sure about)

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(hsivonen)

(In reply to :Gijs (he/him) from comment #9)

I'm still pretty lost here. Is the "second about:blank" load you're talking about happening in the content window (ie the window.open from the testcase), or in the parent process frame that we use for the innards of alert()?

It happens in both, but I suspect the child case is what matters for timing here. I believe in the child the second about:blank blows away the initial about:blank on which alert() was invoked. In the parent, the nsDocumentViewer::PageHide seen on the hang stack in the trace is for removing a frame loader holding such a second about:blank.

The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).

From the trace, it seems that the hang is while performing that frame removal. Now that I think about it more, it's probably not the removal that needs adjustment relative to where it runs from. Rather, the JS stack frames numbered 0 and 1 in comment 4 need explanation: Why does the removal end up calling open() in SubDialog.sys.mjs (via the pagehide handler for a second about:blank in the frame being removed)?

Flags: needinfo?(hsivonen) → needinfo?(gijskruitbosch+bugs)

(In reply to Henri Sivonen (:hsivonen) from comment #10)

(In reply to :Gijs (he/him) from comment #9)

The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).

From the trace, it seems that the hang is while performing that frame removal. Now that I think about it more, it's probably not the removal that needs adjustment relative to where it runs from. Rather, the JS stack frames numbered 0 and 1 in comment 4 need explanation: Why does the removal end up calling open() in SubDialog.sys.mjs (via the pagehide handler for a second about:blank in the frame being removed)?

See bug 1833875 comment 7.

This code was originally added in bug 1130411 and has since been reused for these dialogs. It's meant to deal with the case where we are trying to ask an existing (singleton) dialog to load some other URL, to make sure we finish closing the first URL before doing that.

It looks like here, the open code is async, so we start opening the dialog, then the dialog is closed while we're still trying to open it (waiting for the load event from the dialog's contents, to be precise), and then we try to delay opening until after we finish closing. But when we finish closing, _isClosing is not set back to false, so we never reopen and keep looping through "oh, we're closing, let's reopen when we're done" instead (and it would probably not be great if we opened on a different origin anyway, after closing).

Basically the code just does not expect close to be called before open has finished opening the dialog (for some meaning of "finished" - I haven't traced exactly how far we'd need to get for things to work "properly" - I expect we need at least some listeners to have been hooked up such that the dialog closing triggers appropriate teardown stuff and the _isClosing flag gets reset, cf. these two things which currently happen quite far apart.

Flags: needinfo?(gijskruitbosch+bugs)

The early closing checks were added when SubDialog was only used in Firefox's
settings UI, and instances were reused. It doesn't look like we still reuse
subdialog instances anyway, so I don't think the code needs to cater for this
at all. Conveniently, this also addresses the hang.

It does not cause the alert to actually appear (hence the todo_is checking for
the dialog which isn't there right now), but fixing that appears to
be a problem to do with about:blank creation that will be significantly
trickier to resolve.

Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Severity: -- → S2
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/1d7c6ca36b30 remove broken SubDialog code to ensure alerts() from freshly opened about:blank documents do not hang, r=pbz,hsivonen
Blocks: 1833875
Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch

(In reply to Daniel Holbert [:dholbert] from comment #1)

Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)

Daniel, I think we've "un-regressed" back to this state now. I can file a separate bug for this but perhaps it's already on file - do you know? bug 1577903 looks close but not quite the same thing (given the discussion in the bug).

Flags: needinfo?(dholbert)
Blocks: 1437247
Blocks: 1880139

(In reply to :Gijs (he/him) from comment #15)

(In reply to Daniel Holbert [:dholbert] from comment #1)

Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)

Daniel, I think we've "un-regressed" back to this state now. I can file a separate bug for this but perhaps it's already on file - do you know?

Thanks! I'm not aware of one, so I filed bug 1880139.

Flags: needinfo?(dholbert)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: