Parent process perma-hang, when an about:blank popup tries to alert()
Categories
(Core :: DOM: Navigation, defect)
Tracking
()
People
(Reporter: dholbert, Assigned: Gijs)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: hang, regression)
Attachments
(3 files)
STR:
- Load attached testcase.
- Click the button.
ACTUAL RESULTS:
Firefox parent-process hangs with 100% CPU. If I kill with kill -11
and inspect the crash report, we seem to be hanging in pagehide event-handlers.
EXPECTED RESULTS:
No hang. The popup should probably show the alert() that the testcase tries to spawn.
Regression range:
INFO: Last good revision: d0676cb0864b870062fed21bc900d6fbb3cf5670 (2022-02-16)
INFO: First bad revision: 2b187d932ca125670a883eb9973f69f75ecd1b69 (2022-02-17)
INFO: Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d0676cb0864b870062fed21bc900d6fbb3cf5670&tochange=2b187d932ca125670a883eb9973f69f75ecd1b69
In that range, I suspect this would've been from bug 1736570 since that was about the creation of about:blank
documents (which is what we've just created in the popup, at the moment that we hang here).
hsivonen, maybe you could take a look here when you're back from PTO?
For reference, here are two crash reports that I generated while hanging (with kill -11, to see what the backtrace looks like):
bp-f65daf33-0ef1-44ff-a2c8-6b4500240104 (Nightly 99 from the first-bad-build, just after the regression)
bp-d5355841-70e9-49f7-9d6c-900ae0240104 (Current Nightly 123)
In both crashes, we seem to be spinning our wheels while running script in response to a page-hide notification.
Reporter | ||
Comment 1•1 year ago
|
||
Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)
Reporter | ||
Updated•1 year ago
|
Reporter | ||
Comment 2•1 year ago
•
|
||
If I put the alert behind a sufficiently-long setTimeout delay (5s in this example), there's no hang and the alert does show up as-expected.
Comment 3•1 year ago
|
||
Set release status flags based on info from the regressing bug 1736570
Updated•1 year ago
|
Reporter | ||
Comment 4•1 year ago
•
|
||
Pernosco trace in a debug build from today:
https://pernos.co/debug/9TMrUAjtpLL8X68WE2QWZQ/index.html#f{m[Addj,AA_,t[Jw,Ag7v_,f{e[AddZ,B43T4w_,s{af6vtYAAA,bCv0,uHXcd9A,oHXtYOw___/
That link is seeked to the moment where I did kill -11
to kill the hanging-with-100%-CPU parent process. The backtrace shows we're executing JS inside of nsDocumentViewer::PageHide
, like in the crash reports that I generated above. If it's useful, call DumpJSStack()
in a GDB pane (at this moment in the trace) shows this as the JS code:
0 open(aURL = ""chrome://global/content/commonDialog.xhtml"", (destructured parameter) = [unavailable], aParams = [unavailable]) ["resource://gre/modules/SubDialog.sys.mjs":145:1]
1 open/<([unavailable]) ["resource://gre/modules/SubDialog.sys.mjs":192:19]
2 close/<() ["resource://gre/modules/SubDialog.sys.mjs":322:21]
Comment 5•1 year ago
|
||
Thoughts so far:
What the regressor was meant to do was to avoid a DocumentChannel bounce for the parser-created about:blank that in Gecko (pending bug 543435) is loaded "immediately" on top of the initial about:blank. That is, the point was the make the second about:blank more immediate in terms of the number of event loop spins than what was happening with the DocumentChannel-caused event loop spins.
Therefore, in the content process, we should expect alert
to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than with DocumentChannel
(but in alignment with pre-DocumentChannel
behavior).
The nsDocumentViewer::PageHide
from which the problem occurs seems to be the PageHide
for the second about:blank in a parent-process docshell that's being torn down in response to close()
in SubDialog.sys.mjs
. SubDialog.sys.mjs
tries to defer this operation via an event loop spin, but there's also a microtask checkpoint involved. I don't see a nested event loop on the stack, but I don't know how our current alert()
is supposed to work these days exactly relative to it having blocking semantics in content JS.
The answer here probably is to fortify the deferred method call mechanism for overlay removal to better make sure that the runnable runs cleanly from the top event loop of the parent and doesn't end up in some scriptrunner queue.
I don't have a concrete idea of what exactly to do. Perhaps Gijs has?
Assignee | ||
Comment 6•1 year ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #5)
Thoughts so far:
What the regressor was meant to do was to avoid a DocumentChannel bounce for the parser-created about:blank that in Gecko (pending bug 543435) is loaded "immediately" on top of the initial about:blank. That is, the point was the make the second about:blank more immediate in terms of the number of event loop spins than what was happening with the DocumentChannel-caused event loop spins.
Therefore, in the content process, we should expect
alert
to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than withDocumentChannel
(but in alignment with pre-DocumentChannel
behavior).
Sorry, I don't really understand this sentence.
The
nsDocumentViewer::PageHide
from which the problem occurs seems to be thePageHide
for the second about:blank in a parent-process docshell that's being torn down in response toclose()
inSubDialog.sys.mjs
.SubDialog.sys.mjs
tries to defer this operation via an event loop spin, but there's also a microtask checkpoint involved. I don't see a nested event loop on the stack, but I don't know how our currentalert()
is supposed to work these days exactly relative to it having blocking semantics in content JS.
I believe there will be a nested event loop in the child, but there shouldn't be a nested event loop in the parent, I think.
A nested loop in JS code should have shown up in the stack in comment #4, I think (but doesn't, so I assume there isn't one).
The answer here probably is to fortify the deferred method call mechanism for overlay removal to better make sure that the runnable runs cleanly from the top event loop of the parent and doesn't end up in some scriptrunner queue.
Chrome JS code has no insight into whether or not there is a "scriptrunner queue" or a nested event loop. So I don't really understand what you envisage here?
I'm also not sure what "the runnable" is referring to here.
I don't have a concrete idea of what exactly to do. Perhaps Gijs has?
I don't, but also I'm missing what the "right" thing is that we're aiming for. Obviously the hang is not right, but the solution we want probably depends on what we're aiming to achieve.
AIUI, right now, the following happens:
- we load the initial about:blank in the new browsing context
- we open an alert with
alert()
coming from web content - we navigate the new BC to a second
about:blank
which attempts to close thealert()
- we busy-hang (doing what, exactly?)
If someone has more details on (4) that would also be helpful. (I'm traveling today so my ability to repro and debug is limited.)
What do the specs / other browsers say/do? Is the alert meant to show in the first about:blank and block the subsequent load? Or in the second one? Or opened and then closed? I could see arguments for all of these...
Is it possible this is the same root issue as bug 1833875?
Comment 7•1 year ago
|
||
Set release status flags based on info from the regressing bug 1736570
Comment 8•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #6)
Therefore, in the content process, we should expect
alert
to run in the context of the initial about:blank and the regressor causing the initial about:blank in content to be navigated away from sooner than withDocumentChannel
(but in alignment with pre-DocumentChannel
behavior).Sorry, I don't really understand this sentence.
The regressor changed the number of event loop spins in the child between window.open
returning and the second about:blank materializing (to fewer event loop spins, like it was before DocumentChannel
was introduced).
Chrome JS code has no insight into whether or not there is a "scriptrunner queue" or a nested event loop. So I don't really understand what you envisage here?
I'm hoping Services.tm.dispatchToMainThread
would somehow ensure that what's dispatched gets run from the event loop directly and not from some weird other place as seen in the trace.
I'm also not sure what "the runnable" is referring to here.
I mean the thing that Services.tm.dispatchToMainThread
dispatches to run the JS passed to Services.tm.dispatchToMainThread
.
What do the specs / other browsers say/do?
We shouldn't have the second about:blank at all, but that's bug 543435.
Is it possible this is the same root issue as bug 1833875?
Quite possible.
Assignee | ||
Comment 9•1 year ago
•
|
||
I'm still pretty lost here. Is the "second about:blank" load you're talking about happening in the content window (ie the window.open
from the testcase), or in the parent process frame that we use for the innards of alert()
?
The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).
(edit: and I'm assuming that your references to Services.tm.dispatchToMainThread
are still referring to that same SubDialog code that already uses that method... but even that I'm not quite sure about)
Comment 10•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #9)
I'm still pretty lost here. Is the "second about:blank" load you're talking about happening in the content window (ie the
window.open
from the testcase), or in the parent process frame that we use for the innards ofalert()
?
It happens in both, but I suspect the child case is what matters for timing here. I believe in the child the second about:blank blows away the initial about:blank on which alert()
was invoked. In the parent, the nsDocumentViewer::PageHide
seen on the hang stack in the trace is for removing a frame loader holding such a second about:blank.
The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).
From the trace, it seems that the hang is while performing that frame removal. Now that I think about it more, it's probably not the removal that needs adjustment relative to where it runs from. Rather, the JS stack frames numbered 0 and 1 in comment 4 need explanation: Why does the removal end up calling open()
in SubDialog.sys.mjs
(via the pagehide handler for a second about:blank
in the frame being removed)?
Assignee | ||
Comment 11•1 year ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #10)
(In reply to :Gijs (he/him) from comment #9)
The code you linked in SubDialog.sys.mjs is trying to avoid destroying the frame containing the commonDialog.xhtml doc before its unload code has had a chance to run. It shouldn't have anything to do with about:blank things, and because it runs in the parent I don't understand why its event loops would matter vs. what is going on with the content document (as they're happening in completely different processes).
From the trace, it seems that the hang is while performing that frame removal. Now that I think about it more, it's probably not the removal that needs adjustment relative to where it runs from. Rather, the JS stack frames numbered 0 and 1 in comment 4 need explanation: Why does the removal end up calling
open()
inSubDialog.sys.mjs
(via the pagehide handler for a secondabout:blank
in the frame being removed)?
This code was originally added in bug 1130411 and has since been reused for these dialogs. It's meant to deal with the case where we are trying to ask an existing (singleton) dialog to load some other URL, to make sure we finish closing the first URL before doing that.
It looks like here, the open
code is async, so we start opening the dialog, then the dialog is closed while we're still trying to open it (waiting for the load
event from the dialog's contents, to be precise), and then we try to delay opening until after we finish closing. But when we finish closing, _isClosing
is not set back to false, so we never reopen and keep looping through "oh, we're closing, let's reopen when we're done" instead (and it would probably not be great if we opened on a different origin anyway, after closing).
Basically the code just does not expect close
to be called before open
has finished opening the dialog (for some meaning of "finished" - I haven't traced exactly how far we'd need to get for things to work "properly" - I expect we need at least some listeners to have been hooked up such that the dialog closing triggers appropriate teardown stuff and the _isClosing
flag gets reset, cf. these two things which currently happen quite far apart.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 12•1 year ago
|
||
The early closing checks were added when SubDialog was only used in Firefox's
settings UI, and instances were reused. It doesn't look like we still reuse
subdialog instances anyway, so I don't think the code needs to cater for this
at all. Conveniently, this also addresses the hang.
It does not cause the alert to actually appear (hence the todo_is checking for
the dialog which isn't there right now), but fixing that appears to
be a problem to do with about:blank creation that will be significantly
trickier to resolve.
Updated•1 year ago
|
Updated•1 year ago
|
Updated•11 months ago
|
Comment 13•11 months ago
|
||
Comment 14•11 months ago
|
||
bugherder |
Assignee | ||
Comment 15•11 months ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #1)
Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)
Daniel, I think we've "un-regressed" back to this state now. I can file a separate bug for this but perhaps it's already on file - do you know? bug 1577903 looks close but not quite the same thing (given the discussion in the bug).
Updated•11 months ago
|
Reporter | ||
Comment 16•11 months ago
|
||
(In reply to :Gijs (he/him) from comment #15)
(In reply to Daniel Holbert [:dholbert] from comment #1)
Regression-range-note: before the regression, we don't show the alert() inside of the popup at all, which is also a bug. (But it's better than hanging which is what the regression-range is highlighting.)
Daniel, I think we've "un-regressed" back to this state now. I can file a separate bug for this but perhaps it's already on file - do you know?
Thanks! I'm not aware of one, so I filed bug 1880139.
Description
•