Cancel content process JS execution on shutdown
Categories
(Core :: DOM: Content Processes, task, P3)
Tracking
()
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
Bug 1755376 showed that raising the priority alone will not help. We suspect in many cases that long running JS execution is preventing the main thread event loop from spinning, such that any event will starve before we timeout.
We want to cancel this.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
|
||
Updated•2 years ago
|
Assignee | ||
Comment 2•2 years ago
|
||
Depends on D150539
Assignee | ||
Comment 3•2 years ago
•
|
||
What I learned so far:
- The testcase confirms, that an endlessly running JS will prevent us from shutting down.
- We have some weirdness in our shutdown order, some necessary and some historically grown.
- Extensions start to shutdown at "quit-application-granted" but their shutdown is not bound to a single phase and can last until "profile-change-teardown". Extensions broadcast shutdown messages to all children that the extension is going away and then waits for all of them to acknowledge the message. Thus a blocking JS in a content process will make timeout extensions shutdown.
SessionStore
wants to flush all windows on "quit-application-granted". This requires an interaction with content process, too. Again a blocking JS in a content process will make timeoutSessionStore
shutdown. It seems clear that we want this to succeed instead in order to ensure we can save our session state.
- The test needs some polishing
- The arbitrary timeout should be replaced by some messaging/event logic
- The test does not succeed (but also does not fail). Not sure how easy it is to make that happen on shutdown.
What the patch does:
- Add each
ContentParent
also as blocker to phasequitApplicationGranted
- On
quitApplicationGranted
, just doNotifyImpendingShutdown
, remove that blocker (and thus wait for the next shutdown phase for the real shutdown) - The child then receives
NotifiedImpendingShutdown
and sets an atomic flag accessible viaProcessChild::ExpectingShutdown()
XPCJSContext
'sWatchdogMain
checksProcessChild::ExpectingShutdown()
on wakeup and in case checks each context if it is running JS for more than a second. If yes, it issues an interrupt.- The then issued
InterruptCallback
checks again forProcessChild::ExpectingShutdown()
and in case signals an unconditional cancel.
What could be improved:
- There is no clear reason, why extensions should shutdown earlier than the content processes. IIUC, this creates only useless noise during extensions shutdown. We might want to think about changing that order. And having that timeout logic that spans over more than one phase might even have been a way to paper over unresponsive content processes?
- There might be nicer ways of canceling than squeezing this into the
WatchdogMain
, though it seemed to be the less invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: With the current values we can end up waiting up to three seconds before a blocking JS is canceled. This is an eternity on a modern computer (consider we are issuing a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang. - I am puzzled by this false together with the previous true. IIUC
ForAllActiveContexts
breaks its loop on false, so if the first context in the list was not running for too long, we won't check the other contextes? I'd expect if ever to be it the other way round, such that we ask for one interrupt at a time but check all contextes.
Assignee | ||
Comment 4•2 years ago
•
|
||
So https://treeherder.mozilla.org/logviewer?job_id=383593877&repo=try&lineNumber=8731-8740 suggests me two things:
- When we have a mixed stack JS/C++ like for
AsyncShutdown
, we seem to not reset the running timers at each language boundary, resulting in longer JS execution times (which is probably fine, a more interesting boundary might be if we spin the event loop in between) - I should probably find a way to exclude system calls. I tried
XPCJSContext::IsSystemCaller
but that seems to moot also my test case. Not sure if it is just a problem of the testcase, though.
Assignee | ||
Comment 5•2 years ago
•
|
||
Update: What the patch does
- Add each
ContentParent
also as blocker to phasequitApplicationGranted
- On
quitApplicationGranted
, just doNotifyImpendingShutdown
, remove that blocker (and thus wait for the next shutdown phase for the real shutdown) - The child then receives
NotifiedImpendingShutdown
, sets an atomic flag accessible viaProcessChild::ExpectingShutdown()
and calls the virtualNotifyImpendingShutdown
on its instance. ContentProcess::NotifyImpendingShutdown
looks at the prefdom.abort_script_on_child_shutdown
and in case sets an appropriate crash annotation. TODO: We want to find a way to reduce thedom.max_script_run_time
timeout here.XPCJSContext
'sWatchdogMain
will then callHangMonitorChild::InterruptCallback
, as beforeHangMonitorChild::InterruptCallback
looks at the prefdom.abort_script_on_child_shutdown
and theProcessChild::ExpectingShutdown()
flag, checks if we are running content JS and in case returns directly, signaling the JS engine to abort.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 6•2 years ago
|
||
Depends on D150598
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 7•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #3)
What could be improved:
- There is no clear reason, why extensions should shutdown earlier than the content processes. IIUC, this creates only useless noise during extensions shutdown. We might want to think about changing that order. And having that timeout logic that spans over more than one phase might even have been a way to paper over unresponsive content processes?
I filed bug 1779969 for further investigations.
- There might be nicer ways of canceling than squeezing this into the
WatchdogMain
, though it seemed to be the less invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: With the current values we can end up waiting up to three seconds before a blocking JS is canceled. This is an eternity on a modern computer (consider we are issuing a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang.
We actually handle this better now and request an interrupt immediately, thus not relying on the timeouts of the Watchdog.
- I am puzzled by this false together with the previous true. IIUC
ForAllActiveContexts
breaks its loop on false, so if the first context in the list was not running for too long, we won't check the other contextes? I'd expect if ever to be it the other way round, such that we ask for one interrupt at a time but check all contextes.
I filed bug 1778696 for this.
Comment 9•2 years ago
|
||
Backed out for causing leakcheck failures
Backout link: https://hg.mozilla.org/integration/autoland/rev/522a729cf9c677a19622025c14028eaa805e952b
Assignee | ||
Comment 10•2 years ago
•
|
||
So I triggered a pernosco session and the failure reproduces also there. I see that apparently StaticPrefs::dom_abort_script_on_child_shutdown
is set in that run, and in that task log I see an execution of our test.
I assume the pref just remains set after test runs, which is not what we wanted. I see no possibility to define prefs inside the mochitest.ini but we have this for crashtests. So we probably want to transform that test into a crashtest.
Scratch that, not sure what I was looking at the other week, but AFAICS my test is just leaking.
Comment 11•2 years ago
|
||
Assignee | ||
Comment 12•2 years ago
|
||
For the records: If you ever want to do a complete shutdown in a test, you will probably want to use a marionette test like this, our other test harnesses are not really prepared to see this without alarm. Fortunately here we can just rely on shutting down the content process only.
Comment 13•2 years ago
|
||
Backed out 2 changesets (bug 1777198) for causing build bustage in dom/ipc/ProcessHangMonitor.cpp
Backout link: https://hg.mozilla.org/integration/autoland/rev/f788858ac268c25b4bc573d4a2642df44af22daa
ERROR - /builds/worker/checkouts/gecko/dom/ipc/ProcessHangMonitor.cpp:907:6: error: 'void {anonymous}::HangMonitorParent::RequestContentJSInterrupt()' defined but not used [-Werror=unused-function]
Assignee | ||
Comment 14•2 years ago
|
||
Ups, there was an unused function, sorry. Updating also the test for bug 1782684 and bug 1782718.
Comment 15•2 years ago
|
||
Comment 16•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 17•2 years ago
|
||
Comment 18•2 years ago
|
||
bugherder |
Assignee | ||
Comment 19•2 years ago
|
||
Updated•2 years ago
|
Comment 20•2 years ago
|
||
Comment 21•2 years ago
|
||
Backed out changeset 6cebcfdd9245 (bug 1777198) for causing build bustages.
Backout link: https://hg.mozilla.org/integration/autoland/rev/c15c974895094711f51f63923ce7b39e9a26c6d2
lld-link: error: undefined symbol: enum nsresult __cdecl CrashReporter::AppendToCrashReportAnnotation(enum CrashReporter::Annotation, class nsTSubstring<char> const &)
Comment 22•2 years ago
|
||
Assignee | ||
Updated•2 years ago
|
Comment 23•2 years ago
|
||
bugherder |
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 24•2 years ago
|
||
Just as a reminder: This bug is still open as we did not flip dom.abort_script_on_child_shutdown for release, yet.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Updated•2 years ago
|
Description
•