Closed Bug 1777198 Opened 2 years ago Closed 1 year ago

Cancel content process JS execution on shutdown

Categories

(Core :: DOM: Content Processes, task, P3)

task

Tracking

()

RESOLVED FIXED
106 Branch

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Bug 1755376 showed that raising the priority alone will not help. We suspect in many cases that long running JS execution is preventing the main thread event loop from spinning, such that any event will starve before we timeout.

We want to cancel this.

Severity: -- → S3
Priority: -- → P3
See Also: → 1755376
Attachment #9283353 - Attachment description: WIP: Bug 1777198 - Have a long running content JS hang test. → WIP: Bug 1777198 - Have a long running content JS shutdown hang test.
See Also: → 1778438

What I learned so far:

  • The testcase confirms, that an endlessly running JS will prevent us from shutting down.
  • We have some weirdness in our shutdown order, some necessary and some historically grown.
  • The test needs some polishing
    • The arbitrary timeout should be replaced by some messaging/event logic
    • The test does not succeed (but also does not fail). Not sure how easy it is to make that happen on shutdown.

What the patch does:

  • Add each ContentParent also as blocker to phase quitApplicationGranted
  • On quitApplicationGranted, just do NotifyImpendingShutdown, remove that blocker (and thus wait for the next shutdown phase for the real shutdown)
  • The child then receives NotifiedImpendingShutdown and sets an atomic flag accessible via ProcessChild::ExpectingShutdown()
  • XPCJSContext's WatchdogMain checks ProcessChild::ExpectingShutdown() on wakeup and in case checks each context if it is running JS for more than a second. If yes, it issues an interrupt.
  • The then issued InterruptCallback checks again for ProcessChild::ExpectingShutdown() and in case signals an unconditional cancel.

What could be improved:

  • There is no clear reason, why extensions should shutdown earlier than the content processes. IIUC, this creates only useless noise during extensions shutdown. We might want to think about changing that order. And having that timeout logic that spans over more than one phase might even have been a way to paper over unresponsive content processes?
  • There might be nicer ways of canceling than squeezing this into the WatchdogMain, though it seemed to be the less invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: With the current values we can end up waiting up to three seconds before a blocking JS is canceled. This is an eternity on a modern computer (consider we are issuing a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang.
  • I am puzzled by this false together with the previous true. IIUC ForAllActiveContexts breaks its loop on false, so if the first context in the list was not running for too long, we won't check the other contextes? I'd expect if ever to be it the other way round, such that we ask for one interrupt at a time but check all contextes.

So https://treeherder.mozilla.org/logviewer?job_id=383593877&repo=try&lineNumber=8731-8740 suggests me two things:

  • When we have a mixed stack JS/C++ like for AsyncShutdown, we seem to not reset the running timers at each language boundary, resulting in longer JS execution times (which is probably fine, a more interesting boundary might be if we spin the event loop in between)
  • I should probably find a way to exclude system calls. I tried XPCJSContext::IsSystemCaller but that seems to moot also my test case. Not sure if it is just a problem of the testcase, though.

Update: What the patch does

  • Add each ContentParent also as blocker to phase quitApplicationGranted
  • On quitApplicationGranted, just do NotifyImpendingShutdown, remove that blocker (and thus wait for the next shutdown phase for the real shutdown)
  • The child then receives NotifiedImpendingShutdown, sets an atomic flag accessible via ProcessChild::ExpectingShutdown() and calls the virtual NotifyImpendingShutdown on its instance.
  • ContentProcess::NotifyImpendingShutdown looks at the pref dom.abort_script_on_child_shutdown and in case sets an appropriate crash annotation. TODO: We want to find a way to reduce the dom.max_script_run_time timeout here.
  • XPCJSContext's WatchdogMain will then call HangMonitorChild::InterruptCallback, as before
  • HangMonitorChild::InterruptCallback looks at the pref dom.abort_script_on_child_shutdown and the ProcessChild::ExpectingShutdown() flag, checks if we are running content JS and in case returns directly, signaling the JS engine to abort.
Assignee: nobody → jstutte
Attachment #9283353 - Attachment description: WIP: Bug 1777198 - Have a long running content JS shutdown hang test. → Bug 1777198 - Have a long running content JS shutdown hang test. r?smaug
Status: NEW → ASSIGNED
Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. → Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. r?smaug,nika
Attachment #9283435 - Attachment description: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. r?smaug,nika → WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown.
Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. → WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted.
Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted. → WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown.
Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. → Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. r?smaug
Attachment #9283435 - Attachment description: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. r?smaug → Bug 1777198 - Cancel content JS execution on quit-application-granted or potential content process shutdown. r?smaug
Attachment #9283435 - Attachment description: Bug 1777198 - Cancel content JS execution on quit-application-granted or potential content process shutdown. r?smaug → Bug 1777198 - Cancel content JS execution on quit-application-granted or on normal content process shutdown. r?smaug
Keywords: leave-open
See Also: → 1779969

(In reply to Jens Stutte [:jstutte] from comment #3)

What could be improved:

  • There is no clear reason, why extensions should shutdown earlier than the content processes. IIUC, this creates only useless noise during extensions shutdown. We might want to think about changing that order. And having that timeout logic that spans over more than one phase might even have been a way to paper over unresponsive content processes?

I filed bug 1779969 for further investigations.

  • There might be nicer ways of canceling than squeezing this into the WatchdogMain, though it seemed to be the less invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: With the current values we can end up waiting up to three seconds before a blocking JS is canceled. This is an eternity on a modern computer (consider we are issuing a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang.

We actually handle this better now and request an interrupt immediately, thus not relying on the timeouts of the Watchdog.

  • I am puzzled by this false together with the previous true. IIUC ForAllActiveContexts breaks its loop on false, so if the first context in the list was not running for too long, we won't check the other contextes? I'd expect if ever to be it the other way round, such that we ask for one interrupt at a time but check all contextes.

I filed bug 1778696 for this.

See Also: → 1778696
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/05afbf32acea
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/1bf944a0828d
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug
Flags: needinfo?(jstutte)

So I triggered a pernosco session and the failure reproduces also there. I see that apparently StaticPrefs::dom_abort_script_on_child_shutdown is set in that run, and in that task log I see an execution of our test.

I assume the pref just remains set after test runs, which is not what we wanted. I see no possibility to define prefs inside the mochitest.ini but we have this for crashtests. So we probably want to transform that test into a crashtest.

Scratch that, not sure what I was looking at the other week, but AFAICS my test is just leaking.

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0b9cb5b44360
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/472fe2d7af01
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug

For the records: If you ever want to do a complete shutdown in a test, you will probably want to use a marionette test like this, our other test harnesses are not really prepared to see this without alarm. Fortunately here we can just rely on shutting down the content process only.

Backed out 2 changesets (bug 1777198) for causing build bustage in dom/ipc/ProcessHangMonitor.cpp

Backout link: https://hg.mozilla.org/integration/autoland/rev/f788858ac268c25b4bc573d4a2642df44af22daa

Push with failures

Failure log

 ERROR -  /builds/worker/checkouts/gecko/dom/ipc/ProcessHangMonitor.cpp:907:6: error: 'void {anonymous}::HangMonitorParent::RequestContentJSInterrupt()' defined but not used [-Werror=unused-function]
Flags: needinfo?(jstutte)

Ups, there was an unused function, sorry. Updating also the test for bug 1782684 and bug 1782718.

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4fbf0aea0b4a
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/359b2b2e2755
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug
Regressions: 1782718
Attachment #9285479 - Attachment description: WIP: Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. → Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. r?smaug
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/39ece1717b90
Enable dom.abort_script_on_child_shutdown in nightly. r=smaug
Attachment #9291968 - Attachment description: Bug 1777198 - Improve IPCShutdownState annotation. r?smaug → Bug 1777198 - Improve IPCShutdownState annotation. r?gsvelto
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cebcfdd9245
Improve IPCShutdownState annotation. r=gsvelto

Backed out changeset 6cebcfdd9245 (bug 1777198) for causing build bustages.

Backout link: https://hg.mozilla.org/integration/autoland/rev/c15c974895094711f51f63923ce7b39e9a26c6d2

Push with failures

Failure log

lld-link: error: undefined symbol: enum nsresult __cdecl CrashReporter::AppendToCrashReportAnnotation(enum CrashReporter::Annotation, class nsTSubstring<char> const &)
Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/546cb58a2459
Improve IPCShutdownState annotation. r=gsvelto
Flags: needinfo?(jstutte)
Blocks: 1789231
No longer blocks: IPCError_ShutDownKill

Just as a reminder: This bug is still open as we did not flip dom.abort_script_on_child_shutdown for release, yet.

Blocks: 1813602
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: