Closed Bug 1777198 Opened 2 years ago Closed 2 years ago

Cancel content process JS execution on shutdown

Status: RESOLVED FIXED

Milestone: 106 Branch

People: (Reporter: jstutte, Assigned: jstutte)

References: (Blocks 1 open bug)

Attachments (4 files)

- Bug 1777198 - Have a long running content JS shutdown hang test. r?smaug (Jens Stutte [:jstutte], 2 years ago; 48 bytes, text/x-phabricator-request)
- Bug 1777198 - Cancel content JS execution on quit-application-granted or on normal content process shutdown. r?smaug (Jens Stutte [:jstutte], 2 years ago; 48 bytes, text/x-phabricator-request)
- Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. r?smaug (Jens Stutte [:jstutte], 2 years ago; 48 bytes, text/x-phabricator-request)
- Bug 1777198 - Improve IPCShutdownState annotation. r?gsvelto (Jens Stutte [:jstutte], 2 years ago; 48 bytes, text/x-phabricator-request)

Jens Stutte [:jstutte]

Assignee

Description

•

2 years ago

Bug 1755376 showed that raising the priority alone does not help. We suspect that in many cases long-running JS execution prevents the main-thread event loop from spinning, so that any event starves before we time out.

We want to cancel that JS execution.

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P3

Jens Stutte [:jstutte]

Assignee

Comment 1

•

2 years ago

Attached file Bug 1777198 - Have a long running content JS shutdown hang test. r?smaug — Details

Phabricator Automation

Updated

•

2 years ago

Attachment #9283353 - Attachment description: WIP: Bug 1777198 - Have a long running content JS hang test. → WIP: Bug 1777198 - Have a long running content JS shutdown hang test.

Jens Stutte [:jstutte]

Assignee

Comment 2

•

2 years ago

Attached file Bug 1777198 - Cancel content JS execution on quit-application-granted or on normal content process shutdown. r?smaug — Details

Depends on D150539

Jens Stutte [:jstutte]

Assignee

Comment 3

•

2 years ago

•

Edited

What I learned so far:

- The testcase confirms that endlessly running JS will prevent us from shutting down.
- We have some weirdness in our shutdown order, some of it necessary and some historically grown.
  - Extensions start to shut down at "quit-application-granted", but their shutdown is not bound to a single phase and can last until "profile-change-teardown". An extension broadcasts a message to all children that it is going away and then waits for all of them to acknowledge that message. Thus blocking JS in a content process will make extension shutdown time out.
  - SessionStore wants to flush all windows on "quit-application-granted". This requires interaction with the content processes, too. Again, blocking JS in a content process will make SessionStore shutdown time out. It seems clear that we want this to succeed instead, to ensure we can save our session state.
- The test needs some polishing.
  - The arbitrary timeout should be replaced by some messaging/event logic.
  - The test does not succeed (but also does not fail). Not sure how easy it is to make that happen on shutdown.

What the patch does:

- Add each ContentParent also as a blocker to phase quitApplicationGranted.
- On quitApplicationGranted, just do NotifyImpendingShutdown and remove that blocker (and thus wait for the next shutdown phase for the real shutdown).
- The child then receives NotifiedImpendingShutdown and sets an atomic flag accessible via ProcessChild::ExpectingShutdown().
- XPCJSContext's WatchdogMain checks ProcessChild::ExpectingShutdown() on wakeup and, if it is set, checks for each context whether it has been running JS for more than a second. If so, it requests an interrupt.
- The InterruptCallback triggered by that request checks ProcessChild::ExpectingShutdown() again and, if it is set, signals an unconditional cancel.
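The flow described above can be sketched roughly as follows. This is a simplified, self-contained model and not the Gecko code: `gExpectingShutdown`, `ContextState` and `WatchdogTick` are hypothetical stand-ins for the flag behind ProcessChild::ExpectingShutdown(), the per-context bookkeeping and one WatchdogMain wakeup, and the sketch assumes that a callback returning false is what tells the engine to cancel the script.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the flag behind ProcessChild::ExpectingShutdown().
static std::atomic<bool> gExpectingShutdown{false};

// Hypothetical model of what the watchdog knows about one JS context.
struct ContextState {
  bool runningJS = false;
  Clock::time_point jsStart{};
  bool interruptRequested = false;
};

// Models the child receiving the impending-shutdown notification.
void NotifiedImpendingShutdown() { gExpectingShutdown.store(true); }

// Models one watchdog wakeup. Does nothing unless shutdown is expected;
// otherwise requests an interrupt for every context that has been running
// JS for more than one second. Returns the number of interrupts requested.
int WatchdogTick(std::vector<ContextState>& contexts, Clock::time_point now) {
  if (!gExpectingShutdown.load()) {
    return 0;
  }
  int requested = 0;
  for (auto& cx : contexts) {
    // A running script cannot have started in the future.
    assert(!cx.runningJS || now >= cx.jsStart);
    if (cx.runningJS && now - cx.jsStart > std::chrono::seconds(1)) {
      cx.interruptRequested = true;
      ++requested;
    }
  }
  return requested;
}

// Models the interrupt callback: true lets the script continue, false
// cancels it unconditionally once shutdown is expected.
bool InterruptCallback() { return !gExpectingShutdown.load(); }
```

In this model the worst-case delay before cancellation is the watchdog's wakeup interval plus the one-second threshold, which is why the timeout values matter.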

What could be improved:

- There is no clear reason why extensions should shut down earlier than the content processes. IIUC, this only creates useless noise during extension shutdown. We might want to think about changing that order. And having timeout logic that spans more than one phase might even have been a way to paper over unresponsive content processes?
- There might be nicer ways of canceling than squeezing this into WatchdogMain, though it seemed to be the least invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: with the current values we can end up waiting up to three seconds before blocking JS is canceled. This is an eternity on a modern computer (consider that we issue a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang.
- I am puzzled by this false together with the previous true. IIUC, ForAllActiveContexts breaks its loop on false, so if the first context in the list has not been running for too long, we won't check the other contexts? If anything, I'd expect it to be the other way round, such that we ask for one interrupt at a time but check all contexts.
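To illustrate the last point, here is a small model of the two return-value conventions. It is only a sketch under the assumption stated above (the loop stops at the first callback that returns false); `ForAllContexts` and both check functions are hypothetical and operate on plain run times in seconds rather than on real contexts.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical iteration helper: stops as soon as the callback returns false.
void ForAllContexts(const std::vector<int>& runSeconds,
                    const std::function<bool(int)>& callback) {
  for (int secs : runSeconds) {
    if (!callback(secs)) {
      break;
    }
  }
}

struct CheckResult {
  int visited = 0;
  int interrupts = 0;
};

// Puzzling variant: false for a context under the limit, true after
// requesting an interrupt. A healthy first context ends the whole scan.
CheckResult CheckStopOnHealthy(const std::vector<int>& runSeconds, int limit) {
  assert(limit > 0);
  CheckResult r;
  ForAllContexts(runSeconds, [&](int secs) -> bool {
    ++r.visited;
    if (secs < limit) {
      return false;
    }
    ++r.interrupts;
    return true;
  });
  return r;
}

// Expected variant: keep scanning past healthy contexts and stop after
// the first interrupt request.
CheckResult CheckStopAfterInterrupt(const std::vector<int>& runSeconds,
                                    int limit) {
  assert(limit > 0);
  CheckResult r;
  ForAllContexts(runSeconds, [&](int secs) -> bool {
    ++r.visited;
    if (secs < limit) {
      return true;
    }
    ++r.interrupts;
    return false;
  });
  return r;
}
```

With run times {0, 5, 7} and a limit of 1, the first variant visits one context and requests no interrupt, while the second visits two and requests exactly one.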

Jens Stutte [:jstutte]

Assignee

Comment 4

•

2 years ago

•

Edited

https://treeherder.mozilla.org/logviewer?job_id=383593877&repo=try&lineNumber=8731-8740 suggests two things to me:

- When we have a mixed JS/C++ stack, as for AsyncShutdown, we seem not to reset the running timers at each language boundary, resulting in longer measured JS execution times (which is probably fine; a more interesting boundary might be if we spin the event loop in between).
- I should probably find a way to exclude calls from system code. I tried XPCJSContext::IsSystemCaller, but that seems to also render my test case moot. Not sure if it is just a problem of the testcase, though.

Jens Stutte [:jstutte]

Assignee

Comment 5

•

2 years ago

•

Edited

Update: What the patch does

- Add each ContentParent also as a blocker to phase quitApplicationGranted.
- On quitApplicationGranted, just do NotifyImpendingShutdown and remove that blocker (and thus wait for the next shutdown phase for the real shutdown).
- The child then receives NotifiedImpendingShutdown, sets an atomic flag accessible via ProcessChild::ExpectingShutdown() and calls the virtual NotifyImpendingShutdown on its instance.
- ContentProcess::NotifyImpendingShutdown looks at the pref dom.abort_script_on_child_shutdown and, if it is enabled, sets an appropriate crash annotation. TODO: We want to find a way to reduce the dom.max_script_run_time timeout here.
- XPCJSContext's WatchdogMain will then call HangMonitorChild::InterruptCallback, as before.
- HangMonitorChild::InterruptCallback looks at the pref dom.abort_script_on_child_shutdown and the ProcessChild::ExpectingShutdown() flag, checks whether we are running content JS and, if so, returns directly, signaling the JS engine to abort.
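The decision taken in the interrupt callback in this updated design can be summarized as a pure function. This is only a sketch: `ShutdownInputs`, `InterruptDecision` and `DecideOnInterrupt` are hypothetical names that do not exist in the tree, and the regular hang-monitor handling that runs when the script is not aborted is not modelled.

```cpp
// Hypothetical snapshot of the three inputs named in the list above.
struct ShutdownInputs {
  bool abortScriptOnChildShutdown;  // models dom.abort_script_on_child_shutdown
  bool expectingShutdown;           // models ProcessChild::ExpectingShutdown()
  bool runningContentJS;            // false when system (non-content) JS runs
};

enum class InterruptDecision {
  ContinueScript,  // fall through to the usual handling (not modelled here)
  AbortScript,     // return directly and let the JS engine cancel the script
};

// Abort only if the pref is enabled, shutdown is expected and the
// interrupted script is content JS.
constexpr InterruptDecision DecideOnInterrupt(const ShutdownInputs& in) {
  return (in.abortScriptOnChildShutdown && in.expectingShutdown &&
          in.runningContentJS)
             ? InterruptDecision::AbortScript
             : InterruptDecision::ContinueScript;
}

// Compile-time example: all three conditions hold, so the script is aborted.
static_assert(DecideOnInterrupt({true, true, true}) ==
                  InterruptDecision::AbortScript,
              "content JS is aborted when shutdown is expected");
```

Keeping the pref in the condition means the behavior can stay disabled on some channels while the rest of the notification plumbing is already in place.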

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → jstutte

Attachment #9283353 - Attachment description: WIP: Bug 1777198 - Have a long running content JS shutdown hang test. → Bug 1777198 - Have a long running content JS shutdown hang test. r?smaug

Status: NEW → ASSIGNED

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. → Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. r?smaug,nika

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. r?smaug,nika → WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown.

Jens Stutte [:jstutte]

Assignee

Comment 6

•

2 years ago

Attached file Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. r?smaug — Details

Depends on D150598

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel JS execution on NotifiedImpendingShutdown. → WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted.

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted. → WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown.

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: WIP: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. → Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. r?smaug

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: Bug 1777198 - Cancel content JS execution on quit-application-granted or normal content process shutdown. r?smaug → Bug 1777198 - Cancel content JS execution on quit-application-granted or potential content process shutdown. r?smaug

Phabricator Automation

Updated

•

2 years ago

Attachment #9283435 - Attachment description: Bug 1777198 - Cancel content JS execution on quit-application-granted or potential content process shutdown. r?smaug → Bug 1777198 - Cancel content JS execution on quit-application-granted or on normal content process shutdown. r?smaug

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Keywords: leave-open

Jens Stutte [:jstutte]

Assignee

Comment 7

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #3)

> What could be improved:
>
> There is no clear reason why extensions should shut down earlier than the content processes. IIUC, this only creates useless noise during extension shutdown. We might want to think about changing that order. And having timeout logic that spans more than one phase might even have been a way to paper over unresponsive content processes?

I filed bug 1779969 for further investigations.

> There might be nicer ways of canceling than squeezing this into WatchdogMain, though it seemed to be the least invasive way to do this. If we keep this, we might want to tweak the timeouts a bit: with the current values we can end up waiting up to three seconds before blocking JS is canceled. This is an eternity on a modern computer (consider that we issue a shutdown kill after only 5 seconds). So the watchdog could check more frequently and/or we could consider a shorter running time as a hang.

We actually handle this better now and request an interrupt immediately, thus not relying on the timeouts of the Watchdog.

> I am puzzled by this false together with the previous true. IIUC, ForAllActiveContexts breaks its loop on false, so if the first context in the list has not been running for too long, we won't check the other contexts? If anything, I'd expect it to be the other way round, such that we ask for one interrupt at a time but check all contexts.

I filed bug 1778696 for this.

Jens Stutte [:jstutte]

Assignee

Comment 8

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/05afbf32acea
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/1bf944a0828d
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug

Sandor Molnar[:smolnar]

Comment 9

•

2 years ago

Backed out for causing leakcheck failures

Backout link: https://hg.mozilla.org/integration/autoland/rev/522a729cf9c677a19622025c14028eaa805e952b

Push with failures

Failure log

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 10

•

2 years ago

•

Edited

So I triggered a pernosco session and the failure also reproduces there. I see that StaticPrefs::dom_abort_script_on_child_shutdown is apparently set in that run, and in that task log I see an execution of our test.

I assume the pref just remains set after test runs, which is not what we wanted. I see no possibility to define prefs inside the mochitest.ini, but we have this for crashtests. So we probably want to transform that test into a crashtest.

Scratch that, not sure what I was looking at the other week, but AFAICS my test is just leaking.

Flags: needinfo?(jstutte)

Pulsebot

Comment 11

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0b9cb5b44360
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/472fe2d7af01
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug

Jens Stutte [:jstutte]

Assignee

Comment 12

•

2 years ago

For the record: if you ever want to do a complete shutdown in a test, you will probably want to use a marionette test like this; our other test harnesses are not really prepared to see this without alarm. Fortunately, here we can just rely on shutting down the content process only.

Sandor Molnar[:smolnar]

Comment 13

•

2 years ago

Backed out 2 changesets (bug 1777198) for causing build bustage in dom/ipc/ProcessHangMonitor.cpp

Backout link: https://hg.mozilla.org/integration/autoland/rev/f788858ac268c25b4bc573d4a2642df44af22daa

Push with failures

Failure log

 ERROR -  /builds/worker/checkouts/gecko/dom/ipc/ProcessHangMonitor.cpp:907:6: error: 'void {anonymous}::HangMonitorParent::RequestContentJSInterrupt()' defined but not used [-Werror=unused-function]

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 14

•

2 years ago

Oops, there was an unused function, sorry. Also updating the test for bug 1782684 and bug 1782718.

Flags: needinfo?(jstutte)

Pulsebot

Comment 15

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4fbf0aea0b4a
Have a long running content JS shutdown hang test. r=smaug
https://hg.mozilla.org/integration/autoland/rev/359b2b2e2755
Cancel content JS execution on quit-application-granted or on normal content process shutdown. r=smaug

Cosmin Sabou [:CosminS]

Updated

•

2 years ago

Regressions: 1782718

Marian-Vasile Laza

Comment 16

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4fbf0aea0b4a
https://hg.mozilla.org/mozilla-central/rev/359b2b2e2755

Phabricator Automation

Updated

•

2 years ago

Attachment #9285479 - Attachment description: WIP: Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. → Bug 1777198 - Enable dom.abort_script_on_child_shutdown in nightly. r?smaug

Pulsebot

Comment 17

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/39ece1717b90
Enable dom.abort_script_on_child_shutdown in nightly. r=smaug

Sandor Molnar[:smolnar]

Comment 18

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/39ece1717b90

Jens Stutte [:jstutte]

Assignee

Comment 19

•

2 years ago

Attached file Bug 1777198 - Improve IPCShutdownState annotation. r?gsvelto — Details

Phabricator Automation

Updated

•

2 years ago

Attachment #9291968 - Attachment description: Bug 1777198 - Improve IPCShutdownState annotation. r?smaug → Bug 1777198 - Improve IPCShutdownState annotation. r?gsvelto

Pulsebot

Comment 20

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cebcfdd9245
Improve IPCShutdownState annotation. r=gsvelto

Sandor Molnar[:smolnar]

Comment 21

•

2 years ago

Backed out changeset 6cebcfdd9245 (bug 1777198) for causing build bustages.

Backout link: https://hg.mozilla.org/integration/autoland/rev/c15c974895094711f51f63923ce7b39e9a26c6d2

Push with failures

Failure log

lld-link: error: undefined symbol: enum nsresult __cdecl CrashReporter::AppendToCrashReportAnnotation(enum CrashReporter::Annotation, class nsTSubstring<char> const &)

Flags: needinfo?(jstutte)

Pulsebot

Comment 22

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/546cb58a2459
Improve IPCShutdownState annotation. r=gsvelto

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Flags: needinfo?(jstutte)

Sandor Molnar[:smolnar]

Comment 23

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/546cb58a2459

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Blocks: 1789231

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

No longer blocks: IPCError_ShutDownKill

Jens Stutte [:jstutte]

Assignee

Comment 24

•

2 years ago

Just as a reminder: this bug is still open because we have not yet flipped dom.abort_script_on_child_shutdown for release.

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Blocks: 1813602

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Target Milestone: --- → 106 Branch

BugBot (nomail) [:suhaib / :marco/ :calixte]

Updated

•

2 years ago

Keywords: leave-open
