Investigate impact of reducing nsTerminator's crash timeout
Categories
(Core :: XPCOM, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox77 | --- | fixed |
People
(Reporter: alexical, Assigned: emmamalysz)
References
(Blocks 1 open bug)
Details
(Whiteboard: [fxperf:p2])
Attachments
(4 files, 3 obsolete files)
Currently, when shutting down, we configure a watchdog thread to crash the browser after 60s if it hasn't shut down in that time (this is controlled via the toolkit.asyncshutdown.crash_timeout
pref). I suspect the spirit of this watchdog is that if we've taken a full minute to shutdown, it's very likely that something is hung in an unexpected way from which we may never recover.
I think this is a good intuition, but it can be broken down further to hopefully crash the browser earlier, without crashing the browser more frequently. I.e., while maybe it's reasonable for a shutdown to take 30s overall (this is a bit of a stretch), it shouldn't be reasonable for the profile-before-change
notification during shutdown to take 30s. So if we take more than, say, 5s to process the profile-before-change
notification, we crash.
A possible breakdown might be something like:
quit-application-granted
- profile-change-teardown
: 20s timeout
profile-change-teardown
- xpcom-will-shutdown
: 20s timeout
xpcom-will-shutdown
- xpcom-shutdown-threads
: 20s timeout
This would leave the overall timeout at 60s, but we would only in extremely rare circumstances actually take a full 60s to identify a hung shutdown, making it less likely we will, say, severely impact a user's OS shutdown time.
Reporter | ||
Updated•5 years ago
|
Comment 1•5 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #0)
profile-change-teardown
-xpcom-will-shutdown
: 20s timeout
Here is a normal shutdown profile on the 2018 reference hardware, where this part of shutdown took 17s: https://perfht.ml/2OH9HUc
I think we could reset the 20s timeout whenever a new shutdown notification is received. Ie. profile-change-teardown would start a 20s timeout, profile-before-change would restart it, profile-before-change-qm and profile-before-change-telemetry too.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 2•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Comment 3•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 4•5 years ago
|
||
Comment 5•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 6•5 years ago
|
||
Comment 7•5 years ago
|
||
Comment 9•5 years ago
•
|
||
Backed out changeset d388b5f84158 (Bug 1610134) for causing bustages in LateWriteChecks.cpp CLOSED TREE
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=d388b5f8415869fee5096752161d3fafeeed921c&selectedJob=298539467
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298539483&repo=autoland&lineNumber=16776
Backout: https://hg.mozilla.org/integration/autoland/rev/be09f2507b41535ada22544ba205cd135a6fc678
Comment 10•5 years ago
|
||
Comment 11•5 years ago
|
||
bugherder |
Assignee | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 12•5 years ago
•
|
||
On 05/04/2020, there were 602 main crashes recorded (https://sql.telemetry.mozilla.org/queries/70583). That same day we recorded ~560 late writes from nsTerminator. Total moz crashes that day for shutdown hangs was 76. I think we should try moving this preference up to 40s to avoid a massive increase in crash reports that would accompany Bug 1631812.
Assignee | ||
Comment 13•5 years ago
|
||
Increasing toolkit.asyncshutdown.report_writes_after to 40s to advance this investigation.
Updated•5 years ago
|
Comment 14•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Comment 15•5 years ago
|
||
bugherder |
Assignee | ||
Comment 16•5 years ago
|
||
After investigating the potential to reduce the nsTerminator's crash timeout from
1 min, to 20s, and then finally 40s, we have decided to this does not provide
significant gains to justify increasing the amount of shutdown hang crashes
and potential to lose data. We should maintain the crash timeout at 1 min.
Assignee | ||
Comment 17•5 years ago
|
||
We have decided that the best course of action is to maintain the crash timeout at 60s.
When the pref was at 20s, 2,910 late write reports were reported on nightly as originating from nsTerminator ("is_from_terminator_watchdog":1
). We moved the pref up to 40s because this would have led to a very high increase in shutdown hang crash reports. At 40s, a week reported 1,207 late write reports in nightly (https://sql.telemetry.mozilla.org/queries/70889). Looking at that same week, there were only 530 shutdown hangs from moz crash reports.
So there's somewhat of a dead end. We could decrease the preference and accept a large increase in the amount of shutdown hangs in exchange for slightly faster shutdown, but that means we would be crashing over twice as much and potentially losing important data that may not be available yet. Ultimately, it does not seem to provide enough of a win to warrant changing it.
Assignee | ||
Comment 18•5 years ago
|
||
Updated•5 years ago
|
Comment 19•5 years ago
|
||
Comment 20•5 years ago
|
||
bugherder |
Assignee | ||
Updated•5 years ago
|
Description
•