Closed Bug 1610134 Opened 6 months ago Closed 24 days ago

Investigate impact of reducing nsTerminator's crash timeout

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED WONTFIX
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: dthayer, Assigned: emalysz)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxperf:p2])

Attachments

(4 files, 3 obsolete files)

Currently, when shutting down, we configure a watchdog thread to crash the browser after 60s if it hasn't shut down in that time (this is controlled via the toolkit.asyncshutdown.crash_timeout pref). I suspect the spirit of this watchdog is that if we've taken a full minute to shutdown, it's very likely that something is hung in an unexpected way from which we may never recover.

I think this is a good intuition, but it can be broken down further to hopefully crash the browser earlier, without crashing the browser more frequently. I.e., while maybe it's reasonable for a shutdown to take 30s overall (this is a bit of a stretch), it shouldn't be reasonable for the profile-before-change notification during shutdown to take 30s. So if we take more than, say, 5s to process the profile-before-change notification, we crash.

A possible breakdown might be something like:

quit-application-granted - profile-change-teardown: 20s timeout
profile-change-teardown - xpcom-will-shutdown: 20s timeout
xpcom-will-shutdown - xpcom-shutdown-threads: 20s timeout

This would leave the overall timeout at 60s, but we would only in extremely rare circumstances actually take a full 60s to identify a hung shutdown, making it less likely we will, say, severely impact a user's OS shutdown time.

Whiteboard: [fxperf]

(In reply to Doug Thayer [:dthayer] from comment #0)

profile-change-teardown - xpcom-will-shutdown: 20s timeout

Here is a normal shutdown profile on the 2018 reference hardware, where this part of shutdown took 17s: https://perfht.ml/2OH9HUc

I think we could reset the 20s timeout whenever a new shutdown notification is received. Ie. profile-change-teardown would start a 20s timeout, profile-before-change would restart it, profile-before-change-qm and profile-before-change-telemetry too.

Whiteboard: [fxperf] → [fxperf:p2]
Assignee: nobody → emalysz
Attachment #9135142 - Attachment description: Bug 1610134: fixes timeout for nsTerminator to ensure we crash the browser earlier if necessary. → Bug 1610134: Part 1: add timeout pref that turns on late write checking to see if it's possible to crash browser earlier.
Attached file Data Review Form.txt (obsolete) —
Attachment #9140624 - Flags: data-review?(chutten)
Attachment #9140624 - Attachment is obsolete: true
Attachment #9140624 - Flags: data-review?(chutten)
Attached file Data Review Form (obsolete) —
Attachment #9140627 - Flags: data-review?(chutten)
Comment on attachment 9140627 [details]
Data Review Form

The form asks to collect the data for a week but the Scalar does not have an expiry version set. Which is the intended behaviour?
Flags: needinfo?(emalysz)
Attachment #9140627 - Flags: data-review?(chutten)
Attachment #9140627 - Attachment is obsolete: true
Flags: needinfo?(emalysz)
Attached file Data Review Form
Attachment #9141044 - Flags: data-review?(chutten)
Comment on attachment 9141044 [details]
Data Review Form

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry. Part of it is documented in its definitions file [Scalars.yaml](https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/telemetry/Scalars.yaml) and the [Probe Dictionary](https://telemetry.mozilla.org/probe-dictionary/). The other part is documented in the [in-tree documentation](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/data/main-ping.html).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Emma Malysz is responsible.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

    Is the data collection request for default-on or default-off?

Default on for all channels.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

---
Result: datareview+
Attachment #9141044 - Flags: data-review?(chutten) → data-review+
Pushed by nerli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d388b5f84158
Part 1: add timeout pref that turns on late write checking to see if it's possible to crash browser earlier. r=dthayer,chutten
Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9ccf9f680b64
Part 1: add timeout pref that turns on late write checking to see if it's possible to crash browser earlier. r=dthayer,chutten
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Flags: needinfo?(emalysz)
Summary: Break up the nsTerminator's watchdog into phases, each with their own alotted time → Investigate impact of reducing nsTerminator's crash timeout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

On 05/04/2020, there were 602 main crashes recorded (https://sql.telemetry.mozilla.org/queries/70583). That same day we recorded ~560 late writes from nsTerminator. Total moz crashes that day for shutdown hangs was 76. I think we should try moving this preference up to 40s to avoid a massive increase in crash reports that would accompany Bug 1631812.

Increasing toolkit.asyncshutdown.report_writes_after to 40s to advance this investigation.

Attachment #9146351 - Attachment description: Bug 1610134: Increase timeout pref that turns on late write checking → Bug 1610134: Part 2: Increase timeout pref that turns on late write checking
Pushed by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d8da20519c3c
Part 2: Increase timeout pref that turns on late write checking r=dthayer
Keywords: leave-open

After investigating the potential to reduce the nsTerminator's crash timeout from
1 min, to 20s, and then finally 40s, we have decided to this does not provide
significant gains to justify increasing the amount of shutdown hang crashes
and potential to lose data. We should maintain the crash timeout at 1 min.

We have decided that the best course of action is to maintain the crash timeout at 60s.

When the pref was at 20s, 2,910 late write reports were reported on nightly as originating from nsTerminator ("is_from_terminator_watchdog":1). We moved the pref up to 40s because this would have led to a very high increase in shutdown hang crash reports. At 40s, a week reported 1,207 late write reports in nightly (https://sql.telemetry.mozilla.org/queries/70889). Looking at that same week, there were only 530 shutdown hangs from moz crash reports.

So there's somewhat of a dead end. We could decrease the preference and accept a large increase in the amount of shutdown hangs in exchange for slightly faster shutdown, but that means we would be crashing over twice as much and potentially losing important data that may not be available yet. Ultimately, it does not seem to provide enough of a win to warrant changing it.

Attachment #9154314 - Attachment is obsolete: true
Pushed by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88216a6cb87b
revert late writes from nsTerminator. r=dthayer
Status: REOPENED → RESOLVED
Closed: 2 months ago24 days ago
Keywords: leave-open
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.