Closed Bug 1689953 Opened 4 years ago Closed 4 years ago

Harmonize shutdown phase definitions across nsTerminator and AppShutdown

Categories

(Toolkit :: Async Tooling, task, P1)

task

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox88 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Since the histograms have been defined, some more phases have been explicitly added to the shutdown watchdog.

Indeed, since Dec. 7 2020 we see a huge drop of shutdown_phase_duration_ticks_profile_before_change in beta telemetry, as https://phabricator.services.mozilla.com/D99430 has been uplifted to beta.

Probably because the time is now spent in profile-before-change-qm.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Severity: -- → N/A
Priority: -- → P1

Hi Doug, I was looking at the MOZ_CRASH_REASON of some network related hangs and I noticed, that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", though the stack clearly indicates that we arrived from here, such that I would have expected to see "profile-change-net-teardown" in the message.

Andrew was suggesting, that the order observers are registered can influence the order they are called and that maybe there is hence some ordering problem that can make us notify the watchdog later than expected? If so, this probably deserves an own bug to be resolved somehow?

Flags: needinfo?(dothayer)

(In reply to Jens Stutte [:jstutte] from comment #2)

Hi Doug, I was looking at the MOZ_CRASH_REASON of some network related hangs and I noticed, that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", though the stack clearly indicates that we arrived from here, such that I would have expected to see "profile-change-net-teardown" in the message.

Andrew was suggesting, that the order observers are registered can influence the order they are called and that maybe there is hence some ordering problem that can make us notify the watchdog later than expected? If so, this probably deserves an own bug to be resolved somehow?

Yeah I think that message is pretty misleading. All of the observers for a notification are called in the reverse order that they were added, and nsTerminator just uses an observer to update the current phase, so it probably lands typically in the middle or toward the end of the bunch. The stack should always be enough to figure out what phase we're actually hanging in though.

This does present a problem though with incorrect allocation of time given to any one particular phase. I think the best solution here would be to marry the shutdown steps in nsTerminator with those in ShutdownPhase, and have all of the observer notifications just be called from a wrapper in AppShutdown, which first notifies nsTerminator about the start of the phase, and then calls the observer notification. Thoughts?

In any case, regarding this bug - did you see my question in bug 1690096? I'm wondering if we should just rip out this telemetry instead of landing this bug.

Flags: needinfo?(dothayer)

See bug 1690096 comment 2 where :asuth gives a solid explanation of what is going on.

And no, I'd prefer to keep this telemetry if we are able to make it reliable (at least until we understood, it is reliable) because it indicates, that also the phase reporting at shutdown hang will be reliable.

See Also: → 1690096

(In reply to Doug Thayer [:dthayer] (he/him) from comment #3)

The stack should always be enough to figure out what phase we're actually hanging in though.

Well, it would definitely help to have "correct" messages to do better selections on the crash data without the need to look at the stack.

I think the best solution here would be to marry the shutdown steps in nsTerminator with those in ShutdownPhase,

I am not sure that I understand what "marry" means here (due to a lack of context on my side, probably), from the names I see not much congruence between the currently notified phases and those linked above.

and have all of the observer notifications just be called from a wrapper in AppShutdown, which first notifies nsTerminator about the start of the phase, and then calls the observer notification. Thoughts?

That sounds similar to what I would have imagined, that is to not call nsTerminator through an observer but explicitly before any call to NotifyObservers during shutdown - which can be made nicer by a wrapper, of course. And AppShutdown looks like a good place for such a wrapper, indeed.

I assume, that this can give us also exact measurements of duration for the telemetry, see bug 1690096?

(In reply to Jens Stutte [:jstutte] from comment #5)

I assume, that this can give us also exact measurements of duration for the telemetry, see bug 1690096?

Talking with Andrew, he reminded me that we also need to ensure that telemetry flushes its data to disk before the application quits. So having explicit calls to switch through the phases enables us to have correct error messages in case of hangs and exact measures in our telemetry that we need to ensure to write to disk in time before exit.

Attachment #9200380 - Attachment description: Bug 1689953: Sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Ensure terminator observer is called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer

As a first step I wanted to introduce a wrapper function in AppShutdown that ensures the terminator observers are called first. Before I continue, I'd like to know, if this is a meaningful first step.

The next step would be to ensure we have telemetry for phases that are shorter than a heartbeat (currently 1s).

Flags: needinfo?(dothayer)
Attachment #9200380 - Attachment description: Bug 1689953: Ensure terminator observer is called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Reduce heartbeat interval, ensure terminator observers are called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer

:dthayer - Thanks for the pointers in the patch's comments, that were the pieces of information I was missing! I'll need some time to process them, though.

Flags: needinfo?(dothayer)
Attachment #9200380 - Attachment description: Bug 1689953: Reduce heartbeat interval, ensure terminator observers are called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Harmonize shutdown phase definitions across nsTerminator and AppShutdown r?dthayer
Summary: Sync shutdown telemetry with all phases now defined in the watchdog → Harmonize shutdown phase definitions across nsTerminator and AppShutdown

:dthayer - the patch is now hopefully reflecting the expected changes. However, removing the nsIObserver aspect from nsTerminator breaks test_terminator_record.js and test_crash_terminator.js, as the terminator cannot even be instantiated any more from JS.

Looking for other test strategies I also had a look at AppShutdown itself, but it seems there is no dedicated coverage for AppShutdown::MaybeFastShutdown at all, just accidental shutdown runs during other testing.

Any thoughts how we can/should harmonize also the testing? Thank you!

Flags: needinfo?(dothayer)
See Also: → 1678330

I re-established nsIObserver for testing purposes.

Flags: needinfo?(dothayer)
See Also: → 1693965
Blocks: 1693966
Attachment #9205645 - Flags: data-review?(chutten)

Comment on attachment 9205645 [details]
DataReviewRequestD103626.txt

PRELIMINARY NOTE:

Please update the new and changed Histograms to have alert_emails and bug_numbers fields. The former is how we represent who is responsible for permanent collections these days, so is especially important from the POV of Data Stewardship.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte and Doug Thayer are responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9205645 - Flags: data-review?(chutten) → data-review+
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/be43a81b35f9 Harmonize shutdown phase definitions across nsTerminator and AppShutdown r=dthayer,chutten

It seems, test_terminator_record.js was overly optimistic about the accuracy of 100ms ticks to be 100ms wallclock and for some reason under OS X this hit us.

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/541363348e76 Harmonize shutdown phase definitions across nsTerminator and AppShutdown r=dthayer,chutten
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch
No longer regressions: 1695504
Regressions: 1695863
Regressions: 1696408
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: