Harmonize shutdown phase definitions across nsTerminator and AppShutdown
Categories
(Toolkit :: Async Tooling, task, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox88 | --- | fixed |
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
2.72 KB,
text/plain
|
chutten
:
data-review+
|
Details |
Since the histograms have been defined, some more phases have been explicitly added to the shutdown watchdog.
Indeed, since Dec. 7 2020 we see a huge drop of shutdown_phase_duration_ticks_profile_before_change in beta telemetry, as https://phabricator.services.mozilla.com/D99430 has been uplifted to beta.
Probably because the time is now spent in profile-before-change-qm.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 1•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 2•4 years ago
|
||
Hi Doug, I was looking at the MOZ_CRASH_REASON of some network related hangs and I noticed, that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", though the stack clearly indicates that we arrived from here, such that I would have expected to see "profile-change-net-teardown" in the message.
Andrew was suggesting, that the order observers are registered can influence the order they are called and that maybe there is hence some ordering problem that can make us notify the watchdog later than expected? If so, this probably deserves an own bug to be resolved somehow?
Comment 3•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #2)
Hi Doug, I was looking at the MOZ_CRASH_REASON of some network related hangs and I noticed, that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", though the stack clearly indicates that we arrived from here, such that I would have expected to see "profile-change-net-teardown" in the message.
Andrew was suggesting, that the order observers are registered can influence the order they are called and that maybe there is hence some ordering problem that can make us notify the watchdog later than expected? If so, this probably deserves an own bug to be resolved somehow?
Yeah I think that message is pretty misleading. All of the observers for a notification are called in the reverse order that they were added, and nsTerminator
just uses an observer to update the current phase, so it probably lands typically in the middle or toward the end of the bunch. The stack should always be enough to figure out what phase we're actually hanging in though.
This does present a problem though with incorrect allocation of time given to any one particular phase. I think the best solution here would be to marry the shutdown steps in nsTerminator
with those in ShutdownPhase
, and have all of the observer notifications just be called from a wrapper in AppShutdown
, which first notifies nsTerminator
about the start of the phase, and then calls the observer notification. Thoughts?
In any case, regarding this bug - did you see my question in bug 1690096? I'm wondering if we should just rip out this telemetry instead of landing this bug.
Assignee | ||
Comment 4•4 years ago
|
||
See bug 1690096 comment 2 where :asuth gives a solid explanation of what is going on.
And no, I'd prefer to keep this telemetry if we are able to make it reliable (at least until we understood, it is reliable) because it indicates, that also the phase reporting at shutdown hang will be reliable.
Assignee | ||
Comment 5•4 years ago
|
||
(In reply to Doug Thayer [:dthayer] (he/him) from comment #3)
The stack should always be enough to figure out what phase we're actually hanging in though.
Well, it would definitely help to have "correct" messages to do better selections on the crash data without the need to look at the stack.
I think the best solution here would be to marry the shutdown steps in
nsTerminator
with those inShutdownPhase
,
I am not sure that I understand what "marry" means here (due to a lack of context on my side, probably), from the names I see not much congruence between the currently notified phases and those linked above.
and have all of the observer notifications just be called from a wrapper in
AppShutdown
, which first notifiesnsTerminator
about the start of the phase, and then calls the observer notification. Thoughts?
That sounds similar to what I would have imagined, that is to not call nsTerminator
through an observer but explicitly before any call to NotifyObservers
during shutdown - which can be made nicer by a wrapper, of course. And AppShutdown
looks like a good place for such a wrapper, indeed.
I assume, that this can give us also exact measurements of duration for the telemetry, see bug 1690096?
Assignee | ||
Comment 6•4 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #5)
I assume, that this can give us also exact measurements of duration for the telemetry, see bug 1690096?
Talking with Andrew, he reminded me that we also need to ensure that telemetry flushes its data to disk before the application quits. So having explicit calls to switch through the phases enables us to have correct error messages in case of hangs and exact measures in our telemetry that we need to ensure to write to disk in time before exit.
Updated•4 years ago
|
Assignee | ||
Comment 7•4 years ago
•
|
||
As a first step I wanted to introduce a wrapper function in AppShutdown
that ensures the terminator observers are called first. Before I continue, I'd like to know, if this is a meaningful first step.
The next step would be to ensure we have telemetry for phases that are shorter than a heartbeat (currently 1s).
Updated•4 years ago
|
Assignee | ||
Comment 8•4 years ago
|
||
:dthayer - Thanks for the pointers in the patch's comments, that were the pieces of information I was missing! I'll need some time to process them, though.
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 9•4 years ago
|
||
:dthayer - the patch is now hopefully reflecting the expected changes. However, removing the nsIObserver
aspect from nsTerminator
breaks test_terminator_record.js and test_crash_terminator.js, as the terminator cannot even be instantiated any more from JS.
Looking for other test strategies I also had a look at AppShutdown
itself, but it seems there is no dedicated coverage for AppShutdown::MaybeFastShutdown
at all, just accidental shutdown runs during other testing.
Any thoughts how we can/should harmonize also the testing? Thank you!
Assignee | ||
Comment 10•4 years ago
|
||
I re-established nsIObserver
for testing purposes.
Assignee | ||
Comment 11•4 years ago
|
||
Comment 12•4 years ago
|
||
Comment on attachment 9205645 [details]
DataReviewRequestD103626.txt
PRELIMINARY NOTE:
Please update the new and changed Histograms to have alert_emails and bug_numbers fields. The former is how we represent who is responsible for permanent collections these days, so is especially important from the POV of Data Stewardship.
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, Jens Stutte and Doug Thayer are responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
No. This collection is permanent.
Result: datareview+
Comment 13•4 years ago
|
||
Comment 14•4 years ago
|
||
Backed out for causing failure at test_terminator_record.js.
Backout link: https://hg.mozilla.org/integration/autoland/rev/daaaadc0b7bca11a12e276e0652fe256462a527c
Failure log: https://treeherder.mozilla.org/logviewer?job_id=331433093&repo=autoland&lineNumber=5529
Assignee | ||
Comment 15•4 years ago
|
||
It seems, test_terminator_record.js was overly optimistic about the accuracy of 100ms ticks to be 100ms wallclock and for some reason under OS X this hit us.
Comment 16•4 years ago
|
||
Comment 17•4 years ago
|
||
bugherder |
Description
•