1689953 - Harmonize shutdown phase definitions across nsTerminator and AppShutdown

Hi Doug, I was looking at the MOZ_CRASH_REASON of some network-related hangs and noticed that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", even though the stack clearly indicates that we arrived from here, so I would have expected to see "profile-change-net-teardown" in the message.

Andrew suggested that the order in which observers are registered can influence the order in which they are called, so maybe some ordering problem makes us notify the watchdog later than expected? If so, this probably deserves its own bug.

Flags: needinfo?(dothayer)

Alex Thayer [:alexical] (she/her)

Comment 3

5 years ago

(In reply to Jens Stutte [:jstutte] from comment #2)

> Hi Doug, I was looking at the MOZ_CRASH_REASON of some network-related hangs and noticed that in many cases the message is "MOZ_CRASH(Shutdown hanging before starting any known phase.)" and "Shutdown hanging at step quit-application. Something is blocking the main-thread.", even though the stack clearly indicates that we arrived from here, so I would have expected to see "profile-change-net-teardown" in the message.
>
> Andrew suggested that the order in which observers are registered can influence the order in which they are called, so maybe some ordering problem makes us notify the watchdog later than expected? If so, this probably deserves its own bug.

Yeah, I think that message is pretty misleading. All of the observers for a notification are called in the reverse of the order in which they were added, and nsTerminator just uses an observer to update the current phase, so it typically lands in the middle or toward the end of the bunch. The stack should always be enough to figure out what phase we're actually hanging in, though.
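The ordering effect described here can be illustrated with a minimal sketch. It uses a toy observer list, not the real observer service, and every name in it is hypothetical. It only assumes what the comment states: observers run in reverse registration order, and the watchdog learns about the phase through an ordinary observer.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an observer list. This is NOT the real observer service.
class ToyObserverList {
 public:
  using Callback = std::function<void(const std::string&)>;

  void AddObserver(Callback aCallback) {
    mObservers.push_back(std::move(aCallback));
  }

  // Notify in reverse registration order, as described in the comment above.
  void Notify(const std::string& aTopic) const {
    for (auto it = mObservers.rbegin(); it != mObservers.rend(); ++it) {
      (*it)(aTopic);
    }
  }

 private:
  std::vector<Callback> mObservers;
};

// Returns the phase the toy watchdog had recorded at the moment a second,
// unrelated observer (think: networking teardown) ran for aTopic.
std::string PhaseSeenByOtherObserver(bool aWatchdogRegisteredFirst,
                                     const std::string& aTopic) {
  assert(!aTopic.empty());
  std::string recordedPhase = "(none)";
  std::string seen;
  ToyObserverList list;
  ToyObserverList::Callback watchdog = [&](const std::string& aPhase) {
    recordedPhase = aPhase;
  };
  ToyObserverList::Callback other = [&](const std::string&) {
    seen = recordedPhase;
  };
  if (aWatchdogRegisteredFirst) {
    list.AddObserver(watchdog);
    list.AddObserver(other);
  } else {
    list.AddObserver(other);
    list.AddObserver(watchdog);
  }
  list.Notify(aTopic);
  return seen;
}
```

If the watchdog observer was registered first, it runs last, so an observer that hangs before it would be reported under the previous phase (or under no phase at all).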

This does present a problem, though: the time attributed to any one particular phase is incorrect. I think the best solution here would be to marry the shutdown steps in nsTerminator with those in ShutdownPhase, and have all of the observer notifications just be called from a wrapper in AppShutdown, which first notifies nsTerminator about the start of the phase, and then calls the observer notification. Thoughts?

In any case, regarding this bug - did you see my question in bug 1690096? I'm wondering if we should just rip out this telemetry instead of landing this bug.

Flags: needinfo?(dothayer)

Jens Stutte [:jstutte]

Assignee

Comment 4

5 years ago

See bug 1690096 comment 2 where :asuth gives a solid explanation of what is going on.

And no, I'd prefer to keep this telemetry if we are able to make it reliable (at least until we have understood whether it is reliable), because that would indicate that the phase reporting for shutdown hangs is reliable as well.

Comment 5

5 years ago

(In reply to Doug Thayer [:dthayer] (he/him) from comment #3)

> The stack should always be enough to figure out what phase we're actually hanging in though.

Well, having "correct" messages would definitely help us select crash data more precisely without needing to look at the stack.

> I think the best solution here would be to marry the shutdown steps in nsTerminator with those in ShutdownPhase,

I am not sure I understand what "marry" means here (probably due to a lack of context on my side); judging from the names, I see little congruence between the currently notified phases and those linked above.

> and have all of the observer notifications just be called from a wrapper in AppShutdown, which first notifies nsTerminator about the start of the phase, and then calls the observer notification. Thoughts?

That sounds similar to what I had imagined: do not call nsTerminator through an observer, but call it explicitly before any call to NotifyObservers during shutdown, which can of course be made nicer by a wrapper. And AppShutdown does indeed look like a good place for such a wrapper.

I assume that this can also give us exact measurements of duration for the telemetry; see bug 1690096?
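The wrapper idea discussed in this comment could look roughly like the following. This is only a sketch under stated assumptions: the phase names, `ToyWatchdog`, `ToyAppShutdown` and `AdvanceShutdownPhase` are hypothetical stand-ins chosen for illustration, not the code in the patch.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical phases for illustration only; not the real ShutdownPhase enum.
enum class ToyPhase { NotInShutdown, QuitApplication, NetTeardown, Teardown };

// Toy watchdog that is told about phases directly instead of observing them.
class ToyWatchdog {
 public:
  void AdvancePhase(ToyPhase aPhase) { mCurrent = aPhase; }
  ToyPhase Current() const { return mCurrent; }

 private:
  ToyPhase mCurrent = ToyPhase::NotInShutdown;
};

class ToyAppShutdown {
 public:
  using Observer = std::function<void(ToyPhase)>;

  explicit ToyAppShutdown(ToyWatchdog& aWatchdog) : mWatchdog(aWatchdog) {}

  void AddObserver(Observer aObserver) {
    mObservers.push_back(std::move(aObserver));
  }

  // The wrapper: inform the watchdog first, then notify the observers.
  void AdvanceShutdownPhase(ToyPhase aPhase) {
    assert(aPhase > mWatchdog.Current());  // phases only move forward
    mWatchdog.AdvancePhase(aPhase);
    for (auto it = mObservers.rbegin(); it != mObservers.rend(); ++it) {
      (*it)(aPhase);
    }
  }

 private:
  ToyWatchdog& mWatchdog;
  std::vector<Observer> mObservers;
};

// Returns true if every observer saw the watchdog already in the right phase.
bool WatchdogIsAlwaysUpToDate() {
  ToyWatchdog watchdog;
  ToyAppShutdown shutdown(watchdog);
  bool upToDate = true;
  for (int i = 0; i < 3; ++i) {
    shutdown.AddObserver([&](ToyPhase aPhase) {
      upToDate = upToDate && watchdog.Current() == aPhase;
    });
  }
  shutdown.AdvanceShutdownPhase(ToyPhase::QuitApplication);
  shutdown.AdvanceShutdownPhase(ToyPhase::NetTeardown);
  return upToDate;
}
```

Because the watchdog is updated outside the observer list, its view of the phase no longer depends on registration order.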

Jens Stutte [:jstutte]

Assignee

Comment 6

5 years ago

Edited

(In reply to Jens Stutte [:jstutte] from comment #5)

> I assume that this can also give us exact measurements of duration for the telemetry; see bug 1690096?

Talking with Andrew, he reminded me that we also need to ensure that telemetry flushes its data to disk before the application quits. So explicit calls to switch through the phases give us correct error messages in case of hangs and exact measurements in our telemetry, which we then need to write to disk in time before exit.
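As a rough illustration of how explicit phase switches yield exact durations that can then be flushed before exit, here is a small sketch. `ToyPhaseTimer` and its output format are assumptions made for this example; the real telemetry code and its on-disk format are not shown in this bug.

```cpp
#include <chrono>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: records how long each phase lasted, measured at the
// explicit switch points rather than by counting heartbeat ticks.
class ToyPhaseTimer {
 public:
  using Clock = std::chrono::steady_clock;

  // Close the running phase (if any) and start timing aPhase.
  void SwitchTo(const std::string& aPhase,
                Clock::time_point aNow = Clock::now()) {
    Close(aNow);
    mCurrent = aPhase;
    mStart = aNow;
  }

  // Close the last phase and write all durations. This has to happen before
  // the process exits, otherwise the measurements are lost.
  void Flush(std::ostream& aOut, Clock::time_point aNow = Clock::now()) {
    Close(aNow);
    mCurrent.clear();
    for (const auto& entry : mDurationsMs) {
      aOut << entry.first << "=" << entry.second << "\n";
    }
    aOut.flush();
  }

 private:
  void Close(Clock::time_point aNow) {
    if (mCurrent.empty()) {
      return;
    }
    mDurationsMs[mCurrent] +=
        std::chrono::duration_cast<std::chrono::milliseconds>(aNow - mStart)
            .count();
  }

  std::string mCurrent;
  Clock::time_point mStart;
  std::map<std::string, long long> mDurationsMs;
};

// Deterministic demo with injected timestamps instead of the real clock.
std::string DemoFlush() {
  ToyPhaseTimer timer;
  const ToyPhaseTimer::Clock::time_point t0{};
  timer.SwitchTo("quit-application", t0);
  timer.SwitchTo("profile-change-net-teardown",
                 t0 + std::chrono::milliseconds(40));
  std::ostringstream out;
  timer.Flush(out, t0 + std::chrono::milliseconds(250));
  return out.str();
}
```

The timestamps are injectable so the behavior can be checked without sleeping; in the demo a 40 ms phase is recorded as 40 ms instead of being rounded to a tick count.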

Phabricator Automation

Updated

5 years ago

Attachment #9200380 - Attachment description: Bug 1689953: Sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Ensure terminator observer is called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer

Jens Stutte [:jstutte]

Assignee

Comment 7

5 years ago

Edited

As a first step I wanted to introduce a wrapper function in AppShutdown that ensures the terminator observers are called first. Before I continue, I'd like to know whether this is a meaningful first step.

The next step would be to ensure we have telemetry for phases that are shorter than a heartbeat (currently 1s).
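To see why phases shorter than a heartbeat are a problem, consider a simplified model in which the watchdog only counts whole heartbeat ticks per phase and the phase starts exactly on a tick. Both are assumptions for illustration, and `TicksObserved` is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>

// Number of whole heartbeats that fit into a phase. In this simplified model
// a phase shorter than one heartbeat is recorded as zero ticks.
long long TicksObserved(std::chrono::milliseconds aPhaseDuration,
                        std::chrono::milliseconds aHeartbeat) {
  assert(aHeartbeat.count() > 0);
  return aPhaseDuration / aHeartbeat;  // integer division of two durations
}
```

Under this model a 300 ms phase is invisible with a 1 s heartbeat and shows up as 3 ticks with a 100 ms heartbeat.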

Flags: needinfo?(dothayer)

Phabricator Automation

Updated

5 years ago

Attachment #9200380 - Attachment description: Bug 1689953: Ensure terminator observer is called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Reduce heartbeat interval, ensure terminator observers are called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer

Jens Stutte [:jstutte]

Assignee

Comment 8

5 years ago

:dthayer - Thanks for the pointers in the patch's comments; those were the pieces of information I was missing! I'll need some time to process them, though.

Flags: needinfo?(dothayer)

Phabricator Automation

Updated

4 years ago

Attachment #9200380 - Attachment description: Bug 1689953: Reduce heartbeat interval, ensure terminator observers are called first and sync shutdown telemetry with all phases now defined in the watchdog r?dthayer → Bug 1689953: Harmonize shutdown phase definitions across nsTerminator and AppShutdown r?dthayer

Jens Stutte [:jstutte]

Assignee

Updated

4 years ago

Summary: Sync shutdown telemetry with all phases now defined in the watchdog → Harmonize shutdown phase definitions across nsTerminator and AppShutdown

Jens Stutte [:jstutte]

Assignee

Comment 9

4 years ago

:dthayer - the patch now hopefully reflects the expected changes. However, removing the nsIObserver aspect from nsTerminator breaks test_terminator_record.js and test_crash_terminator.js, as the terminator cannot even be instantiated from JS any more.

Looking for other test strategies, I also had a look at AppShutdown itself, but it seems there is no dedicated coverage for AppShutdown::MaybeFastShutdown at all, just incidental shutdown runs during other testing.

Any thoughts on how we can/should also harmonize the testing? Thank you!

Flags: needinfo?(dothayer)

Jens Stutte [:jstutte]

Assignee

Comment 10

4 years ago

I re-established nsIObserver for testing purposes.

Flags: needinfo?(dothayer)

Jens Stutte [:jstutte]

Assignee

Updated

4 years ago

Blocks: 1693966

Jens Stutte [:jstutte]

Assignee

Comment 11

4 years ago

Attached file DataReviewRequestD103626.txt

Attachment #9205645 - Flags: data-review?(chutten)

Chris H-C :chutten

Comment 12

4 years ago

Comment on attachment 9205645 [details]
DataReviewRequestD103626.txt

PRELIMINARY NOTE:

Please update the new and changed Histograms to have alert_emails and bug_numbers fields. The former is how we represent who is responsible for permanent collections these days, so is especially important from the POV of Data Stewardship.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry, so it can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte and Doug Thayer are responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

Result: datareview+

Attachment #9205645 - Flags: data-review?(chutten) → data-review+

Pulsebot

Comment 13

4 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/be43a81b35f9
Harmonize shutdown phase definitions across nsTerminator and AppShutdown r=dthayer,chutten

Atila Butkovits

Comment 14

4 years ago

Backed out for causing failure at test_terminator_record.js.

Backout link: https://hg.mozilla.org/integration/autoland/rev/daaaadc0b7bca11a12e276e0652fe256462a527c

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=adanSQ0uSp2EiUTI7SFSUg.0&searchStr=os%2Cx%2C10.14%2Cwebrender%2Copt%2Cxpcshell%2Ctests%2Ctest-macosx1014-64-qr%2Fopt-xpcshell-e10s%2Cx2&revision=be43a81b35f9b237cb537ca8c740ec90d9e0f90e

Failure log: https://treeherder.mozilla.org/logviewer?job_id=331433093&repo=autoland&lineNumber=5529

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 15

4 years ago

It seems test_terminator_record.js was overly optimistic in assuming that a 100ms tick corresponds to exactly 100ms of wall-clock time, and for some reason this hit us on OS X.
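A more tolerant check could be sketched as follows. This is not the actual test fix, which is not shown in this bug; `TickCountIsPlausible` is a hypothetical helper. It assumes the ticks come from a thread that sleeps for one tick between increments, so on a loaded machine the tick count can lag behind wall-clock time but should not run ahead of it by more than one tick.

```cpp
#include <cassert>
#include <chrono>

// Assumption: ticks are produced by a thread that sleeps aTick between
// increments. Sleeps may overshoot, so fewer ticks than aWallClock / aTick
// are acceptable, while noticeably more are not.
bool TickCountIsPlausible(long long aTicks, std::chrono::milliseconds aTick,
                          std::chrono::milliseconds aWallClock) {
  assert(aTick.count() > 0);
  // Allow one tick of slack for rounding at the start and end of the interval.
  return aTicks >= 0 && aTicks * aTick <= aWallClock + aTick;
}
```

With 100 ms ticks and 1000 ms of wall-clock time, both 7 and 10 ticks pass, whereas 12 ticks would be rejected.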

Flags: needinfo?(jstutte)

Pulsebot

Comment 16

4 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/541363348e76
Harmonize shutdown phase definitions across nsTerminator and AppShutdown r=dthayer,chutten

Alexandru Michis [:malexandru]

Comment 17

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/541363348e76

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox88: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 88 Branch

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

4 years ago

Regressions: 1695447

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

4 years ago

Regressions: 1695504

Jens Stutte [:jstutte]

Assignee

Updated

4 years ago

No longer regressions: 1695504

Petr Sumbera

Updated

4 years ago

Regressions: 1695863

Jens Stutte [:jstutte]

Assignee

Updated

4 years ago

Regressions: 1696408

Attachments:
Bug 1689953: Harmonize shutdown phase definitions across nsTerminator and AppShutdown r?dthayer (5 years ago, Jens Stutte [:jstutte], 48 bytes, text/x-phabricator-request)
DataReviewRequestD103626.txt (4 years ago, Jens Stutte [:jstutte], 2.72 KB, text/plain, chutten: data-review+)