Closed Bug 1695863 Opened 4 years ago Closed 3 years ago

Hit MOZ_CRASH(Shutdown hanging at step profile-change-teardown. Something is blocking the main-thread.) at toolkit/components/terminator/nsTerminator.cpp:220

Categories

(Toolkit :: Startup and Profile System, defect)

defect

Tracking

()

RESOLVED FIXED
99 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- wontfix
firefox86 --- unaffected
firefox87 --- unaffected
firefox88 --- wontfix
firefox97 --- wontfix
firefox98 --- wontfix
firefox99 --- fixed

People

(Reporter: petr.sumbera, Assigned: petr.sumbera)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0

Steps to reproduce:

I can see this on Solaris when executing firefox --screenshot about:blank

Screenshot saved to: /builds/psumbera/mozilla-central-build/screenshot.png
Hit MOZ_CRASH(Shutdown hanging at step profile-change-teardown. Something is blocking the main-thread.) at /builds/psumbera/mozilla-central-build/toolkit/components/terminator/nsTerminator.cpp:220
#01: mozilla::(anonymous namespace)::RunWatchdog(void*)[/builds/psumbera/mozilla-central-build/INSTALLED/lib/firefox/libxul.so +0x1b880b69]
#02: _pt_root[/builds/psumbera/mozilla-central-build/INSTALLED/lib/firefox/libnspr4.so +0x4a298]
#03: _thrp_setup[/lib/amd64/libc.so.1 +0x271863]
#04: _lwp_start[/lib/amd64/libc.so.1 +0x271b60]

The first bad revision is:
changeset: 569009:541363348e76
user: Jens Stutte <jstutte@mozilla.com>
date: Sat Feb 27 06:38:14 2021 -0800
description:
Bug 1689953: Harmonize shutdown phase definitions across nsTerminator and AppShutdown r=dthayer,chutten

This patch wants to solve several quirks around the shutdown terminator.

 - Use the same shutdown phase definitions in AppShutdown and nsTerminator. This touches quite a few files.
 - Ensure that the terminator phase shift is handled before any shutdown observer notifications are sent and reduce its heartbeat duration.
 - Add missing phases to the shutdown telemetry.

Please note that this changes the unit of "tick" to 100ms rather than 1s.
As a side effect, we also remove the obsolete "shutdown-persist" context.

While the existing test coverage continues to prove the most important functions, we acknowledge the wish for better test coverage with [[ https://bugzilla.mozilla.org/show_bug.cgi?id=1693966 | bug 1693966 ]].

Differential Revision: https://phabricator.services.mozilla.com/D103626
Keywords: regression
Regressed by: 1689953
Has Regression Range: --- → yes

The Bugbug bot thinks this bug should belong to the 'Toolkit::Startup and Profile System' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Startup and Profile System
Product: Firefox → Toolkit

Hi Petr, thanks for reporting!

Does this reproduce always? Under linux I cannot see this happen (though I see other probably unwanted errors, unrelated to bug 1689953).

Usually shutdown hangs are intermittent, but having a 100% reproducing one could give us some insights. The patch you marked as regressor is mostly a diagnostic patch that makes reporting better around shutdown hangs, so I think it is not very likely to be the cause, but just the symptom.

Flags: needinfo?(petr.sumbera)

I think It happens always. But I have just scratched workspace and building it with HEARTBEAT_INTERVAL_MS set to 1000.

Flags: needinfo?(petr.sumbera)

I can confirm that following:

--- a/toolkit/components/terminator/nsTerminator.cpp    Mon Mar 01 16:15:30 2021 +0100
+++ b/toolkit/components/terminator/nsTerminator.cpp    Tue Mar 02 10:21:36 2021 +0100
@@ -66,7 +66,7 @@
 // forcefully.
 #define ADDITIONAL_WAIT_BEFORE_CRASH_MS 3000

-#define HEARTBEAT_INTERVAL_MS 100
+#define HEARTBEAT_INTERVAL_MS 1000

 namespace mozilla {

fixed the issue.

Note that debug build on Solaris seems to be generally pretty slow. But I have no comparison to Linux.

(In reply to Petr Sumbera from comment #4)

I can confirm that following:

--- a/toolkit/components/terminator/nsTerminator.cpp    Mon Mar 01 16:15:30 2021 +0100
+++ b/toolkit/components/terminator/nsTerminator.cpp    Tue Mar 02 10:21:36 2021 +0100
@@ -66,7 +66,7 @@
 // forcefully.
 #define ADDITIONAL_WAIT_BEFORE_CRASH_MS 3000

-#define HEARTBEAT_INTERVAL_MS 100
+#define HEARTBEAT_INTERVAL_MS 1000

 namespace mozilla {

fixed the issue.

Note that debug build on Solaris seems to be generally pretty slow. But I have no comparison to Linux.

Without changing toolkit.asyncshutdown.crash_timeout preference? Looking at how we calculate ticks, your change above should in average just add 500ms, max. 900ms. Which might be enough to solve it on your specific machine, though.

(In reply to Jens Stutte [:jstutte] from comment #5)

Without changing toolkit.asyncshutdown.crash_timeout preference?

Yes. No other change.

You might want to try what happens, if you revert your change but increase the toolkit.asyncshutdown.crash_timeout preference. I assume that solves the problem, too (which would probably mean, we should not care too much here, as this is a debug build).

Set release status flags based on info from the regressing bug 1689953

I have tried to change 60 to 600 here:
https://searchfox.org/mozilla-central/source/toolkit/components/asyncshutdown/AsyncShutdown.jsm#87

But it makes no difference. It still crashes. Any idea?

Please go to about:config and change toolkit.asyncshutdown.crash_timeout there, the pref might ovrride your default from a saved profile.

(In reply to Jens Stutte [:jstutte] from comment #10)

Please go to about:config and change toolkit.asyncshutdown.crash_timeout there, the pref might ovrride your default from a saved profile.

Well, I run firefox --screenshot about:blank at the end of my regular automated Firefox build procedure. And I do start it with new HOME (no existing profile). So there is no profile at the beginning!?

Ok, I tried to set toolkit.asyncshutdown.crash_timeout in about:config and finally it helped. But 600000 wasn't enough (as I already tested above). So after changing it to 6000000 it exits without the issue.

But still don't understand what's going here..

Severity: -- → S4

The main problem seems to be here:

https://searchfox.org/mozilla-central/rev/ad7ecfa618ec3a65db8405d9f1125059fe4a6a15/toolkit/components/terminator/nsTerminator.cpp#472

  const PRIntervalTime ticksDuration =
      PR_MillisecondsToInterval(HEARTBEAT_INTERVAL_MS);                // This returns on Solaris 10000
  options->crashAfterTicks = crashAfterMS / ticksDuration;             // 60000 / 10000 = 6!

All system do sleep between hearbeats just 0.1s

https://searchfox.org/mozilla-central/rev/ad7ecfa618ec3a65db8405d9f1125059fe4a6a15/toolkit/components/terminator/nsTerminator.cpp#191

So on Solaris it just waits for 0.6s instead of 60s.

I think it shouldn't use PR_MillisecondsToInterval() at all and instead do just:

options->crashAfterTicks = crashAfterMS / HEARTBEAT_INTERVAL_MS;

https://searchfox.org/mozilla-central/rev/ad7ecfa618ec3a65db8405d9f1125059fe4a6a15/toolkit/components/terminator/nsTerminator.cpp#475

Even with such change, old SPARC T4 system needs about 178s to shutdown (when there is no ~/.mozilla directorty). So default timeout of 60s isn't enough there. But I can easily extend it using toolkit.asyncshutdown.crash_timeout.

Assignee: nobody → petr.sumbera
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c33e9a3028d3 fix toolkit.asyncshutdown.crash_timeout conversion on some systems r=jstutte
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: