Force Firefox crash if shutdown hangs

RESOLVED FIXED in mozilla34

Status

()

defect
RESOLVED FIXED
5 years ago
a year ago

People

(Reporter: vladan, Assigned: Yoric)

Tracking

(Depends on 2 bugs, {perf})

unspecified
mozilla34
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 7 obsolete attachments)

Reporter

Description

5 years ago
Firefox sometimes deadlocks on shutdown, preventing a restart and causing users frustration. We should have a timer with a high threshold (99th percentile of shutdown time) to kill Firefox with exit(0) if shutdown hasn't made progress in a long time. 

If the main thread isn't hung, we can also try to force cleanup (e.g. saving Places) before making the exit(0) call.
Note to self: need to check how reliable our Telemetry data on shutdown is. If it is reliable, we can probably use something along the lines of "Longest successful shutdown ever" * 1.05.

Updated

5 years ago
OS: Windows 7 → All
Hardware: x86_64 → All
Actually, note to self: if we don't want to break AsyncShutdown, we need to wait for "Longest successful shutdown ever" + 1 minute + \epsilon.
Still brainstorming.

1. What runs the Terminator? A XPCOM/Chromium/PR thread or a process?
- XPCOM thread is good for communications, but blocks shutdown, so that's probably a bad idea;
- I have no experience with Chromium threads, but I seem to understand that they have their own event loop, which looks a tad overkill – no idea whether they block shutdown;
- PR thread might be sufficient, if we are careful with communications;
- getting a process right on all platforms sounds nightmarish.

So I'm tempted to go with PRThread.


2. When do we spawn?
I'd go for "quit-application".


3. When do we kill?
a. Start a timer when we reach quit-application. If we haven't reached profile-before-change after N seconds, terminate Firefox.
b. Restart the timer. If we haven't reached xpcom-will-shutdown after N seconds, terminate Firefox.
c. Restart the timer. If we haven't reached xpcom-shutdown after N seconds, terminate Firefox.
d. Restart the timer. After N seconds, terminate Firefox. Hopefully, by then, the process has been killed. 


4. How many seconds?

We should use AsyncShutdown-style timers to make sure that we we don't kill Firefox simply because the lid of the laptop has been closed.

Unfortunately, I don't think that there is a good way to measure the duration of profile-before-change => xpcom-will-shutdown or xpcom-will-shutdown => xpcom-shutdown or xpcom-shutdown => exit. Well, there may be ways, but they strongly sound like overengineering (i.e. write durations to the Registry Base under Windows, I'm not sure where on other platforms).

Also, if we make Terminator faster than AsyncShutdown, we will lose all the precious crash data. I there believe that we should make Terminator adopt a "AsyncShutdown + 5 seconds" policy.

5. Crash or exit(0)?
exit(0) sounds like the right thing for release, but we may decide to go for crashing in Nightly, Aurora, early Beta, so as to get statistics.
As a side-note, for 5., we have flag EARLY_BETA_OR_EARLIER.

Comment 5

5 years ago
Should be an unjoinable PRThread, I think.

I suggest that we start the timer at http://hg.mozilla.org/mozilla-central/annotate/a74600665875/toolkit/xre/nsAppRunner.cpp#l4014 right after the main event loop exits. This will cover all of profile shutdown and subsequent steps.

I'm not sure what AsyncShutdown-style timers means in this context, but yes, we should try to make sure that machine sleep doesn't affect it.

Whatever the actual shutdown mechanism is, it must not run c++ static destructors, so exit() is out. I think that _exit() also may be out on Windows, but I'm not sure about other platforms. I think we should start out with MOZ_CRASH and later we can decide to switch it to something more silent like TerminateProcess.
By AsyncShutdown-style timer, I mean "wait for at least one second, then once we are awake, wait again for at least one second, etc." N times.
C++ 11 offers a `quick_exit` function that does not run object destructors. It is not specified whether temporary files are removed and streams are flushed. I *think* that's ok.
Compiler support may be lacking for that, unfortunately. In particular, no version of Visual Studio (up to and including 2013) supports quick_exit [1].

[1] http://msdn.microsoft.com/en-us/library/hh567368.aspx#concurrencytable
Thanks Emanuel.

I'll try and produce a first version that runs `_exit(0)` and see if we can iterate on this. MSDN wasn't clear on whether this runs static destructors, I'll experiment.

Also, I have early sketches on how we could trigger Places shutdown, etc, but this looks like lots of work, plus the simplest way I have in mind goes through bug 918317. I believe that we can get a first iteration working without this. With a large enough delay, I don't see how this could create datalosses.
Attaching a first minimal prototype.
If this minimal feature set is sufficient, I intend to work on attempts to rescue shutdown (i.e. force Places to be closed as cleanly as possible) as a followup bug.

This implementation uses `_exit(0)` for the time being, until I have time to test its behavior under Windows and/or try and come up with something better.
Attachment #8462024 - Flags: feedback?(vdjeric)
Attachment #8462024 - Flags: feedback?(nfroyd)
Attachment #8462024 - Flags: feedback?(benjamin)

Comment 11

5 years ago
Random suggestion from a someone just happy to finally see this get implemented: append a timestamp to each of the ShutdownProgress crash report annotations
Dave: what's the usecase for this timestamp?

Comment 13

5 years ago
(In reply to David Rajchenbach Teller [:Yoric] from comment #12)
> Dave: what's the usecase for this timestamp?

It would give a clear timeline for the shutdown hang/crash in a crash report. Without it, there's no way to tell how long of a hang the report is for. Someone could have a toolkit.asyncshutdown.crash_timeout set too low and be submitting reports for no practical reason. Timestamping each phase would also allow basic telemetry to see if some reports are slow for everything or just hang on one phase.
This sounds like a good plan. Filed as bug 1044020.
Comment on attachment 8462024 [details] [diff] [review]
Implementation of nsTerminator

Review of attachment 8462024 [details] [diff] [review]:
-----------------------------------------------------------------

Just some small comments.  I was going to grouse about yet another watchdog thread, but I guess this one is only active during shutdown, so it's not too bad.

::: toolkit/components/terminator/moz.build
@@ +15,5 @@
> +EXTRA_COMPONENTS += [
> +    'terminator.manifest',
> +]
> +
> +FINAL_LIBRARY = 'toolkitcomps'

I think this is now supposed to be 'xul' everywhere or somesuch.  Check with glandium.

::: toolkit/components/terminator/nsTerminator.cpp
@@ +7,5 @@
> +/**
> + * A watchdog designed to terminate shutdown if it lasts too long.
> + *
> + * This watchdog is designed as a worst-case problem container for the
> + * common case in which Firefox just won't shutdown. 

Nit: trailing whitespace here and elsewhere.

@@ +10,5 @@
> + * This watchdog is designed as a worst-case problem container for the
> + * common case in which Firefox just won't shutdown. 
> + *
> + * We spawn a thread during quit-application. If any of the shutdown
> + * steps takes more than n millisecons (63000 by default), kill the

Nit: milliseconds.

@@ +29,5 @@
> +#include "mozilla/Preferences.h"
> +#include "mozilla/Scoped.h"
> +#include "mozilla/Services.h"
> +#include "mozilla/ArrayUtils.h"
> +#include "mozilla/DebugOnly.h"

Nit: Sort these?

@@ +31,5 @@
> +#include "mozilla/Services.h"
> +#include "mozilla/ArrayUtils.h"
> +#include "mozilla/DebugOnly.h"
> +
> +#include "nsTerminator.h"

You are cargo-culting includes from this file into your .h file.  Maybe #include nsTerminator.h first?

@@ +61,5 @@
> +  ScopedDeletePtr<Options> options((Options*)arg);
> +  int32_t crashAfterMS = options->crashAfterMS;
> +  options.dispose();
> +
> +  int timeToLive = crashAfterMS;

We should probably just make this int32_t, to match crashAfterMS.

@@ +105,5 @@
> +  "quit-application"
> +, "profile-change-teardown"
> +, "profile-before-change"
> +, "xpcom-will-shutdown"
> +, "xpcom-shutdown"

Nit: this formatting is unusual.  Better might be to just include the trailing comma always:

"quit-application",
...
"xpcom-shutdown",

which preserves the one-line-of-diff-per-addition property.

@@ +111,5 @@
> +
> +NS_IMPL_ISUPPORTS(
> +  nsTerminator
> +, nsIObserver
> +)

Just one line for this, please.

@@ +156,5 @@
> +    &crashAfterMS);
> +  if (NS_SUCCEEDED(rv)) {
> +    // Add a little padding, to ensure that we do not crash before
> +    // AsyncShutdown.
> +    crashAfterMS *= 1.05;

Probably should make sure this doesn't overflow.  (Conversion from doubles that are too large is undefined, I believe.)

@@ +159,5 @@
> +    // AsyncShutdown.
> +    crashAfterMS *= 1.05;
> +  } else {
> +    crashAfterMS = DEFAULT_CRASH_AFTER_MS;
> +  }

Should check that crashAfterMS is positive, too.

@@ +204,5 @@
> +
> +#if defined(MOZ_CRASHREPORTER)
> +  // In case of crash, we wish to know where in shutdown we are
> +  nsCOMPtr<nsICrashReporter> reporter =
> +    do_GetService("@mozilla.org/toolkit/crash-reporter;1");

NS_CRASHREPORTER_CONTRACTID?

::: toolkit/crashreporter/test/unit/xpcshell.ini
@@ +31,5 @@
>  skip-if = os=='linux' && bits==32
>  
>  [test_crash_AsyncShutdown.js]
>  [test_event_files.js]
> +[test_crash_terminator.js]

Missing hg add?
Attachment #8462024 - Flags: feedback?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #15)
> @@ +111,5 @@
> > +
> > +NS_IMPL_ISUPPORTS(
> > +  nsTerminator
> > +, nsIObserver
> > +)
> 
> Just one line for this, please.

Side-note: apparently, this is the style, these days.

> Missing hg add?

Oops. Let me add this as a differential patch.
Posted patch 2. Forgotten test file (obsolete) — Splinter Review
Attachment #8462784 - Flags: feedback?(vdjeric)
Attachment #8462784 - Flags: feedback?(nfroyd)
Attachment #8462784 - Flags: feedback?(benjamin)
(In reply to David Rajchenbach Teller [:Yoric] (PTO + various activities, hard to reach until August 20th - please use "needinfo") from comment #16)
> (In reply to Nathan Froyd (:froydnj) from comment #15)
> > @@ +111,5 @@
> > > +
> > > +NS_IMPL_ISUPPORTS(
> > > +  nsTerminator
> > > +, nsIObserver
> > > +)
> > 
> > Just one line for this, please.
> 
> Side-note: apparently, this is the style, these days.

I think you have been spending too much time in places/ and storage/ code.  Those are the only modules that use that style, according to grep.
Comment on attachment 8462784 [details] [diff] [review]
2. Forgotten test file

Review of attachment 8462784 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/test/unit/test_crash_terminator.js
@@ +18,5 @@
> +    createInstance(Components.interfaces.nsIObserver);
> +  terminator.observe(null, "profile-after-change", null);
> +
> +  // Inform the terminator that shutdown has started
> +  // Pick an arbitray notification

Nit: 'arbitrary'

@@ +19,5 @@
> +  terminator.observe(null, "profile-after-change", null);
> +
> +  // Inform the terminator that shutdown has started
> +  // Pick an arbitray notification
> +  terminator.observe(null, "xpcom-will-shutdown", null);

Maybe pull this notification name out into:

const NOTIFICATION = ...;

or something and use it below, too?

@@ +22,5 @@
> +  // Pick an arbitray notification
> +  terminator.observe(null, "xpcom-will-shutdown", null);
> +
> +  while(true) {
> +    dump("Waiting until crash\n");

Maybe pull this out of the loop, so you're not spamming the log if we happen to process many events?
Attachment #8462784 - Flags: feedback?(nfroyd) → feedback+
Reporter

Comment 20

5 years ago
Comment on attachment 8462024 [details] [diff] [review]
Implementation of nsTerminator

Review of attachment 8462024 [details] [diff] [review]:
-----------------------------------------------------------------

- Will we be able to measure how often (% of shutdowns) this Terminator triggers using Breakpad? I suspect not.. so can you file another bug to Telemetry reporting similar to failedProfileLockCount?

::: toolkit/components/terminator/nsTerminator.cpp
@@ +53,5 @@
> +};
> +
> +void run(void* arg)
> +{
> +  PR_SetCurrentThreadName("Terminator");

Let's give it a more descriptive title.. like "Shutdown hang terminator"

@@ +81,5 @@
> +      timeToLive = crashAfterMS;
> +      continue;
> +    }
> +    timeToLive -= TICK_DURATION;
> +    if (timeToLive > 0) {

Nit: >=

@@ +89,5 @@
> +    // Shutdown is apparently dead. Kill the process.
> +    MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");
> +
> +    // Oh, we are in release mode. Well, let's not crash then.
> +    NS_WARNING("Shutdown too long, probably frozen, terminating the process.");

MOZ_CRASH doesn't cause a crash on Release?

@@ +185,5 @@
> +
> +NS_IMETHODIMP
> +nsTerminator::Observe(nsISupports *, const char *aTopic, const char16_t *)
> +{
> +  if (strcmp(aTopic, "profile-after-change") == 0) {

What do you think about spawning the terminator thread on first sign of shutdown?

::: toolkit/components/terminator/terminator.manifest
@@ +1,1 @@
> +category profile-after-change nsTerminator @mozilla.org/toolkit/shutdown-terminator;1

How much startup overhead does this create?
Attachment #8462024 - Flags: feedback?(vdjeric) → feedback+
Reporter

Comment 21

5 years ago
Comment on attachment 8462784 [details] [diff] [review]
2. Forgotten test file

Review of attachment 8462784 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/test/unit/test_crash_terminator.js
@@ +8,5 @@
> +function setup_crash() {
> +  Components.utils.import("resource://gre/modules/Services.jsm");
> +
> +  Services.prefs.setBoolPref("toolkit.terminator.testing", true);
> +  Services.prefs.setIntPref("toolkit.asyncshutdown.crash_timeout", 10);

Why not have a shorter timeout so the test runs quicker?
Attachment #8462784 - Flags: feedback?(vdjeric) → feedback+
(In reply to Vladan Djeric (:vladan) from comment #21)
> Why not have a shorter timeout so the test runs quicker?

Shorter than 10 milliseconds? Sure, but that's not going to change much.

(In reply to Vladan Djeric (:vladan) from comment #20)
> - Will we be able to measure how often (% of shutdowns) this Terminator
> triggers using Breakpad? I suspect not.. so can you file another bug to
> Telemetry reporting similar to failedProfileLockCount?

For each version, we should be able to compute this as
 # of Terminator Crashes with Telemetry enabled / # of runs with Telemetry enabled

I'm not sure whether the CrashReporter knows whether Telemetry is enabled, but that's pretty easy to add.

> @@ +89,5 @@
> > +    // Shutdown is apparently dead. Kill the process.
> > +    MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");
> > +
> > +    // Oh, we are in release mode. Well, let's not crash then.
> > +    NS_WARNING("Shutdown too long, probably frozen, terminating the process.");
> 
> MOZ_CRASH doesn't cause a crash on Release?

My bad, I was confusing the behavior of MOZ_CRASH and that of MOZ_ASSERT.

> 
> @@ +185,5 @@
> > +
> > +NS_IMETHODIMP
> > +nsTerminator::Observe(nsISupports *, const char *aTopic, const char16_t *)
> > +{
> > +  if (strcmp(aTopic, "profile-after-change") == 0) {
> 
> What do you think about spawning the terminator thread on first sign of
> shutdown?

Isn't that what I'm doing?

> 
> ::: toolkit/components/terminator/terminator.manifest
> @@ +1,1 @@
> > +category profile-after-change nsTerminator @mozilla.org/toolkit/shutdown-terminator;1
> 
> How much startup overhead does this create?

Good question. I don't think we have any measurement on that topic.
Reporter

Comment 23

5 years ago
> Shorter than 10 milliseconds? Sure, but that's not going to change much.

My mistake, I thought the toolkit.asyncshutdown.crash_timeout pref was in units of seconds.

> For each version, we should be able to compute this as
>  # of Terminator Crashes with Telemetry enabled / # of runs with Telemetry
> enabled
> 
> I'm not sure whether the CrashReporter knows whether Telemetry is enabled,
> but that's pretty easy to add.

That works but then we can't do Telemetry Alerting for it. How hard is it to get the # of Terminator Crashes with Telemetry enabled?

> Isn't that what I'm doing?

My mistake, I thought SelfInit creates the thread.

> Good question. I don't think we have any measurement on that topic.

ts_paint can help answer that question (at least partially). Roberto also has the ts_cold_paint test which apparently isn't noisy on SSD systems
(In reply to Vladan Djeric (:vladan) from comment #23)
> > For each version, we should be able to compute this as
> >  # of Terminator Crashes with Telemetry enabled / # of runs with Telemetry
> > enabled
> > 
> > I'm not sure whether the CrashReporter knows whether Telemetry is enabled,
> > but that's pretty easy to add.
> 
> That works but then we can't do Telemetry Alerting for it. How hard is it to
> get the # of Terminator Crashes with Telemetry enabled?

Since the scope of the Terminator covers cases that occur after Telemetry (and Profile) shutdown, this won't work without an awful hack, such as writing the data in the registry base, which is a bit annoying.

> > Good question. I don't think we have any measurement on that topic.
> 
> ts_paint can help answer that question (at least partially). Roberto also
> has the ts_cold_paint test which apparently isn't noisy on SSD systems

Note that we are using this mechanism in a few places over the tree, plus most add-ons:
http://dxr.mozilla.org/mozilla-central/search?q=profile-after-change+ext%3Amanifest&case=true&redirect=true
Reporter

Comment 25

5 years ago
> Since the scope of the Terminator covers cases that occur after Telemetry
> (and Profile) shutdown, this won't work without an awful hack, such as
> writing the data in the registry base, which is a bit annoying.

I was suggesting doing something similar to what failedProfileLockCount does: the Terminator could write a file in the profile directory during abnormal exit, and then Telemetry could pick it up during the following session.

> Note that we are using this mechanism in a few places over the tree, plus
> most add-ons:
> http://dxr.mozilla.org/mozilla-central/search?q=profile-after-
> change+ext%3Amanifest&case=true&redirect=true

Yeah, I saw it was very common, that's why I was curious what the cost might be. I doubt the startup cost consideration would block this feature.
(In reply to Vladan Djeric (:vladan) from comment #25)
> I was suggesting doing something similar to what failedProfileLockCount
> does: the Terminator could write a file in the profile directory during
> abnormal exit, and then Telemetry could pick it up during the following
> session.

This is possible, but it would violate our invariant that we do not write to the profile directory after profile-before-change2. Before proceeding in this direction, I would like bsmedberg's opinion.
Flags: needinfo?(benjamin)
(In reply to Nathan Froyd (:froydnj) from comment #19)
> @@ +19,5 @@
> > +  terminator.observe(null, "profile-after-change", null);
> > +
> > +  // Inform the terminator that shutdown has started
> > +  // Pick an arbitray notification
> > +  terminator.observe(null, "xpcom-will-shutdown", null);
> 
> Maybe pull this notification name out into:
> 
> const NOTIFICATION = ...;
> 
> or something and use it below, too?


This would be a bit tricky as `do_crash` passes `setup_crash` by source. To do this, I would need to handle serialization, which sounds a bit overkill.
Applied feedback.
Attachment #8462024 - Attachment is obsolete: true
Attachment #8462784 - Attachment is obsolete: true
Attachment #8462024 - Flags: feedback?(benjamin)
Attachment #8462784 - Flags: feedback?(benjamin)
Attachment #8467816 - Flags: feedback?(benjamin)
Comment on attachment 8467816 [details] [diff] [review]
Implementation of nsTerminator, v2

>diff --git a/toolkit/components/terminator/nsTerminator.cpp b/toolkit/components/terminator/nsTerminator.cpp
>new file mode 100644
>--- /dev/null
>+++ b/toolkit/components/terminator/nsTerminator.cpp
>@@ -0,0 +1,228 @@
>+/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*-

should be tab-width: 8

>+ * the Terminator thread, once it has acknolwedged the progress.

nit spelling

>+void run(void* arg)

style nit, void on its own line.

>+#if defined(EARLY_BETA_OR_EARLIER)
>+
>+    // Shutdown is apparently dead. Crash the process.
>+    MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");
>+
>+#else
>+
>+    // Shutdown is apparently dead. Kill the process.
>+    NS_WARNING("Shutdown too long, probably frozen, terminating the process.");
>+
>+    // Exit, running as little cleanup as possible.
>+    // Note that this will leave the profile locked.
>+    _exit(0);

Are the experiments from comment 9 done? I think that we should start by landing this only with MOZ_CRASH.

>+static const char* sObserverTopics[] = {

should be static char const *const sObserverTopics[].

>+// Actually launch the thread. This takes place at the first sign of shutdown.
>+nsresult
>+nsTerminator::Start() {

nit, brace in column 0. Why does this return an nsresult when the only caller doesn't error-check? Prefer void.

>+  // Determine how long we need to wait
>+
>+  nsresult rv;
>+  nsCOMPtr<nsIPrefService> prefs =
>+    do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
>+  if (!NS_SUCCEEDED(rv)) {
>+    return rv;
>+  }
>+
>+  nsCOMPtr<nsIPrefBranch> branch;
>+  rv = prefs->GetBranch("", getter_AddRefs(branch));
>+  if (!NS_SUCCEEDED(rv)) {
>+    return rv;
>+  }
>+
>+  int32_t crashAfterMS;
>+  rv = branch->GetIntPref("toolkit.asyncshutdown.crash_timeout",
>+    &crashAfterMS);

Replace all this with Preferences::GetInt

>+  if (NS_SUCCEEDED(rv)) {
>+    // Add a little padding, to ensure that we do not crash before
>+    // AsyncShutdown.
>+    crashAfterMS += 3000;
>+    MOZ_ASSERT(crashAfterMS > 0);
>+  } else {
>+    crashAfterMS = DEFAULT_CRASH_AFTER_MS;
>+  }
>+
>+  ScopedDeletePtr<Options> options(new Options());
>+  options->crashAfterMS = crashAfterMS;
>+
>+  // Allocate and start the thread.
>+  // By design, it will never finish, nor be deallocated.
>+  PRThread* thread = PR_CreateThread(
>+    PR_SYSTEM_THREAD, /* This thread will not prevent the process from terminating */
>+    run,
>+    options.forget(),
>+    PR_PRIORITY_LOW,
>+    PR_LOCAL_THREAD /* FIXME: ?*/,

This should be PR_GLOBAL_THREAD. Doesn't really matter on any of our configs, but local threads can be cooperatively scheduled by NSPR instead of scheduled by the OS, which is definitely not what we want here.

>+  if (reporter) {
>+    (void)reporter->AnnotateCrashReport(NS_LITERAL_CSTRING("ShutdownProgress"),
>+                                        nsAutoCString(aTopic));

Prefer unused << foo(); over (void)foo here and below.
Attachment #8467816 - Flags: feedback?(benjamin) → feedback+

Updated

5 years ago
Flags: needinfo?(benjamin)
Posted file Experiment (obsolete) —
I finally found the opportunity to run the experiments. Apparently, _exit(0) works as we hoped with VC++. Benjamin, Vladan, can you confirm that this experiment covers all the use cases we have?
Attachment #8475046 - Flags: feedback?(vdjeric)
Attachment #8475046 - Flags: feedback?(benjamin)
The testcase also needs to cover loading a DLL and seeing whether DllMain/static destructors from that external DLL are called with _exit.

Updated

5 years ago
Attachment #8475046 - Flags: feedback?(benjamin)
I think that we can land a version based on MOZ_CRASH. Once we have this, we'll be able to get some data, plus progress with bug 1044020. In parallel, we can work on a binary test that somehow checks that _exit(0) works nicely on all tier 1 desktop platforms (not sure how we can load a DLL in a binary test, though), and move to _exit(0) once we are satisfied with this.

Does this plan sound good to everyone?
Flags: needinfo?(vdjeric)
Flags: needinfo?(benjamin)
Yes, use MOZ_CRASH for now and let's do a followup about a safe termination system for this and for bug 662444.
Flags: needinfo?(benjamin)
Summary: Force Firefox exit(0) if shutdown hangs → Force Firefox crash if shutdown hangs
Reporter

Comment 34

5 years ago
Sounds good to me. Also please talk to aklotz about any possible interactions with his profile unlocker bug 286355
Flags: needinfo?(vdjeric)
Reporter

Comment 35

5 years ago
Comment on attachment 8475046 [details]
Experiment

Experiment looks complete aside from Benjamin's point about DLLs.

I did some googling and found that "whether C streams are closed and/or flushed, and files open with tmpfile are removed [during _exit(0)] depends on the particular system or library implementation." but I don't think we need to worry about this
Attachment #8475046 - Flags: feedback?(vdjeric) → feedback+
Here we go.
Attachment #8467816 - Attachment is obsolete: true
Attachment #8476545 - Flags: review?(nfroyd)
Comment on attachment 8476545 [details] [diff] [review]
Implementation of nsTerminator, v3

Review of attachment 8476545 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/terminator/nsTerminator.cpp
@@ +63,5 @@
> +  int32_t crashAfterMS;
> +};
> +
> +void
> +run(void* arg)

Nit: "Run"?

@@ +69,5 @@
> +  PR_SetCurrentThreadName("Shutdown Hang Terminator");
> +
> +  // Let's copy and deallocate options, that's one less leak to worry
> +  // about.
> +  ScopedDeletePtr<Options> options((Options*)arg);

Please use UniquePtr.
Attachment #8476545 - Flags: review?(nfroyd) → review+
Comment on attachment 8476545 [details] [diff] [review]
Implementation of nsTerminator, v3

Ted, can you review the crash test and the build parts?
Attachment #8476545 - Flags: review?(ted)
Sorry, forgot to upload the updated version.
Ted, can you look at build + crash test?

Try: https://tbpl.mozilla.org/?tree=Try&rev=5ac2051439b1
Attachment #8476545 - Attachment is obsolete: true
Attachment #8476545 - Flags: review?(ted)
Attachment #8477345 - Flags: review?(ted)
(sorry, that was an empty commit)
Attachment #8477345 - Attachment is obsolete: true
Attachment #8477345 - Flags: review?(ted)
Attachment #8477346 - Flags: review?(ted)
Attachment #8477346 - Flags: superreview?(benjamin)
Comment on attachment 8477346 [details] [diff] [review]
Implementation of nsTerminator, v5

Review of attachment 8477346 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/terminator/nsTerminator.cpp
@@ +194,5 @@
> +  nsCOMPtr<nsICrashReporter> reporter =
> +    do_GetService(NS_CRASHREPORTER_CONTRACTID);
> +  if (reporter) {
> +    unused << reporter->AnnotateCrashReport(NS_LITERAL_CSTRING("ShutdownProgress"),
> +                                            nsAutoCString(aTopic));

You can just #include "nsExceptionHandler.h" and call CrashReporter::AnnotateCrashReport without any xpcom goop.

::: toolkit/crashreporter/test/unit/test_crash_terminator.js
@@ +8,5 @@
> +function setup_crash() {
> +  Components.utils.import("resource://gre/modules/Services.jsm");
> +
> +  Services.prefs.setBoolPref("toolkit.terminator.testing", true);
> +  Services.prefs.setIntPref("toolkit.asyncshutdown.crash_timeout", 10);

Is this 10 seconds? That feels like a long time to make a test wait. Can this be shorter?

@@ +12,5 @@
> +  Services.prefs.setIntPref("toolkit.asyncshutdown.crash_timeout", 10);
> +
> +  // Initialize the terminator
> +  // (normally, this is done through the manifest file, but xpcshell
> +  // doesn't take them into account).

This seems odd. Is this just a quirk of xpcshell startup?
Attachment #8477346 - Flags: review?(ted) → review+
Applied Ted's feedback about the crash manager.
Benjamin, do you wish to sr this?
Attachment #8475046 - Attachment is obsolete: true
Attachment #8477346 - Attachment is obsolete: true
Attachment #8477346 - Flags: superreview?(benjamin)
Attachment #8479000 - Flags: superreview?(benjamin)
Comment on attachment 8479000 [details] [diff] [review]
Implementation of nsTerminator, v6

Only two comments:

Is the asyncshutdown timeout small enough that it will always run first before this?

Wouldn't it make sense to have a final crash reporter annotation to mark the fact that we are intentionally crashing, rather than having to infer it from the stack/signature? The ShutdownProgress annotation isn't sufficient for that since other crashes which happen during shutdown will have that annotation.
Attachment #8479000 - Flags: superreview?(benjamin) → superreview+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #44)
> Comment on attachment 8479000 [details] [diff] [review]
> Implementation of nsTerminator, v6
> 
> Only two comments:
> 
> Is the asyncshutdown timeout small enough that it will always run first
> before this?

Normally, yes.

> Wouldn't it make sense to have a final crash reporter annotation to mark the
> fact that we are intentionally crashing, rather than having to infer it from
> the stack/signature? The ShutdownProgress annotation isn't sufficient for
> that since other crashes which happen during shutdown will have that
> annotation.

I was counting on the MOZ_CRASH message for this. Otherwise, since `AnnotateCrashReport` is thread-safe only in the chrome process, we would need to complicate the code to handle semi-satisfactorily the content process.
You can feel free to add a new API that just sets a boolean or something, we have some of those already.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #42)
> Is this 10 seconds? That feels like a long time to make a test wait. Can
> this be shorter?

That's 10 milliseconds, actually. Anything > 0 would do.

> @@ +12,5 @@
> This seems odd. Is this just a quirk of xpcshell startup?

That's my understanding.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46)
> You can feel free to add a new API that just sets a boolean or something, we
> have some of those already.

That sounds like a followup bug. For the moment, I really have no idea how I can make a thread-safe call to the crash reporter on a content process without assuming that the main thread hasn't frozen.

Try: https://tbpl.mozilla.org/?tree=Try&rev=ef25c3727616
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/b1223e628e8f
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Depends on: 1103833
Depends on: 1143958

Updated

4 years ago
See Also: → 1149287

Updated

4 years ago
Depends on: 1206393
No longer depends on: 1206393

Updated

3 years ago
Duplicate of this bug: 518683
You need to log in before you can comment on or make changes to this bug.