B2G Updates: Use a watchdog thread to ensure gecko *really* restarts to apply an update

RESOLVED FIXED in mozilla18

Status

()

Toolkit
Application Update
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: cjones, Assigned: cjones)

Tracking

({dogfood})

unspecified
mozilla18
ARM
Gonk (Firefox OS)
dogfood
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-basecamp:+)

Details

(Whiteboard: [LOE:S])

Attachments

(1 attachment)

We currently have a hacky "self-destruct timer" in place to shut down gecko when some threads hang.  It's not very effective through, because XPCOM shutting down timers and/or the main thread hanging can prevent the bomb from going off.

What we need to do is spawn a thread to sit on the timer and then shut down hard if it doesn't go off.  Ideally we'd like to get a crash report when we do that, but doing so would be pretty dangerous in this state.  For linux, our best approach is likely kill(0, SIGKILL).

I suggest that we add this interface to nsIAppShell, so that widget backends have the flexibility to do their own thing.
This needs to block because the support problems engendered by not having this will be an economic burden, and puts users at risk.
blocking-basecamp: --- → +
The initial plan here is to
 - add another flag to nsIAppStartup:quit, eReallyFuckingQuit or something
 - in nsAppStartup::Quit(), if eReallyFuckingQuit, and #ifdef GONK, kick off a pthread on entering the function to kill(0, SIGKILL)

This is pretty gross, but the alternatives are similarly gross for different reasons.
Gene, want to grab this?  I would recommend taking bug 790527 first, and then implement the watchdog here to ensure we *really* quit/reboot/poweroff.
Depends on: 790527
(In reply to Chris Jones [:cjones] [:warhammer] from comment #3)
> Gene, want to grab this?  I would recommend taking bug 790527 first, and
> then implement the watchdog here to ensure we *really* quit/reboot/poweroff.

Yes! I'm trying to fix bug 790527 first and will come back to this one. Thanks for the suggestions. :)
Whiteboard: [LOE:S]
We had a new regression in xpcom thread shutdown.  Can't let those affect dogfooding.
Assignee: marshall → jones.chris.g
Keywords: dogfood
Created attachment 662413 [details] [diff] [review]
Implement a "really really quit" watchdog to monitor normal shutdown and exit Gecko in a hurry if it takes too long

I don't really know who reviews code around here, but bent, this is the approach we discussed.  /me goes to take a shower.

I was going to add a new flag and then noticed we're helpfully bit-diddling the two quit args into two *bytes*.  I don't to make us low on bits in that first byte just on account of gonk, so I changed the interpretation of the ForceQuit value.
Attachment #662413 - Flags: review?(bent.mozilla)
Comment on attachment 662413 [details] [diff] [review]
Implement a "really really quit" watchdog to monitor normal shutdown and exit Gecko in a hurry if it takes too long

Review of attachment 662413 [details] [diff] [review]:
-----------------------------------------------------------------

This is ok but I'd love to use mfbt/Assertions if possible.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +388,5 @@
> +  // If we can't SIGKILL our process group, something is badly
> +  // wrong.  Trying to deliver a catch-able signal to ourselves can
> +  // invoke signal handlers and might cause problems.  So try
> +  // _exit() and hope we go away.
> +  _exit(1);

Can we reuse MOZ_CRASH from mfbt here? We have about a billion different places where we try to crash in our codebase now...

@@ +397,5 @@
> +{
> +  int32_t timeoutSecs = int32_t(intptr_t(aTimeoutSecs));
> +  if (timeoutSecs < 0 || timeoutSecs > 30) {
> +    QuitHard();
> +    return nullptr;             // not reached

So, instead of doing QuitHard here and again below just reverse the condition and put the sleep inside the if block.

@@ +405,5 @@
> +  // harmlessly reaped by the OS.
> +  sleep(timeoutSecs);
> +
> +  QuitHard();
> +  return nullptr;               // not reached

How about MOZ_NOT_REACHED? At least MOZ_NOT_REACHED_MARKER?
Attachment #662413 - Flags: review?(bent.mozilla) → review+
(In reply to ben turner [:bent] from comment #7)
> Comment on attachment 662413 [details] [diff] [review]
> Implement a "really really quit" watchdog to monitor normal shutdown and
> exit Gecko in a hurry if it takes too long
> 
> Review of attachment 662413 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is ok but I'd love to use mfbt/Assertions if possible.
> 
> ::: toolkit/components/startup/nsAppStartup.cpp
> @@ +388,5 @@
> > +  // If we can't SIGKILL our process group, something is badly
> > +  // wrong.  Trying to deliver a catch-able signal to ourselves can
> > +  // invoke signal handlers and might cause problems.  So try
> > +  // _exit() and hope we go away.
> > +  _exit(1);
> 
> Can we reuse MOZ_CRASH from mfbt here? We have about a billion different
> places where we try to crash in our codebase now...
> 

Ah, but we don't *want* to crash.  See the comment.  MOZ_CRASH() in general can invoke signal handlers, which means another risk of not terminating.

> @@ +397,5 @@
> > +{
> > +  int32_t timeoutSecs = int32_t(intptr_t(aTimeoutSecs));
> > +  if (timeoutSecs < 0 || timeoutSecs > 30) {
> > +    QuitHard();
> > +    return nullptr;             // not reached
> 
> So, instead of doing QuitHard here and again below just reverse the
> condition and put the sleep inside the if block.
> 

OK.

> @@ +405,5 @@
> > +  // harmlessly reaped by the OS.
> > +  sleep(timeoutSecs);
> > +
> > +  QuitHard();
> > +  return nullptr;               // not reached
> 
> How about MOZ_NOT_REACHED? At least MOZ_NOT_REACHED_MARKER?

I still need to |return nullptr| here to make some compilers shut up.  Do you know that mfbt has implementations of MOZ_NOT_REACHED that expand to |noreturn| semantics for all compilers?
https://hg.mozilla.org/mozilla-central/rev/a81233955b61
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
You need to log in before you can comment on or make changes to this bug.