nsUpdateDriver needs to waitpid() all subprocesses before applying update

RESOLVED FIXED in Firefox 19

Status

()

P1
normal
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: cjones, Assigned: cjones)

Tracking

unspecified
B2G C3 (12dec-1jan)
ARM
Gonk (Firefox OS)
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-basecamp:+, firefox19 fixed, firefox20 fixed, b2g18 fixed)

Details

Attachments

(3 attachments, 1 obsolete attachment)

Right now we simply exec() the updater binary over the b2g process to finish applying the background update.  This is subject to the following race-condition-y problem

 1. b2g process forks new subprocess
 2. Immediately afterwards, b2g process exec("updater")s

At this point, CLOEXEC will take out the b2g process's socketpair with the new subprocess.  However, the following two steps can race

 3a. updater moves old /system/b2g directory out of the way
 3b. updater moves new /system/b2g directory into place

 4a. plugin-container exec()s
 4b. plugin-container finishes loading dynamic libraries

If this happens, extremely scary things can happen in the plugin-container.  Most of those things should be harmless crashes, but since Camera and Movie apps will have high privileges we need to defend against that.

The solution is to enumerate the live subprocesses just before we exec() in nsUpdateProcessor and join with all of them.

bb? to triage assignee.
Component: Application Update → Startup and Profile System
Assignee: nobody → dhylands
blocking-basecamp: ? → +
Priority: -- → P1
Target Milestone: --- → B2G C3 (12dec-1jan)
We have a helper to enumerate child processes here

http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.h#82

I think we'll need a new interface ContentParent::KillAndWait() that uses the existing process_util.h helper.  We also need to double check that that helper is bulletproof.
Stealing.
Assignee: dhylands → jones.chris.g
NP. I hadn't started any coding, just looking at where it would be coded. I'm happy to review...
Created attachment 695587 [details] [diff] [review]
Patch

Includes fixes for a couple of other unrelated bugs.  Need to split.
(In reply to Dave Hylands [:dhylands] from comment #3)
> NP. I hadn't started any coding, just looking at where it would be coded.
> I'm happy to review...

No worries.  I was afraid this one would get messy, and it did :/.
Comment on attachment 695587 [details] [diff] [review]
Patch

Review of attachment 695587 [details] [diff] [review]:
-----------------------------------------------------------------

At a conceptual level, this is more or less what was thinking would need to happen.

::: dom/ipc/ContentParent.cpp
@@ +315,5 @@
> +    {
> +        MonitorAutoLock lock(monitor);
> +        while (!done) {
> +            lock.Wait();
> +        }

The only issue I can think of is that this is going to block the MainThread. If the IOThread happened to do something similar (Post a runnable to the main thread) at around the same time, then we could deadlock.
The IO thread isn't allowed to block on the main thread, because the main thread needs to block on the IO thread when sending sync IPC.
Created attachment 695841 [details] [diff] [review]
part 1: Fix the watchdog timeout code

When we're (asynchronously) shutting content processes, their exits result in SIGCHLD being delivered to the parent, which interrupts the sleep.  So we have so far basically always been committing SIGKILL seppuku right after attempting restart.  That's very bad because it means our profile sync may not be completing, data loss issue.
Attachment #695841 - Flags: review?(dhylands)
Created attachment 695842 [details] [diff] [review]
part 2: Add an interface to join all live content processes
Attachment #695587 - Attachment is obsolete: true
Attachment #695842 - Flags: review?(bent.mozilla)
Created attachment 695843 [details] [diff] [review]
part 3: Join all subprocesses before restarting the main process, when we're e.g. about to apply an update

Fixes the "real bug" here.  We only need to do this when restarting, because reboots and poweroffs will use the hardware itself to kill any running processes.
Attachment #695843 - Flags: review?(dhylands)
Comment on attachment 695841 [details] [diff] [review]
part 1: Fix the watchdog timeout code

Review of attachment 695841 [details] [diff] [review]:
-----------------------------------------------------------------

::: hal/linux/LinuxPower.cpp
@@ +79,5 @@
> +      if (sleepSeconds <= 0) {
> +        break;
> +      }
> +      sleep(sleepSeconds);
> +    }

This should use TimeTicks::Now which is a monotonically increasing time, whereas TimeStamp::Now is wall time, which could change (due to timezone changes or the user setting the time).
Attachment #695843 - Flags: review?(dhylands) → review+
(In reply to Dave Hylands [:dhylands] from comment #11)
> Comment on attachment 695841 [details] [diff] [review]
> part 1: Fix the watchdog timeout code
> 
> Review of attachment 695841 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: hal/linux/LinuxPower.cpp
> @@ +79,5 @@
> > +      if (sleepSeconds <= 0) {
> > +        break;
> > +      }
> > +      sleep(sleepSeconds);
> > +    }
> 
> This should use TimeTicks::Now which is a monotonically increasing time,
> whereas TimeStamp::Now is wall time, which could change (due to timezone
> changes or the user setting the time).

TimeStamp is monotonic [1], rather painfully so on some platforms :/.  (I wrote the POSIX clock_gettime(CLOCK_MONOTONIC) implementation, which was delightfully non-painful :).)

[1] http://mxr.mozilla.org/mozilla-central/source/xpcom/ds/TimeStamp.h#158
(In reply to Chris Jones [:cjones] [:warhammer] from comment #12)
> > This should use TimeTicks::Now which is a monotonically increasing time,
> > whereas TimeStamp::Now is wall time, which could change (due to timezone
> > changes or the user setting the time).
> 
> TimeStamp is monotonic [1], rather painfully so on some platforms :/.  (I
> wrote the POSIX clock_gettime(CLOCK_MONOTONIC) implementation, which was
> delightfully non-painful :).)
> 
> [1] http://mxr.mozilla.org/mozilla-central/source/xpcom/ds/TimeStamp.h#158

Right you are - I was confused and was looking at Time::Now and not TimeStamp::Now.
Attachment #695841 - Flags: review?(dhylands) → review+
Comment on attachment 695842 [details] [diff] [review]
part 2: Add an interface to join all live content processes

Review of attachment 695842 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +290,5 @@
> +    // Don't touch any arguments to this function from now on.
> +}
> +
> +/*static*/ void
> +ContentParent::JoinAllSubprocesses()

Nit: I'd assert main thread here.

@@ +299,5 @@
> +        printf_stderr("There are no live subprocesses.");
> +        return;
> +    }
> +
> +    printf_stderr("Subprocesses are still alive.  Doing emergency join.");

Nit: Need a newline.
Attachment #695842 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/661960a9cf89
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-aurora/rev/0a687d4da87a
https://hg.mozilla.org/releases/mozilla-b2g18/rev/d15afe54a3de
status-b2g18: --- → fixed
status-firefox19: --- → fixed
status-firefox20: --- → fixed
You need to log in before you can comment on or make changes to this bug.